readseg bug?


readseg bug?

Florent Gluck
Hi all,

I've noticed that when doing a segment dump using readseg, several
instances of the same CrawlDatum can be present in a given record.
For example, I have a segment containing a single URL (http://www.moma.org);
the dump is below.  I ran the following command:  nutch readseg
-dump segments/20070517113941 segdump -nocontent -noparsedata -noparsetext

Here is the first record:

Recno:: 0
URL:: http://www.moma.org/

CrawlDatum::
Version: 5
Status: 1 (db_unfetched)
Fetch time: Thu May 17 11:39:34 EDT 2007
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0
Signature: null
Metadata: _ngt_:1179416381663

CrawlDatum::
Version: 5
Status: 65 (signature)
Fetch time: Thu May 17 11:39:51 EDT 2007
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 0.0 days
Score: 1.0
Signature: fe47b3db7c988541287fc6412ce0b923
Metadata: null

CrawlDatum::
Version: 5
Status: 33 (fetch_success)
Fetch time: Thu May 17 11:39:49 EDT 2007
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0
Signature: fe47b3db7c988541287fc6412ce0b923
Metadata: _ngt_:1179416381663 _pst_:success(1), lastModified=0

Why are there 3 CrawlDatum fields?
I assumed there would be only one CrawlDatum with status 33 (fetch_success).
What is the purpose of the other two?

Now, here is the 5th record:

Recno:: 5
URL:: http://www.moma.org/application/x-shockwave-flash

CrawlDatum::
Version: 5
Status: 67 (linked)
Fetch time: Thu May 17 11:39:51 EDT 2007
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 30.0 days
Score: 0.03846154
Signature: null
Metadata: null

CrawlDatum::
Version: 5
Status: 67 (linked)
Fetch time: Thu May 17 11:39:51 EDT 2007
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 30.0 days
Score: 0.03846154
Signature: null
Metadata: null

CrawlDatum::
Version: 5
Status: 67 (linked)
Fetch time: Thu May 17 11:39:51 EDT 2007
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 30.0 days
Score: 0.03846154
Signature: null
Metadata: null

CrawlDatum::
Version: 5
Status: 67 (linked)
Fetch time: Thu May 17 11:39:51 EDT 2007
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 30.0 days
Score: 0.03846154
Signature: null
Metadata: null

CrawlDatum::
Version: 5
Status: 67 (linked)
Fetch time: Thu May 17 11:39:51 EDT 2007
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 30.0 days
Score: 0.03846154
Signature: null
Metadata: null

CrawlDatum::
Version: 5
Status: 67 (linked)
Fetch time: Thu May 17 11:39:51 EDT 2007
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 30.0 days
Score: 0.03846154
Signature: null
Metadata: null


There are 6 CrawlDatum fields and all of them are exactly identical.
Is this a bug or am I missing something here?

Any light on this matter would be greatly appreciated.
Thank you.

Florent

Re: readseg bug?

Doğacan Güney
Hi,

On 5/17/07, Florent Gluck <[hidden email]> wrote:
> Hi all,
>
> I've noticed that when doing a segment dump using readseg, several
> instances of the same CrawlDatum can be present in a given record.
> For example I have a segment with one single url (http://www.moma.org)
> and here is the dump below.  I ran the following command:  nutch readseg
> -dump segments/20070517113941 segdump -nocontent -noparsedata -noparsetext

With this command, readseg reads from crawl_{fetch,generate,parse}.
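[Editor's note: the following is an illustrative sketch, not Nutch source code. The point Doğacan makes is that `readseg -dump` effectively joins the segment's sub-directories (crawl_generate, crawl_fetch, crawl_parse) on URL, so one dump record collects one CrawlDatum from each part the URL appears in. The datum contents below are hypothetical stand-ins.]

```python
# Sketch of the join readseg's dump performs: group every (url, datum)
# pair from each segment part under its URL, like a map-reduce grouping step.
from collections import defaultdict

# Hypothetical minimal stand-ins for the per-part CrawlDatum entries.
crawl_generate = {"http://www.moma.org/": {"status": "db_unfetched"}}
crawl_fetch    = {"http://www.moma.org/": {"status": "fetch_success"}}
crawl_parse    = {"http://www.moma.org/": {"status": "signature"}}

def dump_records(*parts):
    """Collect every datum for a URL into one record, one entry per part."""
    records = defaultdict(list)
    for part in parts:
        for url, datum in part.items():
            records[url].append(datum)
    return dict(records)

records = dump_records(crawl_generate, crawl_fetch, crawl_parse)
print(len(records["http://www.moma.org/"]))  # 3 -- one CrawlDatum per part
```

This is why the first record shows three CrawlDatum entries for a single URL: each segment part contributes its own.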

>
> Here is the first record:
>
> Recno:: 0
> URL:: http://www.moma.org/
>
> CrawlDatum::
> Version: 5
> Status: 1 (db_unfetched)
> Fetch time: Thu May 17 11:39:34 EDT 2007
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 30.0 days
> Score: 1.0
> Signature: null
> Metadata: _ngt_:1179416381663

This one is from crawl_generate; you can tell because it contains an _ngt_
field. This datum is read by the fetcher.

>
> CrawlDatum::
> Version: 5
> Status: 65 (signature)
> Fetch time: Thu May 17 11:39:51 EDT 2007
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 0.0 days
> Score: 1.0
> Signature: fe47b3db7c988541287fc6412ce0b923
> Metadata: null

This one is from crawl_parse. It contains the signature of the parse text,
which is used to dedup after indexing.
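[Editor's note: a minimal sketch of how a content signature drives deduplication. This is not Nutch's Signature implementation; the URLs and page bodies are hypothetical.]

```python
import hashlib

def signature(text: str) -> str:
    """MD5 over the text, comparable to the hex signatures in the dump."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

# Hypothetical pages: two with identical text are duplicates of each other.
pages = {
    "http://example.org/a": "same body",
    "http://example.org/b": "same body",
    "http://example.org/c": "different body",
}

seen, unique = set(), []
for url, text in pages.items():
    sig = signature(text)
    if sig not in seen:      # keep only the first URL seen per signature
        seen.add(sig)
        unique.append(url)
print(unique)  # ['http://example.org/a', 'http://example.org/c']
```

Two fetched pages with the same parse text produce the same signature, so one of them can be dropped at dedup time.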

>
> CrawlDatum::
> Version: 5
> Status: 33 (fetch_success)
> Fetch time: Thu May 17 11:39:49 EDT 2007
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 30.0 days
> Score: 1.0
> Signature: fe47b3db7c988541287fc6412ce0b923
> Metadata: _ngt_:1179416381663 _pst_:success(1), lastModified=0
>

This is from crawl_fetch.

> Why are there 3 CrawlDatum fields?
> I assumed there would be only one CrawlDatum with status 33 (fetch_success).
> What is the purpose of the other two?
>
> Now, here is the 5th record:
>
> Recno:: 5
> URL:: http://www.moma.org/application/x-shockwave-flash
>
> CrawlDatum::
> Version: 5
> Status: 67 (linked)
> Fetch time: Thu May 17 11:39:51 EDT 2007
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 30.0 days
> Score: 0.03846154
> Signature: null
> Metadata: null
>
> CrawlDatum::
> Version: 5
> Status: 67 (linked)
> Fetch time: Thu May 17 11:39:51 EDT 2007
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 30.0 days
> Score: 0.03846154
> Signature: null
> Metadata: null
>
> CrawlDatum::
> Version: 5
> Status: 67 (linked)
> Fetch time: Thu May 17 11:39:51 EDT 2007
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 30.0 days
> Score: 0.03846154
> Signature: null
> Metadata: null
>
> CrawlDatum::
> Version: 5
> Status: 67 (linked)
> Fetch time: Thu May 17 11:39:51 EDT 2007
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 30.0 days
> Score: 0.03846154
> Signature: null
> Metadata: null
>
> CrawlDatum::
> Version: 5
> Status: 67 (linked)
> Fetch time: Thu May 17 11:39:51 EDT 2007
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 30.0 days
> Score: 0.03846154
> Signature: null
> Metadata: null
>
> CrawlDatum::
> Version: 5
> Status: 67 (linked)
> Fetch time: Thu May 17 11:39:51 EDT 2007
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 30.0 days
> Score: 0.03846154
> Signature: null
> Metadata: null

In this case, a linked status indicates an outlink. Most likely your
url (http://www.moma.org) contains six distinct outlinks to
http://www.moma.org/application/x-shockwave-flash. Each of them is put
as a separate entry into crawl_parse. This is used by updatedb to
(among other things) calculate the score.
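[Editor's note: a sketch of the scoring arithmetic implied by the dump, assuming OPIC-style scoring in which a page's score is split evenly among its outlinks. The outlink count of 26 is an inference from the numbers, not stated in the thread: 1.0 / 26 ≈ 0.03846154, which matches the per-link score shown.]

```python
# Assumed OPIC-style accounting: the source page's score (1.0 in the dump)
# is divided evenly among its outlinks, and updatedb later sums one
# contribution per "linked" entry for the same target URL.
page_score = 1.0
total_outlinks = 26                  # inferred: 1.0 / 26 ~= 0.03846154
per_link = page_score / total_outlinks

occurrences = 6                      # six "linked" entries for one target
contribution = sum(per_link for _ in range(occurrences))

print(round(per_link, 8))            # 0.03846154 -- matches the dump
print(round(contribution, 8))        # 0.23076923 -- the summed score
```

So the six identical entries are not redundant: each occurrence of the link carries its share of the source page's score into updatedb.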

>
>
> There are 6 CrawlDatum fields and all of them are exactly identical.
> Is this a bug or am I missing something here?
>
> Any light on this matter would be greatly appreciated.
> Thank you.
>
> Florent
>


--
Doğacan Güney

Re: readseg bug?

Florent Gluck
Thank you for the explanation.  It was a bit confusing at first, but it
actually makes sense.

Florent
