Nutch indexes fewer pages than it fetches

23 messages
Nutch indexes fewer pages than it fetches

caezar
Hi All,

I've got a strange problem: Nutch indexes far fewer URLs than it fetches. Take, for example, the URL http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm.
I assume it fetched successfully, because the fetch log mentions it exactly once:
2009-10-26 10:01:46,502 INFO org.apache.nutch.fetcher.Fetcher: fetching http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm

But it was not sent to the indexer during the indexing phase (I'm using a custom NutchIndexWriter that logs every page for which its write method is executed). What could be the reason? Is there a way to browse the crawldb to make sure the page was really fetched? What else could I check?

Thanks

Re: Nutch indexes fewer pages than it fetches

皮皮
Check the parse data first; maybe it failed to parse.


Re: Nutch indexes fewer pages than it fetches

kevin chen-6
In reply to this post by caezar
I have had a similar experience.

Reinhard Schwab posted a possible fix; see the mail in this group from Reinhard Schwab at Sun, 25 Oct 2009 10:03:41 +0100 (05:03 EDT).

I haven't had a chance to try it out yet.
 

Re: Nutch indexes fewer pages than it fetches

reinhard
What is the DB status of this URL in your CrawlDb?
If it is STATUS_DB_NOTMODIFIED, that may be the reason.
You can check it by dumping your CrawlDb entry for the URL:

reinhard@thord:>bin/nutch readdb <crawldb> -url <url>

It has this status if the page was recrawled and its signature did not change; the signature is an MD5 hash of the content.

Another reason could be indexing filters, but I don't believe that's the reason here.

regards
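[Editor's note: beyond looking up a single URL, readdb can also summarize or dump the whole CrawlDb. A minimal sketch follows; the crawl/crawldb path and the example URL are placeholders, not values from this thread.]

```shell
# Status counts per CrawlDatum state (db_fetched, db_redir_perm, ...)
bin/nutch readdb crawl/crawldb -stats

# Full plain-text dump of the CrawlDb, useful for grepping statuses
bin/nutch readdb crawl/crawldb -dump crawldb-dump

# Entry for one URL (hypothetical example URL)
bin/nutch readdb crawl/crawldb -url http://www.example.com/page.htm
```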



Re: Nutch indexes fewer pages than it fetches

caezar
In reply to this post by 皮皮
Sorry, but how could I do this?
皮皮 wrote:
Check the parse data first; maybe it failed to parse.


Re: Nutch indexes fewer pages than it fetches

caezar
In reply to this post by reinhard
Thanks, that was really helpful. I've moved forward but still haven't found the solution.
The status of the initial URL (http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm) is:
Status: 5 (db_redir_perm)
Metadata: _pst_: moved(12), lastModified=0: http://www.1stdirectory.com/Companies/1627406_Darwins_Catering_Limited.htm

That answers why the initial page was not indexed: it was redirected.
Now checking the status of the redirect target:
Status: 2 (db_fetched)

So it was successfully fetched. But according to the indexing log, it still was not sent to the indexer!



Re: Nutch indexes fewer pages than it fetches

reinhard
Yes, it's permanently redirected.
You can also check the segment status of this URL. Here is an example:

reinhard@thord:>bin/nutch readseg -get crawl/segments/20091028122455
"http://www.krems.at/fotoalbum/fotoalbum.asp?albumid=37&big=1&seitenid=20"

It will show you whether the page was parsed, along with the extracted outlinks; it shows any data related to this URL stored in the segment.

regards
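[Editor's note: besides -get, readseg can list segments and dump their contents. A sketch with placeholder paths:]

```shell
# Record counts (generated/fetched/parsed) per segment
bin/nutch readseg -list -dir crawl/segments

# Dump one segment to a plain-text directory for inspection
bin/nutch readseg -dump crawl/segments/20091028122455 seg-dump
```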


Re: Nutch indexes fewer pages than it fetches

caezar
Thanks, I checked; it was parsed. Still no answer as to why it was not indexed.

Re: Nutch indexes fewer pages than it fetches

reinhard
Hmm, I have no idea now. Check the reduce method in IndexerMapReduce and add some debug statements there, then recompile Nutch and try again.


Re: Nutch indexes fewer pages than it fetches

caezar
In IndexerMapReduce.reduce there is this code:

    if (CrawlDatum.STATUS_LINKED == datum.getStatus() ||
        CrawlDatum.STATUS_SIGNATURE == datum.getStatus()) {
      continue;
    }

And the status of the redirect-target URL is indeed "linked"; that's why it's skipped. But what does this status mean?

Re: Nutch indexes fewer pages than it fetches

caezar
Some more information. Debugging the reduce method, I've noticed that before this code:

    if (fetchDatum == null || dbDatum == null
        || parseText == null || parseData == null) {
      return;                                     // only have inlinks
    }

my page has fetchDatum, parseText and parseData non-null, but dbDatum is null. That's why it's skipped :)
Any ideas about the reason?

Re: Nutch indexes fewer pages than it fetches

Andrzej Białecki-2

Yes. You should run updatedb with this segment, and also run invertlinks with this segment, _before_ trying to index. Otherwise the DB status won't be updated properly.


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
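[Editor's note: for reference, a sketch of one crawl cycle with updatedb and invertlinks run before indexing, as Andrzej describes. Paths are placeholders, and the exact indexing command varies by Nutch version and backend, so treat this as an outline rather than a recipe.]

```shell
CRAWLDB=crawl/crawldb     # placeholder paths
SEGMENTS=crawl/segments
LINKDB=crawl/linkdb

bin/nutch generate "$CRAWLDB" "$SEGMENTS"
SEGMENT=$(ls -d "$SEGMENTS"/* | tail -1)   # newest segment
bin/nutch fetch "$SEGMENT"
bin/nutch parse "$SEGMENT"                 # unless the fetcher parses inline
bin/nutch updatedb "$CRAWLDB" "$SEGMENT"   # writes fetch/redirect results back to the CrawlDb
bin/nutch invertlinks "$LINKDB" -dir "$SEGMENTS"
# Indexing comes last; the command depends on the Nutch version in use
bin/nutch index index "$CRAWLDB" "$LINKDB" "$SEGMENT"
```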


Re: Nutch indexes fewer pages than it fetches

caezar
I'm pretty sure I ran both commands before indexing.

Re: Nutch indexes fewer pages than it fetches

caezar
I've compared the segment data of a URL that has no redirect and was indexed correctly against this "bad" URL, and there is indeed a difference. The first one has a db record in the segment:
Crawl Generate::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Wed Oct 28 16:01:05 EET 2009
Modified time: Thu Jan 01 02:00:00 EET 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1256738472613
 
But the second one has no such record, which actually seems fine: it was not added to the segment at the generate stage; it was added at the fetch stage. Is this a bug in Nutch, or am I missing some configuration option?

Re: Nutch indexes fewer pages than it fetches

reinhard
Is your problem solved now?

This can be OK. Newly discovered URLs are added to a segment when fetched documents are parsed, provided those URLs pass the filters. They will not have a Crawl Generate datum because they are unknown until they are extracted.

regards


Re: Nutch indexes fewer pages than it fetches

caezar
No, the problem is not solved. Everything happens as you described, but the page is not indexed because of this condition:

    if (fetchDatum == null || dbDatum == null
        || parseText == null || parseData == null) {
      return;                                     // only have inlinks
    }

in the IndexerMapReduce code. For this page dbDatum is null, so it is not indexed!
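[Editor's note: since the redirect target only gets a CrawlDb record via updatedb, one hedged thing to try is re-applying updatedb over every segment and then checking whether the target now has an entry before re-indexing. Paths are placeholders; this is a sketch, not a confirmed fix.]

```shell
# Re-run updatedb for all segments so fetch-time discoveries
# (such as redirect targets) gain a CrawlDb record
for seg in crawl/segments/*; do
  bin/nutch updatedb crawl/crawldb "$seg"
done
bin/nutch invertlinks crawl/linkdb -dir crawl/segments

# Verify the redirect target now has a CrawlDb entry
bin/nutch readdb crawl/crawldb -url \
  "http://www.1stdirectory.com/Companies/1627406_Darwins_Catering_Limited.htm"
```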
reinhard schwab wrote
is your problem solved now???

this can be ok.
new discovered urls will be added to a segment when fetched documents
are parsed and if these urls pass the filters.
they will not have a crawl datum Generate because they are unknown until
they are extracted.

regards

caezar schrieb:
> I've compared the segment data of a URL which has no redirect and was
> indexed correctly with this "bad" URL, and there really is a difference.
> The first one has a db record in the segment:
> Crawl Generate::
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Wed Oct 28 16:01:05 EET 2009
> Modified time: Thu Jan 01 02:00:00 EET 1970
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 1.0
> Signature: null
> Metadata: _ngt_: 1256738472613
>  
> But the second one has no such record, which seems pretty normal: it was
> not added to the segment at the generate stage; it was added at the fetch
> stage. Is this a bug in Nutch, or am I missing some configuration option?
Reply | Threaded
Open this post in threaded view
|

Re: Nutch indexes less pages, then it fetches

reinhard
What is in the CrawlDb?

reinhard@thord:>bin/nutch readdb  <crawldb> -url <url>


caezar schrieb:

> No, the problem is not solved. Everything happens as you described, but
> the page is not indexed because of this condition:
>     if (fetchDatum == null || dbDatum == null
>         || parseText == null || parseData == null) {
>       return;                                     // only have inlinks
>     }
> in the IndexerMapReduce code. For this page dbDatum is null, so it is not
> indexed!

Reply | Threaded
Open this post in threaded view
|

Re: Nutch indexes less pages, then it fetches

caezar
Status: 5 (db_redir_perm) for the redirect source, and
Status: 2 (db_fetched) for the redirect target.
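That split is the symptom in miniature: for a permanent redirect, the db side of the join sits under the source URL while the fetch and parse output is keyed under the target URL, so neither key ends up with all four parts. A toy illustration (hypothetical record maps and helper, not Nutch APIs) of how this leaves both URLs unindexed:

```java
import java.util.Map;

// Toy records showing how a permanent redirect splits the four parts
// of the indexing join across two URL keys, so both get skipped.
public class RedirectSplitSketch {
    // Same skip rule as the IndexerMapReduce null check, on toy records.
    static boolean allPartsPresent(Map<String, Object> parts) {
        return parts.containsKey("dbDatum") && parts.containsKey("fetchDatum")
            && parts.containsKey("parseText") && parts.containsKey("parseData");
    }

    public static void main(String[] args) {
        // Redirect source: has a CrawlDb entry (db_redir_perm) and a fetch
        // datum, but the parsed content lives under the target URL.
        Map<String, Object> source = Map.of(
            "dbDatum", "db_redir_perm",
            "fetchDatum", new Object());
        // Redirect target: fetch and parse output are keyed here, but no
        // db datum joined in under this key.
        Map<String, Object> target = Map.of(
            "fetchDatum", new Object(),
            "parseText", new Object(),
            "parseData", new Object());
        System.out.println(allPartsPresent(source)); // prints false
        System.out.println(allPartsPresent(target)); // prints false
    }
}
```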
reinhard schwab wrote
what is in the crawl db?

reinhard@thord:>bin/nutch readdb  <crawldb> -url <url>


Reply | Threaded
Open this post in threaded view
|

Re: Nutch indexes less pages, then it fetches

J. Smith-2
Does anybody know how to solve this problem?

Reply | Threaded
Open this post in threaded view
|

Re: Nutch indexes less pages, then it fetches

caezar
I've solved this problem by modifying the Nutch code. If this solution is acceptable to you, I can provide the details.
J. Smith wrote
Does anybody know how to solve this problem?