weird fetcher behavior

Florent Gluck
Hi,

I'm running nutch trunk as of today.  I have 3 slaves and a master.  I'm
using mapred.map.tasks=20 and mapred.reduce.tasks=4.
There is something I'm really confused about.

When I inject 25000 urls and fetch them (depth = 1) and do a readdb
-stats, I get:
060110 171347 Statistics for CrawlDb: crawldb
060110 171347 TOTAL urls:       27939
060110 171347 avg score:        1.011
060110 171347 max score:        8.883
060110 171347 min score:        1.0
060110 171347 retry 0:  26429
060110 171347 retry 1:  1510
060110 171347 status 1 (DB_unfetched):  24248
060110 171347 status 2 (DB_fetched):    3390
060110 171347 status 3 (DB_gone):       301
060110 171347 CrawlDb statistics: done

There are several things that don't make sense to me and it would be
great if someone could clear this up:

1.
If I count the occurrences of "fetching" in all of my slaves'
tasktracker logs, I get: 6225
This number clearly doesn't match the DB_fetched of 3390 from the
readdb output.  Why is that?
What happened to the 6225-3390=2835 missing urls?

2.
Why is the TOTAL urls: 27939 if I inject a file with 25000 entries?
Why is it not 25000?

3.
What is the meaning of DB_gone and DB_unfetched?
I was assuming if you inject a total of 25k urls where 5000 are
fetchable ones, you would get something like:
(DB_unfetched):  20000
(DB_fetched):    5000
It's not the case, so I'd like to understand what exactly is going on here.
Also, what is the meaning of DB_gone?

4.
If I redo (starting from an empty crawldb of course) the exact same
inject + crawl with the same 25000 urls, but I use the following mapred
settings instead: mapred.map.tasks=200 and mapred.reduce.tasks=8, I
get the following readdb output:
060110 162140 TOTAL urls:       33173
060110 162140 avg score:        1.026
060110 162140 max score:        22.083
060110 162140 min score:        1.0
060110 162140 retry 0:  28381
060110 162140 retry 1:  4792
060110 162140 status 1 (DB_unfetched):  23136
060110 162140 status 2 (DB_fetched):    9234
060110 162140 status 3 (DB_gone):       803
060110 162140 CrawlDb statistics: done
How come the DB_fetched is about 3x more than earlier and the TOTAL urls goes
way beyond the 27939 from before?
It doesn't make any sense.  I'd expect to see similar results as before
with the other mapred settings.

Thank you,
Florent

Re: weird fetcher behavior

Doug Cutting-2
Florent Gluck wrote:

> When I inject 25000 urls and fetch them (depth = 1) and do a readdb
> -stats, I get:
> 060110 171347 Statistics for CrawlDb: crawldb
> 060110 171347 TOTAL urls:       27939
> 060110 171347 avg score:        1.011
> 060110 171347 max score:        8.883
> 060110 171347 min score:        1.0
> 060110 171347 retry 0:  26429
> 060110 171347 retry 1:  1510
> 060110 171347 status 1 (DB_unfetched):  24248
> 060110 171347 status 2 (DB_fetched):    3390
> 060110 171347 status 3 (DB_gone):       301
> 060110 171347 CrawlDb statistics: done
>
> There are several things that don't make sense to me and it would be
> great if someone could clear this up:
>
> 1.
> If I count the occurrences of "fetching" in all of my slaves'
> tasktracker logs, I get: 6225
> This number clearly doesn't match the DB_fetched of 3390 from the
> readdb output.  Why is that?
> What happened to the 6225-3390=2835 missing urls?

How many errors are you seeing while fetching?  Are you getting, e.g.,
lots of timeouts or "max delays exceeded"?

You might also try using protocol-http rather than protocol-httpclient.
Others have reported under-fetching issues with protocol-httpclient.
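
One way to run the error analysis asked about here is to tally log lines by kind. The following is a minimal sketch in Python, assuming the tasktracker logs are plain text and that each fetch attempt is logged on a line containing "fetching" (as in the question). The error substrings in MARKERS and the helper names are illustrative assumptions, not Nutch's documented log format; adjust them to whatever the logs actually contain.

```python
from collections import Counter

# Substrings to look for. "fetching" comes from the question above; the
# error markers are assumptions about how failures appear in the logs.
MARKERS = ("fetching", "timed out", "max delays exceeded", "failed with")

def tally_log_lines(lines, markers=MARKERS):
    """Count how many lines contain each marker (case-insensitive)."""
    counts = Counter({m: 0 for m in markers})
    for line in lines:
        lowered = line.lower()
        for marker in markers:
            if marker in lowered:
                counts[marker] += 1
    return counts

def tally_log_files(paths, markers=MARKERS):
    """Sum the per-marker counts over several log files."""
    total = Counter({m: 0 for m in markers})
    for path in paths:
        # errors="replace" keeps the scan going on odd bytes in a log
        with open(path, errors="replace") as handle:
            total.update(tally_log_lines(handle, markers))
    return total
```

If the error counts roughly account for the gap between the number of "fetching" lines and DB_fetched, that would point to failed fetch attempts rather than lost urls.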

> 2.
> Why is the TOTAL urls: 27939 if I inject a file with 25000 entries?
> Why is it not 25000?

When the crawl db is updated it adds pages linked to by fetched pages,
with status DB_unfetched.
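
The arithmetic behind this can be checked directly against the readdb output quoted above. The sketch below, in Python, parses the relevant lines, confirms that the three status counts add up to TOTAL urls, and subtracts the 25000 injected entries. The function name parse_stats is hypothetical, and the final subtraction assumes that none of the injected urls were filtered out or merged as duplicates, which the thread does not confirm.

```python
import re

# Lines copied from the first readdb -stats output in this thread.
STATS = """\
060110 171347 TOTAL urls:       27939
060110 171347 retry 0:  26429
060110 171347 retry 1:  1510
060110 171347 status 1 (DB_unfetched):  24248
060110 171347 status 2 (DB_fetched):    3390
060110 171347 status 3 (DB_gone):       301
"""

def parse_stats(text):
    """Return (total, {status_name: count}) from readdb -stats style lines."""
    total = None
    statuses = {}
    for line in text.splitlines():
        m = re.search(r"TOTAL urls:\s+(\d+)", line)
        if m:
            total = int(m.group(1))
            continue
        m = re.search(r"status \d+ \((\w+)\):\s+(\d+)", line)
        if m:
            statuses[m.group(1)] = int(m.group(2))
    return total, statuses

total, statuses = parse_stats(STATS)
injected = 25000  # size of the injected url file, from the post

# The statuses partition the crawl db: 24248 + 3390 + 301 = 27939.
print(sum(statuses.values()) == total)
# Urls beyond the injected set; an exact count of discovered urls only
# if every injected url made it into the db (an assumption).
print(total - injected)
```

Under that assumption, 2939 of the DB_unfetched entries are newly discovered links rather than injected urls that were never tried.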

> 3.
> What is the meaning of DB_gone and DB_unfetched?
> I was assuming if you inject a total of 25k urls where 5000 are
> fetchable ones, you would get something like:
> (DB_unfetched):  20000
> (DB_fetched):    5000
> It's not the case, so I'd like to understand what exactly is going on here.
> Also, what is the meaning of DB_gone?

DB_gone means that a 404 or some other presumably permanent error was
encountered.  This status prevents future attempts to fetch a url.

> 4.
> If I redo (starting from an empty crawldb of course) the exact same
> inject + crawl with the same 25000 urls, but I use the following mapred
> settings instead: mapred.map.tasks=200 and mapred.reduce.tasks=8, I
> get the following readdb output:
> 060110 162140 TOTAL urls:       33173
> 060110 162140 avg score:        1.026
> 060110 162140 max score:        22.083
> 060110 162140 min score:        1.0
> 060110 162140 retry 0:  28381
> 060110 162140 retry 1:  4792
> 060110 162140 status 1 (DB_unfetched):  23136
> 060110 162140 status 2 (DB_fetched):    9234
> 060110 162140 status 3 (DB_gone):       803
> 060110 162140 CrawlDb statistics: done
> How come the DB_fetched is about 3x more than earlier and the TOTAL urls goes
> way beyond the 27939 from before?
> It doesn't make any sense.  I'd expect to see similar results as before
> with the other mapred settings.

Please look at your fetcher errors.  More, smaller fetch lists mean
that each fetcher task has fewer unique hosts.  I'd actually expect
fewer pages to succeed, but only an analysis of your fetcher errors will
fully explain this.

Again, the reason that the total is higher is that it includes new urls
discovered.
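
A rough consistency check on this explanation, sketched in Python from the two readdb outputs quoted in the thread: if the extra urls come from links on fetched pages, the number of urls beyond the injected 25000 should scale with DB_fetched. The names RUN_A, RUN_B and summarize are hypothetical, and the calculation assumes all 25000 injected urls are present in the db.

```python
INJECTED = 25000  # entries in the injected file, from the post

# Figures copied from the two readdb -stats outputs in this thread.
RUN_A = {"maps": 20, "reduces": 4, "total": 27939, "fetched": 3390}
RUN_B = {"maps": 200, "reduces": 8, "total": 33173, "fetched": 9234}

def summarize(run):
    """Urls beyond the injected set, and how many per fetched page."""
    discovered = run["total"] - INJECTED
    return {
        "discovered": discovered,
        "discovered_per_fetched": discovered / run["fetched"],
    }

for run in (RUN_A, RUN_B):
    s = summarize(run)
    print(run["maps"], "map tasks:", s["discovered"],
          "extra urls,", round(s["discovered_per_fetched"], 3), "per fetched page")

# How many times more pages the second run fetched.
print(round(RUN_B["fetched"] / RUN_A["fetched"], 2))
```

Both runs come out at a little under 0.9 extra urls per fetched page, so the larger total in the second run is in line with its larger DB_fetched. Why DB_fetched itself differs is a separate question that only the fetcher errors can answer.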

Doug

Re: weird fetcher behavior

Florent Gluck
Thanks for your answers, Doug, it makes more sense now.
I'm still puzzled about why the number of DB_fetched changes so much
when using different numbers for the map/reduce task settings.
I'm going to inspect the logs and see if I can track down what's going on.
Also, I tried to use protocol-http rather than protocol-httpclient, but
it didn't make any difference.

Thanks,
Florent
