Timeout Errors Percentages on Large Fetches

Dennis Kubes
What timeout percentages is everybody else seeing on large fetches?

We are running a 1M page crawl and seeing about a 10% timeout rate on a
2 Mbps line with, I think, about 165 fetchers.  We have
fetcher.threads.fetch set to 3 but have 55 map tasks as the default on an
11-node cluster.  If I am not mistaken, this works out to 3 * 55 = 165
fetchers.  It is running at about 16 pages/second with about 10% timeouts,
and I don't know whether that is due to my settings pretty much pegging
the available bandwidth or to the websites I am crawling being down or
non-responsive.  10% seems a little high for down or non-responsive
sites.

Dennis

RE: Timeout Errors Percentages on Large Fetches

Ledio Ago
I don't think 10% is bad, but also look at URLs that come from the same host.
The Nutch fetcher applies rate control when it hits the same host multiple
times in sequence.  Look at the "Retry Later..." errors: how much of the 10%
is Retry...?

-Ledio

-----Original Message-----
From: Dennis Kubes [mailto:[hidden email]]
Sent: Thursday, May 18, 2006 2:55 PM
To: [hidden email]
Subject: Timeout Errors Percentages on Large Fetches



Re: Timeout Errors Percentages on Large Fetches

Andrzej Białecki-2

I get these symptoms when the total number of threads is too high for
the available bandwidth.
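
One rough way to check this against the figures quoted in the thread is a
back-of-envelope bandwidth budget.  The sketch below is only an
illustration: it assumes 1 Mbps = 10^6 bits/s, ignores protocol overhead,
and the function names are made up for this example rather than taken
from Nutch.

```python
# Back-of-envelope bandwidth budget using the figures from this thread.
# Assumptions: 1 Mbps = 10**6 bits/s, no protocol overhead.
# All names here are hypothetical helpers, not part of Nutch.

LINE_MBPS = 2.0          # line capacity reported in the thread
PAGES_PER_SECOND = 16.0  # observed fetch rate
FETCHERS = 165           # 3 threads per task * 55 map tasks


def bytes_per_second(mbps: float) -> float:
    """Convert megabits per second to bytes per second."""
    return mbps * 1_000_000 / 8


def budget_per_page(mbps: float, pages_per_second: float) -> float:
    """Largest average page size (bytes) the line can carry at this rate."""
    return bytes_per_second(mbps) / pages_per_second


def budget_per_fetcher(mbps: float, fetchers: int) -> float:
    """Bytes per second each thread gets if all threads download at once."""
    return bytes_per_second(mbps) / fetchers


if __name__ == "__main__":
    print(f"per page:    {budget_per_page(LINE_MBPS, PAGES_PER_SECOND):.0f} bytes")
    print(f"per fetcher: {budget_per_fetcher(LINE_MBPS, FETCHERS):.0f} bytes/s")
```

Under these assumptions a fully used 2 Mbps line leaves about 15.6 KB per
page at 16 pages/second, and about 1.5 KB/s per thread when all 165
threads are downloading at once.  If the average page size of the crawl
is near that per-page figure, the line is likely saturated.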

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Timeout Errors Percentages on Large Fetches

Dennis Kubes
OK, when this crawl is finished (only 10 hours to go), I will try with
fewer fetchers.  Am I right that the number of fetchers is
fetcher.threads.fetch * the number of map tasks running?


Re: Timeout Errors Percentages on Large Fetches

Andrzej Białecki-2

Yes (modulo the "politeness" blocking, but I'm assuming you don't crawl
a single site...).
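
As a small sketch of that relationship, the snippet below computes the
total thread count and the per-task setting needed to stay under a
target.  The helper names are hypothetical, and the result is an upper
bound on concurrent fetches, since politeness blocking can leave some
threads idle.

```python
# Sketch of the thread arithmetic confirmed above.
# Helper names are hypothetical; they are not Nutch APIs.


def total_fetchers(threads_per_task: int, map_tasks: int) -> int:
    """Upper bound on concurrent fetcher threads across the cluster."""
    return threads_per_task * map_tasks


def threads_per_task_for(target_total: int, map_tasks: int) -> int:
    """Largest per-task thread count keeping the total at or under the
    target, but never below 1, since each running task needs a thread."""
    return max(1, target_total // map_tasks)


if __name__ == "__main__":
    # Settings from the thread: fetcher.threads.fetch = 3, 55 map tasks.
    print(total_fetchers(3, 55))
    # Per-task setting for a hypothetical target of 110 threads in total.
    print(threads_per_task_for(110, 55))
```

With 55 map tasks the total cannot drop below 55 threads by changing
fetcher.threads.fetch alone; going lower would mean running fewer map
tasks.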
