Seeking Insight into Nutch Configurations

6 messages

Seeking Insight into Nutch Configurations

Scott Gonyea
Hi, I've been digging through Google and the archives quite thoroughly, to little avail. Please excuse any grammar mistakes; I just moved and lack Internet for my laptop.

The big problem that I am facing, thus far, occurs on the 4th fetch. All but 1 or 2 maps complete. All of the running reduces stall (0.00 MB/s), presumably because they are waiting on that map to finish? I really don't know and it's frustrating.

I've been playing heavily with the numbers, but no matter how many maps/reduces I set in mapred-site, the outcome is the same.

I've created dozens of hadoop AMIs that have had tweaks in the following ranges:
Memory assigned: 512m-2048m
Fetcher threads: 64-1024 (King of the DoS!)
Tracker Concurrent Maps: 1-32
Jobtracker Total Maps: 11(1/node)-1091~
Tracker Concurrent Reduces: 1-32
Jobtracker Total Reduces: 11(1/node)-1091~

There are more and I'll share some of my conf files once I'm able to do so. I would sincerely appreciate some insight into how to configure the various settings in Nutch/Hadoop.

My scenario:
# Sites: 10,000-30,000 per crawl
Depth: ~5
Content: Text is all that I care for. (HTML/RSS/XML)
Nodes: Amazon EC2 (ugh)
Storage: I've performed crawls with HDFS and with amazon S3. I thought S3 would be more performant, yet it doesn't appear to affect matters.
Cost vs Speed: I don't mind throwing EC2 instances at this to get it done quickly... But I can't imagine I need much more than 10-20 mid-size instances for this.

Can anyone share their own experiences in the performance they've seen?

Thank you very much,
Scott Gonyea

Re: Seeking Insight into Nutch Configurations

Andrzej Białecki-2
On 2010-08-02 10:17, Scott Gonyea wrote:
> The big problem that I am facing, thus far, occurs on the 4th fetch.
> All but 1 or 2 maps complete. All of the running reduces stall (0.00
> MB/s), presumably because they are waiting on that map to finish? I
> really don't know and it's frustrating.

Yes, all map tasks need to finish before reduce tasks are able to
proceed. The reason is that each reduce task receives a portion of the
keyspace (and values) according to the Partitioner; in order to prepare
a nice <key, list(value)> for your reducer, it first needs to collect
all the values under that key, from whichever map tasks produced the
tuples, and then sort them.
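To illustrate the constraint (a toy Python sketch, not Hadoop's actual code - Hadoop's HashPartitioner is Java, and Python's built-in hash is only an analogy): each key is routed by hash modulo the number of reduce tasks, so a reducer owns a slice of the keyspace rather than a slice of the maps, and cannot finish its groups until every map has contributed.

```python
from collections import defaultdict

def partition(key, num_reduces):
    # Analogue of Hadoop's default HashPartitioner: hash(key) mod numReduceTasks.
    # (Python's str hash is salted per process, but stable within a run.)
    return hash(key) % num_reduces

def shuffle(map_outputs, num_reduces):
    # Every map task's output must be routed before any reducer can start:
    # values for one key may come from any of the maps.
    buckets = [defaultdict(list) for _ in range(num_reduces)]
    for task_output in map_outputs:          # one list of (key, value) per map
        for key, value in task_output:
            buckets[partition(key, num_reduces)][key].append(value)
    # Each reducer then sees sorted <key, list(value)> groups.
    return [sorted(b.items()) for b in buckets]

maps = [[("a.com", 1), ("b.com", 2)], [("a.com", 3), ("c.com", 4)]]
groups = shuffle(maps, 2)
# "a.com" lands in exactly one bucket, with values from both map tasks.
```

Note that if one map task stalls, `shuffle` never completes - which is exactly the behaviour described above.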

The failing tasks probably fail due to some other factor, and very
likely (based on my experience) the failure is related to some
particular URLs. E.g. regex URL filtering can choke on pathological
URLs, like URLs 20kB long or containing '\0'. In my experience it's
best to keep regex filtering to a minimum if you can, and use other
urlfilters (prefix, domain, suffix, custom) to limit your crawling
frontier. There are simply too many ways a regex engine can lock up.
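The idea can be sketched like this (a hypothetical Python guard, not one of Nutch's actual urlfilter plugins; the prefix list and length limit are made-up values): prefix and length checks never backtrack, so pathological URLs are rejected before any regex sees them.

```python
ALLOWED_PREFIXES = ("http://", "https://")   # assumption: http(s)-only crawl
MAX_URL_LEN = 2048                           # pathological URLs can run to 20kB+

def cheap_url_filter(url):
    # Length and substring checks run in linear time with no backtracking,
    # unlike a regex engine, which can lock up on adversarial input.
    if len(url) > MAX_URL_LEN:
        return None
    if "\0" in url:
        return None
    if not url.startswith(ALLOWED_PREFIXES):
        return None
    return url
```

Anything this filter passes can then go through a minimal regex stage with far less risk.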

Please check the logs of the failing tasks. If you see that a task is
stalled you could also log in to the node, and generate a thread dump a
few times in a row (kill -SIGQUIT <pid>) - if each thread dump shows the
regex processing then it's likely this is your problem.

> My scenario:
> # Sites: 10,000-30,000 per crawl
> Depth: ~5
> Content: Text is all that I care for. (HTML/RSS/XML)
> Nodes: Amazon EC2 (ugh)
> Storage: I've performed crawls with HDFS and with amazon S3. I
> thought S3 would be more performant, yet it doesn't appear to affect
> matters.
> Cost vs Speed: I don't mind throwing EC2 instances at this to get it
> done quickly... But I can't imagine I need much more than 10-20
> mid-size instances for this.

That's correct - with this number of unique sites the max. throughput of
your crawl will be ultimately limited by the politeness limits (# of
requests/site/sec).
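A rough back-of-envelope illustrates the ceiling (hypothetical numbers; the actual per-site delay is whatever your politeness settings impose, e.g. fetcher.server.delay):

```python
def max_pages_per_hour(num_sites, delay_seconds):
    # With polite crawling, each site yields at most one page per delay
    # interval, no matter how many fetcher threads or nodes you add.
    per_site_rate = 1.0 / delay_seconds      # pages/sec from one site
    return int(num_sites * per_site_rate * 3600)

# 10,000 sites at a 5-second per-site delay:
cap = max_pages_per_hour(10_000, 5.0)       # 7,200,000 pages/hour upper bound
```

Past the point where the cluster can sustain that rate, adding instances buys nothing.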

>
> Can anyone share their own experiences in the performance they've
> seen?

There is a very simple benchmark in trunk/ that you could use to measure
the raw performance (data processing throughput) of your EC2 cluster.
The real-life performance, though, will depend on many other factors,
such as the number of unique sites, their individual speed, and (rarely)
the total bandwidth at your end.


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Seeking Insight into Nutch Configurations

Scott Gonyea-2
Thank you very much, Andrzej.  I'm really hoping some people can share some non-sensitive details of their setup.  I'm really curious about the following:

The ratio of Maps to Reduces for their nutch jobs?
The amount of memory that they allocate to each job task?
The number of simultaneous Maps/Reduces on any given host?
The number of fetcher threads they execute?

Any config setup people can share would be great, so I can have a different perspective on how people setup their nutch-site and mapred-site files.

At the moment, I'm experimenting with the following configs:


I'm giving each task 2048m of memory.  Up to 5 Maps and 2 Reduces run at any given time.  I have Nutch firing off 181 Maps and 41 Reduces.  Those are both prime numbers, but I don't know if that really matters; I've seen Hadoop say that the number of reducers should be around the number of nodes you have (the nearest prime).  I've also seen suggestions, somewhere, that the Nutch map-to-reduce ratio be anywhere from 1:0.93 to 1:1.25.  Does anyone have insight to share on that?

Thank you, Andrzej, for the SIGQUIT suggestion.  I'd forgotten about that.  I'm waiting for it to return to the 4th fetch step, so I can see why Nutch hates me so much.

sg



Re: Seeking Insight into Nutch Configurations

Scott Gonyea
By the way, can anyone tell me if there is a way to explicitly limit how many pages should be fetched, per fetcher-task?

I know that one limit to this, is that each site/domain/whatever could exceed that limit (assuming the limit were lower than the number of those sites).  For politeness, that limit would have to be soft.  But that's more than suitable, in my opinion.

Part of the problem, I think, is that Nutch seems to be generating some really unbalanced fetcher tasks.

The task (task_201008021617_0026_m_000000) had 6859 pages to fetch.  Each higher-numbered task had fewer pages to fetch.  Task 000180 only had 44 pages to fetch.

This *huge* imbalance, I think, makes task run times unpredictable.  All of my other nodes just sit around, wasting resources, until the one task that grabbed some crazy number of sites finishes.

Thanks again,
sg



Re: Seeking Insight into Nutch Configurations

Andrzej Białecki-2
On 2010-08-02 22:59, Scott Gonyea wrote:
> By the way, can anyone tell me if there is a way to explicitly limit how
> many pages should be fetched, per fetcher-task?

I believe that in the general case this would be a very complex problem
to solve exactly. The reason is that Nutch doesn't use any global lock
manager, so the only way to ensure proper per-host locking is to assign
all URL-s from any given host to the same map task. This may (and often
will) create an imbalance in the number of allocated URL-s per task.

One method to mitigate this imbalance is to set generate.max.count (in
trunk, generate.max.per.host in 1.1) - this will limit the number of
URL-s from any given host to X, thus helping in a more balanced mixing
of these N per-host chunks across M maps.
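A toy sketch of the two constraints (hypothetical Python, not the actual Generator code; `max_per_host` stands in for generate.max.count): every URL of a host sticks to one task, and the per-host cap keeps any single host from bloating its task.

```python
from collections import Counter
from urllib.parse import urlparse

def assign_to_tasks(urls, num_tasks, max_per_host=None):
    # Constraint 1: all URLs of a given host land in the same map task,
    # since there is no global lock manager to coordinate politeness.
    per_host = Counter()
    tasks = [[] for _ in range(num_tasks)]
    for url in urls:
        host = urlparse(url).netloc
        # Constraint 2: optionally cap URLs per host (generate.max.count).
        if max_per_host is not None and per_host[host] >= max_per_host:
            continue
        per_host[host] += 1
        tasks[hash(host) % num_tasks].append(url)
    return tasks
```

Without the cap, one large host produces exactly the skew described in this thread: one task with thousands of URLs while the rest get a few dozen.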

> Part of the problem, I think, is that Nutch seems to be generating
> some really unbalanced fetcher tasks.
>
> The task (task_201008021617_0026_m_000000) had 6859 pages to fetch.
>   Each higher-numbered task had fewer pages to fetch.  Task 000180 only
> had 44 pages to fetch.

There's no specific tool to examine the composition of fetchlist
parts... try running this in the segments/2010*/crawl_generate/:

for i in part-00*; do
    echo "---- part $i -----"
    strings "$i" | grep 'http://'
done

to print URL-s per map task. Most likely you will see that there was no
other way to allocate the URLs per task to satisfy the constraint that I
explained above. If it's not the case, then it's a bug. :)

>
> This *huge* imbalance, I think, makes task run times unpredictable.
> All of my other nodes just sit around, wasting resources, until one
> task grabs some crazy number of sites.

Again, generate.max.count is your friend - even though you won't be able
to get all pages from a big site in one go, at least your crawls will
finish quickly and you will quickly progress breadth-wise, if not
depth-wise.

--
Best regards,
Andrzej Bialecki


Re: Seeking Insight into Nutch Configurations

Andrzej Białecki-2
On 2010-08-02 20:57, Scott Gonyea wrote:
> Thank you very much, Adrzej.  I'm really hoping some people can share
> some non-sensitive details of their setup.  I'm really curious about the
> following:
>
> The ratio of Maps to Reduces for their nutch jobs?

This depends on the job and the amount of data. The more data, the more
map tasks you will have. The number of reduce tasks is fixed, and it
should be set so that the sort and reduce operations in each reduce
task operate on a reasonably-sized chunk of output data. Hence the
recommendation to set it to something between 1x-2x the number of nodes.

(BTW, the thing about primes is to avoid task scheduling issues, esp. in
presence of speculative execution... but that's another subject).


> The amount of memory that they allocate to each job task?

Sufficient for the task ;) Both map and reduce operate on a fixed
memory budget, using on-disk iterators - so you don't need to worry
about the total number of records you want to process; just allocate
enough memory to correctly process a tuple, with some room to spare for
Hadoop task buffers, and optionally a bit more if you use a Combiner.
All in all, I rarely see a good reason to go above 768MB, and often use
less than that.

> The number of simultaneous Maps/Reduces on any given host?

Depends on the host - the amount of RAM/CPU. I usually use values
ranging from 2 maps (low-end hardware) to 4 (regular hardware) to 8
(higher-end hardware), whatever low/regular/high means... Reduce tasks
also include the sorting, which is IO-intensive, so I usually allocate
1-2 per node.

> The number of fetcher threads they execute?

Something between 10-100. With higher values you need a dedicated DNS
cache - 100 threads all looking up IP-s take their toll...
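A minimal sketch of the caching idea (hypothetical; in production you would more likely run a local caching daemon such as dnsmasq in front of the cluster than cache in-process):

```python
import socket
from functools import lru_cache

@lru_cache(maxsize=10_000)
def resolve(host):
    # With hundreds of fetcher threads, repeat lookups for the same host
    # hit this cache instead of going to the resolver every time.
    return socket.gethostbyname(host)
```

The first lookup for a host pays the resolver round-trip; every later thread asking for the same host gets the cached answer.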


> I'm giving each task 2048m of memory.

This is likely too much. Not that it really hurts if you have enough
RAM... but JVMs may actually be less efficient if you give them
unnecessarily huge heaps.

--
Best regards,
Andrzej Bialecki