Seeking Insight into Nutch Configurations

Scott Gonyea
Hi, I've been digging through Google and the list archives quite thoroughly, to little avail. Please excuse any grammar mistakes; I just moved and don't have Internet access for my laptop.

The big problem I am facing, thus far, occurs on the 4th fetch: all but one or two maps complete, and all of the running reduces stall at 0.00 MB/s, presumably because they are waiting on those last maps to finish? I really don't know, and it's frustrating.

I've been playing heavily with the numbers, but however many maps/reduces I set in mapred-site.xml, the outcome is the same.

I've created dozens of Hadoop AMIs with tweaks in the following ranges:
Memory assigned: 512m-2048m
Fetcher threads: 64-1024 (King of the DoS!)
Tracker concurrent maps: 1-32
Jobtracker total maps: 11 (1 per node) to ~1091
Tracker concurrent reduces: 1-32
Jobtracker total reduces: 11 (1 per node) to ~1091

There are more and I'll share some of my conf files once I'm able to do so. I would sincerely appreciate some insight into how to configure the various settings in Nutch/Hadoop.
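
In the meantime, here is a rough sketch of where those knobs live, assuming the stock Hadoop 0.20.x and Nutch 1.x property names; the values below are just placeholders pulled from the middle of the ranges above, not what I'm actually running:

  <!-- mapred-site.xml -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value>   <!-- heap per task JVM: the 512m-2048m range -->
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>           <!-- concurrent maps per tasktracker: the 1-32 range -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>           <!-- concurrent reduces per tasktracker: the 1-32 range -->
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>44</value>          <!-- default total maps per job: the 11 to ~1091 range -->
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>22</value>          <!-- default total reduces per job -->
  </property>

  <!-- nutch-site.xml -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>128</value>         <!-- fetcher threads per fetch task: the 64-1024 range -->
  </property>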

My scenario:
Sites: 10,000-30,000 per crawl
Depth: ~5
Content: Text is all I care about (HTML/RSS/XML)
Nodes: Amazon EC2 (ugh)
Storage: I've performed crawls with HDFS and with Amazon S3. I thought S3 would be more performant, yet it doesn't appear to affect matters.
Cost vs. speed: I don't mind throwing EC2 instances at this to get it done quickly... but I can't imagine I need much more than 10-20 mid-size instances for this.

Can anyone share their own experiences in the performance they've seen?

Thank you very much,
Scott Gonyea
Re: Seeking Insight into Nutch Configurations

Scott Gonyea-2
Thank you very much, Andrzej.  I'm really hoping some people can share some
non-sensitive details of their setup.  I'm really curious about the
following:

The ratio of Maps to Reduces for their nutch jobs?
The amount of memory that they allocate to each job task?
The number of simultaneous Maps/Reduces on any given host?
The number of fetcher threads they execute?

Any config setup people can share would be great, so I can get a different
perspective on how people set up their nutch-site and mapred-site files.

At the moment, I'm experimenting with the following configs:

http://gist.github.com/505065

I'm giving each task 2048m of memory.  Up to 5 maps and 2 reduces run at any
given time.  I have Nutch firing off 181 maps and 41 reduces.  Those are
both prime numbers, but I don't know if that really matters.  I've seen
Hadoop suggest that the number of reducers should be around the number of
nodes you have (the nearest prime).  I've also seen, somewhere, suggestions
that the Nutch map/reduce ratio be anywhere from 1:0.93 to 1:1.25.  Does
anyone have insight to share on that?
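
(For whatever it's worth, the rule of thumb I vaguely recall from the Hadoop
MapReduce docs is reduces ~= 0.95 or 1.75 x (nodes x reduce slots per node),
rather than anything about primes.  With 11 nodes and 2 reduce slots each,
that works out to roughly 0.95 x 22 ~= 21 for a single wave of reduces, or
1.75 x 22 ~= 38 for two waves, so my 41 is at least in that neighborhood.
Corrections welcome if I'm misremembering.)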

Thank you, Andrzej, for the SIGQUIT suggestion.  I had forgotten about that.
I'm waiting for it to return to the 4th fetch step, so I can see why Nutch
hates me so much.

sg

On Mon, Aug 2, 2010 at 3:47 AM, Andrzej Bialecki <[hidden email]> wrote:

> On 2010-08-02 10:17, Scott Gonyea wrote:
>
>> The big problem that I am facing, thus far, occurs on the 4th fetch.
>> All but 1 or 2 maps complete. All of the running reduces stall (0.00
>> MB/s), presumably because they are waiting on that map to finish? I
>> really don't know and it's frustrating.
>>
>
> Yes, all map tasks need to finish before reduce tasks are able to proceed.
> The reason is that each reduce task receives a portion of the keyspace (and
> values) according to the Partitioner, and in order to prepare a nice <key,
> list(value)> in your reducer it needs to, well, get all the values under
> this key first, whichever map task produced the tuples, and then sort them.
>
> The failing tasks probably fail due to some other factor, and very likely
> (based on my experience) the failure is related to some particular URLs.
> E.g. regex URL filtering can choke on some pathological URLs, like URLs 20kB
> long, or containing '\0' etc, etc. In my experience, it's best to keep regex
> filtering to a minimum if you can, and use other urlfilters (prefix, domain,
> suffix, custom) to limit your crawling frontier. There are simply too many
> ways where a regex engine can lock up.
>
> Please check the logs of the failing tasks. If you see that a task is
> stalled you could also log in to the node, and generate a thread dump a few
> times in a row (kill -SIGQUIT <pid>) - if each thread dump shows the regex
> processing then it's likely this is your problem.
>
>
>  My scenario: # Sites: 10,000-30,000 per crawl Depth: ~5 Content: Text
>> is all that I care for. (HTML/RSS/XML) Nodes: Amazon EC2 (ugh)
>> Storage: I've performed crawls with HDFS and with amazon S3. I
>> thought S3 would be more performant, yet it doesn't appear to affect
>> matters. Cost vs Speed: I don't mind throwing EC2 instances at this
>> to get it done quickly... But I can't imagine I need much more than
>> 10-20 mid-size instances for this.
>>
>
> That's correct - with this number of unique sites the max. throughput of
> your crawl will be ultimately limited by the politeness limits (# of
> requests/site/sec).
>
>
>
>> Can anyone share their own experiences in the performance they've
>> seen?
>>
>
> There is a very simple benchmark in trunk/ that you could use to measure
> the raw performance (data processing throughput) of your EC2 cluster. The
> real-life performance, though, will depend on many other factors, such as
> the number of unique sites, their individual speed, and (rarely) the total
> bandwidth at your end.
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
Re: Seeking Insight into Nutch Configurations

Scott Gonyea
By the way, can anyone tell me if there is a way to explicitly limit how
many pages should be fetched per fetcher task?

I know that one wrinkle is that a single site/domain/whatever could exceed
that limit (assuming the limit were lower than the number of pages queued
for it).  For politeness, that limit would have to be soft, but that's more
than suitable, in my opinion.

I think part of the problem is that Nutch seems to be generating
some really unbalanced fetcher tasks.

The first task (task_201008021617_0026_m_000000) had 6859 pages to fetch,
and each higher-numbered task had fewer; task 000180 only had 44 pages to
fetch.

This *huge* imbalance, I think, makes task runtimes unpredictable.  All of
my other nodes just sit around idle while one task chews through some crazy
number of sites.
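
For what it's worth, the closest knob I've found so far is
generate.max.per.host in nutch-default.xml, which, if I'm reading the
description right, caps the number of URLs per host in a single fetch list
(-1 meaning unlimited).  Something like this in nutch-site.xml, with the
value being a pure placeholder:

  <property>
    <name>generate.max.per.host</name>
    <value>500</value>   <!-- placeholder cap on URLs per host, per fetch list -->
  </property>

Even with that, if I understand the partitioner correctly, URLs are split
across fetch tasks by host, so every URL for one huge host still lands in
the same task, which would explain the imbalance I'm seeing; -topN on the
generate step only caps the segment as a whole.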

Thanks again,
sg

Re: Seeking Insight into Nutch Configurations

AJ Chen-3
Does anyone have an EC2 image that runs smoothly for >3,000 domains?  If a
sample of complete Nutch & Hadoop configurations for distributed crawling
on EC2 were available to the community, it would help anyone learn Nutch
best practices quickly.
-aj


Re: Seeking Insight into Nutch Configurations

Scott Gonyea
I have created a long stream of AMIs, and had planned to build one out that
can be used as a cookie-cutter for everyone else.  Time isn't as long as it
used to be (stupid physics).

Besides scrubbing any client-sensitive residue from such an image, I'd need
to figure out how to tell the slaves to point at the master node.  Right
now, I use an Elastic IP and carve that into the AMI's Hadoop installation.

The AMI, by the way, is basically just your run-of-the-mill Hadoop cluster,
with the various conf settings tuned by my trial-by-instance-hour-billing
experience.

As it stands, if I did create such an AMI, I'd probably build it with the
assumption that everyone will happily install Rudy (
http://github.com/solutious/rudy) and have it launch the master, pull out
the IP information for the master node, and then slap that into the Hadoop
conf for each slave node.

I don't follow the Nutch+Hadoop tutorial, in that I don't use the
masters/slaves files.  Rather, the Hadoop slaves just talk to the JobTracker
and say, "hey, you can use me."  That has the benefit of allowing me to scale
up should the load be obscene.  However, because of how things balance out,
I can't really scale down as easily.
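
Concretely, what I mean is roughly this, sketched with the stock Hadoop 0.20
property names; the addresses (and the usual example ports) are placeholders
standing in for whatever Elastic IP is baked into the AMI:

  <!-- core-site.xml on every node -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://MASTER-ELASTIC-IP:9000</value>   <!-- placeholder; NameNode on the master -->
  </property>

  <!-- mapred-site.xml on every node -->
  <property>
    <name>mapred.job.tracker</name>
    <value>MASTER-ELASTIC-IP:9001</value>          <!-- placeholder; JobTracker on the same master -->
  </property>

Any instance booted from the AMI that starts a DataNode/TaskTracker pointed
at those addresses just registers itself; as far as I can tell, the
masters/slaves files are only used by the start-*.sh helper scripts to SSH
out and start daemons, so I don't need them.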

Actually, I really want to figure out how I can micro-manage Hadoop (as well
as Nutch): Nutch, because the segments need to be controlled; Hadoop,
because it would be far nicer to assign work based on a node's 'historical'
performance and be less reliant on hard limits placed in mapred-site.

I may or may not be babbling at this point.

If anyone wants me to put some energy into that AMI, I'll take "votes of
encouragement" in the form of you sharing your conf files, scrubbed of
anything sensitive (e.g., domain names, company data), along with the
performance results you've seen from them.  They can also be sent directly
to me, if you'd prefer that for whatever reason.

I'll start looking into how best to create a non-suck AMI, but that'll be
largely done in my "nobody thinks my time is worth money" hours of the day,
which my wife sometimes interprets as "there's more work left in the
man-slob."

sg

Re: Seeking Insight into Nutch Configurations

AJ Chen-3
Scott, thank you for your effort. Fantastic. -aj
