Fetching inefficiency


Fetching inefficiency

Otis Gospodnetic-2-2
Hello,

I am wondering how others deal with the following, which I see as fetching inefficiency:


When fetching, the fetchlist is broken up into multiple parts and fetchers on cluster nodes start fetching.  Some fetchers end up fetching from fast servers, and some from very very slow servers.  Those fetching from slow servers take a long time to complete and prolong the whole fetching process.  For instance, I've seen tasks from the same fetch job finish in only 1-2 hours, and others in 10 hours.  Those taking 10 hours were stuck fetching pages from a single or handful of slow sites.  If you have two nodes doing the fetching and one is stuck with a slow server, the other one is idling and wasting time.  The node stuck with the slow server is also underutilized, as it's slowly fetching from only 1 server instead of many.
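
For scale, a back-of-the-envelope sketch of how a single slow host with many queued URLs can pin one fetcher task for hours once the per-host politeness delay is added in. All numbers here are illustrative assumptions, not measurements:

// Back-of-the-envelope only: every number below is an assumption.
public class SlowHostEstimate {
  public static void main(String[] args) {
    int urlsForSlowHost = 5000;   // URLs queued for one host in a fetchlist part (assumed)
    double secondsPerUrl = 7.0;   // ~5s politeness delay plus a slow response (assumed)
    double hours = urlsForSlowHost * secondsPerUrl / 3600.0;
    System.out.printf("That one host keeps its fetcher busy for ~%.1f hours%n", hours);
  }
}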

I imagine anyone using Nutch is seeing the same.  If not, what's the trick?

I have not tried overlapping fetching jobs yet, but I have a feeling that won't help a ton, plus it could lead to two fetchers fetching from the same server and being impolite - am I wrong?

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


Re: Fetching inefficiency

Siddhartha Reddy
I do face a similar problem. I occasionally have fetch jobs that are
fetching from fewer than 100 hosts, and the effect is magnified in that case.

I have found one workaround, though I am not sure it is the best possible
solution: I set generate.max.per.host to a fairly small value (like 1000),
which caps the amount of time any task can be held up by a single host.
This does increase the number of generate/fetch cycles needed to finish a
crawl, but it does address the problem you describe. An even lower value
might make sense -- I am still experimenting to find a good one myself.
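
For reference, a minimal sketch (assuming a Nutch 0.9-era classpath) of reading that property back to confirm what the generate job will see; it is normally set in conf/nutch-site.xml rather than in code:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

public class MaxPerHostCheck {
  public static void main(String[] args) {
    // NutchConfiguration picks up nutch-default.xml plus any nutch-site.xml overrides
    Configuration conf = NutchConfiguration.create();
    // -1 (unlimited) is the shipped default; 1000 is the cap discussed above
    int maxPerHost = conf.getInt("generate.max.per.host", -1);
    System.out.println("generate.max.per.host = " + maxPerHost);
  }
}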

In addition, I think NUTCH-629 and NUTCH-570 could help reduce the effects
of the problem caused by slow servers.

Best,
Siddhartha Reddy

--
http://sids.in
"If you are not having fun, you are not doing it right."

Re: Fetching inefficiency

Dennis Kubes-2
In reply to this post by Otis Gospodnetic-2-2
This may not be applicable to what you are doing, but for a whole-web
crawl we tend to separate deep-crawl sites from shallow-crawl sites.
Shallow-crawl sites, which are most of the web, get a maximum of 50 pages,
set via the generate.max.per.host config variable.  A deep crawl contains
only a list of deep-crawl sites, say wikipedia or cnn, is limited by url
filters, and is allowed unlimited urls.  A deep crawl runs through a number
of fetch cycles, say a depth of 3-5.
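
A standalone sketch of the whitelist idea behind those url filters -- in practice this is usually wired up through the prefix- or regex-urlfilter plugins rather than custom code, and the host list below is made up:

import java.net.URL;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class DeepCrawlWhitelist {
  // Hosts we are willing to crawl deeply -- an illustrative, made-up list.
  private static final Set<String> DEEP_HOSTS =
      new HashSet<String>(Arrays.asList("en.wikipedia.org", "www.cnn.com"));

  /** Returns the url if its host is whitelisted, null to drop it (url-filter style). */
  public static String filter(String urlString) {
    try {
      String host = new URL(urlString).getHost().toLowerCase();
      return DEEP_HOSTS.contains(host) ? urlString : null;
    } catch (Exception e) {
      return null; // malformed url -> drop
    }
  }

  public static void main(String[] args) {
    System.out.println(filter("http://www.cnn.com/some/story.html")); // kept
    System.out.println(filter("http://example.com/page.html"));       // null (dropped)
  }
}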

Dennis


Re: Fetching inefficiency

Otis Gospodnetic-2-2
In reply to this post by Otis Gospodnetic-2-2
Hi Dennis,

Ah, interesting - this is one of the things that was in the back of my mind, too: finding a way to "even out" the fetchlists so that, if I can't figure out which servers are slow, I can at least get an approximately equal number of pages from each site in the fetchlist.  It looks like you have two groups of sites - sites with a pile of pages that you want to crawl fully (deep, the head), and sites from which you are willing to fetch only a small number of pages.  This way you end up with two types of fetchlists, each with a roughly equal number of pages from each site.  Did I get that right?

Question: how do you generate these two different types of fetchlists?  Same "generate" run, but with different urlfilter (prefix- or regex-urlfilter) configs?

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



Re: Fetching inefficiency

Dennis Kubes-2
We do it the old-fashioned way :).  The deep crawl is a separate crawldb
with a manually injected list of urls.  The shallow crawl is a regular full
web crawl.  They can have overlapping urls, cnn.com for example.  Shallow
will only fetch 50 pages per host; deep is unlimited, up to the number of
urls for a given shard.  The two are then merged together at the crawldb level.

And yes, we define the number of pages per shard, even in the deep crawls,
through the topN parameter on the generator for fetchlists.  It is
approximate, and because the automated python jobstream grabs the *best*
urls first for each fetch, there is the problem of url degradation.

What I mean by this is that later fetches, even though they start from the
same initial fetchlist size, tend to have fewer urls that are good and
actually fetched.  So let's say we have 40 shards, each with a 2M-page
generate list.  The first ones might fetch 1.95M good pages; the 40th
might only fetch 1M good pages.  As best we can tell, this is simply
bad urls: as scores get lower over continued crawls, you tend to get more
urls that are simply not fetchable.  But since the number of urls per
shard is set in the generator, we haven't found a way around this.

Dennis


Re: Fetching inefficiency

Otis Gospodnetic-2-2
In reply to this post by Otis Gospodnetic-2-2
Siddhartha,

I think decreasing generate.max.per.host will limit the 'wait time' for each fetch run, but I have a feeling that the overall time will be roughly the same.  As a matter of fact, it may be even higher, because you'll have to run generate more times, and if your fetch jobs are too short, you will be spending more time waiting on MapReduce jobs (JVM instantiation, job initialization....)


Have you tried NUTCH-570?  I know it doesn't break anything, but I have not been able to see its positive effects - likely because my fetch cycles are dominated by those slow servers with lots of pages and not by wait time between subsequent requests to the same server.  But I'd love to hear if others found NUTCH-570 helpful!

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch




Re: Fetching inefficiency

Siddhartha Reddy
I have observed a significant improvement after setting
generate.max.per.host to 1000. Earlier, one of my fetch jobs for a few
thousand pages went on for days because of a couple of sites that were too
slow. For the same crawl, I am now using a generate.max.per.host of 1000 and
each fetch job finishes in about 3 hours for around 30,000 pages, while the
other jobs -- generate, parse, updatedb -- take up another hour.

You are right about the additional overhead of having more generate jobs. I
am now planning to parallelize the generate jobs with fetch (by using a
numFetchers that is less than the number of map tasks available) and am
hoping that this will offset the time for the additional generates.

The cost of setting up the MapReduce jobs might in fact become significant
if I reduce generate.max.per.host even further (or it might already be
significant and I am just not noticing). I will be doing some
experimentation to find the optimum point, but the results might be too
specific to my current crawl.

On my first attempt, I could not apply the NUTCH-570 patch, so I left it for
later. Anyway, as long as I am using a small generate.max.per.host I doubt
that it would help much.

I am using NUTCH-629, but I am not sure how to measure whether it is
offering any improvement.

Best,
Siddhartha

--
http://sids.in
"If you are not having fun, you are not doing it right."

Re: Fetching inefficiency

Andrzej Białecki-2
In reply to this post by Otis Gospodnetic-2-2
[hidden email] wrote:
> Siddhartha,
>

> I think decreasing generate.max.per.host will limit the 'wait time'
> for each fetch run, but I have a feeling that the overall time will
> be roughly the same.  As a matter of fact, it may be even higher,
> because you'll have to run generate more times, and if your fetch
> jobs are too short, you will be spending more time waiting on
> MapReduce jobs (JVM instantiation, job initialization....)

That's correct in case of very short jobs. In case of longer jobs and
fetchlists consisting of many urls from the same hosts, the fetch time
will be dominated by 'wait time'.

A different point of view on the effects of generate.max.per.host is
that it gives smaller hosts a better chance to be included in a
fetchlist - otherwise fetchlists would be dominated by urls from large
hosts. So, in a sense, it helps to diversify your crawling frontier,
under the silent assumption that N pages from X hosts are more interesting
than the same N pages from a single host.
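
A minimal standalone sketch of that capping idea (not the actual Generator code; the tiny cap and example urls are purely illustrative):

import java.net.URL;
import java.util.HashMap;
import java.util.Map;

public class PerHostCap {
  public static void main(String[] args) throws Exception {
    int maxPerHost = 2; // tiny cap, just for illustration
    String[] candidates = {
      "http://big.example.com/1", "http://big.example.com/2",
      "http://big.example.com/3", "http://small.example.org/1"
    };
    Map<String, Integer> perHostCounts = new HashMap<String, Integer>();
    for (String url : candidates) {
      String host = new URL(url).getHost();
      int count = perHostCounts.containsKey(host) ? perHostCounts.get(host) : 0;
      if (count >= maxPerHost) continue;       // big hosts stop hogging the list
      perHostCounts.put(host, count + 1);
      System.out.println("selected: " + url);  // smaller hosts still get in
    }
  }
}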

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Fetching inefficiency

Otis Gospodnetic-2-2
In reply to this post by Otis Gospodnetic-2-2
Hi,


----- Original Message ----

> From: Andrzej Bialecki <[hidden email]>
> To: [hidden email]
> Sent: Wednesday, April 23, 2008 4:23:44 AM
> Subject: Re: Fetching inefficiency
>
> [hidden email] wrote:
> > Siddhartha,
> >
>
> > I think decreasing generate.max.per.host will limit the 'wait time'
> > for each fetch run, but I have a feeling that the overall time will
> > be roughly the same.  As a matter of fact, it may be even higher,
> > because you'll have to run generate more times, and if your fetch
> > jobs are too short, you will be spending more time waiting on
> > MapReduce jobs (JVM instantiation, job initialization....)
>
> That's correct in case of very short jobs. In case of longer jobs and
> fetchlists consisting of many urls from the same hosts, the fetch time
> will be dominated by 'wait time'.
>
> A different point of view on the effects of generate.max.per.host is
> that it gives a better chance to smaller hosts to be included in a
> fetchlist - otherwise fetchlists would be dominated by urls from large
> hosts. So, in a sense it helps to differentiate your crawling frontier,
> with a silent assumption that N pages from X hosts is more interesting
> than the same N pages from a single host.

Yes, yes!
I think even the above assumes that you have so many pages ready to be
fetched from large hosts that, if you let them all into the fetchlist,
there would be no room for sites with fewer pages.  That is, it assumes
-topN is being used and that N would be hit if you didn't limit
per-host URLs with generate.max.per.host.

However, there is also an "in-between" situation, where you have this
group of sites with lots of pages (some potentially slow) and sites with
fewer pages (the pages-per-host distribution has the "long tail" curve),
but all together there are not enough of them to reach -topN.

I think that in that case limiting with generate.max.per.host won't have the
nice benefit of a wider crawl-frontier host distribution... but this is
really all theoretical.  I am not actually hitting this issue.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


Re: Fetching inefficiency

Otis Gospodnetic-2-2
In reply to this post by Otis Gospodnetic-2-2
Hi,

 ----- Original Message ----

> From: Siddhartha Reddy <[hidden email]>
> To: [hidden email]
> Sent: Wednesday, April 23, 2008 12:49:07 AM
> Subject: Re: Fetching inefficiency
>
> I have observed a significant improvement after setting
> generate.max.per.host to 1000. Earlier, one of my fetch job for a few
> thousand pages went on for days because of a couple of sites that were too
> slow. For the same crawl, I am now using a generate.max.per.host of 1000 and
> each fetch job finishes in about 3hrs for around 30,000 pages while the
> other jobs -- generate, parse, updatedb -- take up another hour.
>
> You are right about the additional overhead of having more generate jobs. I
> am now planning to parallelize the generate jobs with fetch (by using
> numFetchers that is less then the number of map tasks available) and am
> hoping that it would offset the time for the additional generates.

Great.  Could you please let us know if using the recipe on
http://wiki.apache.org/nutch/FetchCycleOverlap helped and how much, roughly?

> The cost of setting up the MapReduce jobs might in fact become a significant
> one if I reduce the generate.max.per.hosts even further (or it might even be
> quite a lot and I am just not noticing.) I will be doing some
> experimentation to find the optimum point; but the results might be too
> specific to my current crawl.
>
> On my first attempt, I could not apply the NUTCH-570 patch, so I left it for
> later. Anyways, as long as I am using a small generate.max.per.host I doubt
> that it would help much.

I can send you my Generator.java if you want; it has NUTCH-570 and a few
other little changes.

> I am using NUTCH-629 but I am not sure how to measure if it is offering any
> improvements.

I think the same way you described in your first paragraph - by looking at
the total time the fetch job took to complete, or perhaps simply by
eyeballing the pages/sec rates.  The idea is that if requests to a host keep
timing out, there is no point in wasting time requesting more pages from it.
This really only pays off when hosts with lots of URLs in the fetchlist keep
timing out.  There is no point in dropping hosts with only a few URLs, as
even with timeouts those will be processed quickly.  It is the hosts with
lots of pages that keep timing out that are the problem, so that is where
you should see the greatest benefit.
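
A standalone sketch of that idea -- stop scheduling urls for a host once it has timed out several times in a row. This only illustrates the concept, not the actual NUTCH-629 patch, and the threshold is made up:

import java.util.HashMap;
import java.util.Map;

public class TimeoutSkipper {
  private static final int MAX_TIMEOUTS = 3;                 // made-up threshold
  private final Map<String, Integer> timeouts = new HashMap<String, Integer>();

  /** Should we even try this host, or has it timed out too often already? */
  boolean shouldFetch(String host) {
    Integer n = timeouts.get(host);
    return n == null || n < MAX_TIMEOUTS;
  }

  /** Call after each request so the counter reflects consecutive failures. */
  void recordResult(String host, boolean timedOut) {
    if (timedOut) {
      Integer n = timeouts.get(host);
      timeouts.put(host, n == null ? 1 : n + 1);
    } else {
      timeouts.remove(host); // a success resets the count
    }
  }

  public static void main(String[] args) {
    TimeoutSkipper skipper = new TimeoutSkipper();
    String host = "slow.example.com";
    for (int i = 0; i < 5; i++) {
      if (!skipper.shouldFetch(host)) {
        System.out.println("skipping remaining urls for " + host);
        break;
      }
      System.out.println("fetching a url from " + host + " ... timed out");
      skipper.recordResult(host, true); // pretend every request times out
    }
  }
}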

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch




Extracting Embedded Outlinks

Brian Ulicny
In reply to this post by Andrzej Białecki-2
I'm trying to extract outlinks to embedded YouTube videos encoded as
below, using a post-0.9 Nutch system.

<object width="425" height="355"><param name="movie"
value="http://www.youtube.com/v/8iYRjK2KSps&rel=1"></param><param
name="wmode" value="transparent"></param><embed
src="http://www.youtube.com/v/8iYRjK2KSps&rel=1"
type="application/x-shockwave-flash" wmode="transparent" width="425"
height="355"></embed></object>

<embed src="http://www.youtube.com/v/A1_GQ-K7P_w&amp;rel=" width="425"
height="355" type="application/x-shockwave-flash"
wmode="transparent"></embed>

I modified DOMContentUtils.java as follows:

  public void setConf(Configuration conf) {
   + System.out.println("setting linkparams conf");
    this.conf = conf;
    linkParams.clear();
    linkParams.put("a", new LinkParams("a", "href", 1));
   + linkParams.put("embed", new LinkParams("embed","source", 0));
   + linkParams.put("object", new LinkParams("object", "movie", 2));
    linkParams.put("area", new LinkParams("area", "href", 0));
    if (conf.getBoolean("parser.html.form.use_action", false)) {
      linkParams.put("form", new LinkParams("form", "action", 1));
    }
    linkParams.put("frame", new LinkParams("frame", "src", 0));
    linkParams.put("iframe", new LinkParams("iframe", "src", 0));
    linkParams.put("script", new LinkParams("script", "src", 0));
    linkParams.put("link", new LinkParams("link", "href", 0));
    linkParams.put("img", new LinkParams("img", "src", 0));
  }

But nothing happens.  These links are always ignored.  In fact, the
print statement never prints.

How can I extract these outlinks?

Brian


--
  Brian Ulicny
  bulicny at alum dot mit dot edu
  home: 781-721-5746
  fax: 360-361-5746



Re: Fetching inefficiency

Otis Gospodnetic-2-2
In reply to this post by Otis Gospodnetic-2-2
In any case, I think the end goal would be to have per-host coefficients
used when generating fetchlists.  For example:


maxPerHost = 1000
superSlowMaxPerHost = maxPerHost * 0.1
slowMaxPerHost = maxPerHost * 0.5
avgMaxPerHost = maxPerHost * 1.0
fastMaxPerHost = maxPerHost * 1.5

// host -> number of URLs already selected for the current fetchlist
perHostCounts = new HashMap<String, Integer>()

// for each candidate URL during generate:
HostDatum hd = hostdb.get(host)      // lookup in the (in-progress) HostDb
dlSpeed = hd.downloadSpeed()         // observed download speed for this host
hostCount = perHostCounts.get(host)
if (dlSpeed < 100 && hostCount > superSlowMaxPerHost)
  // we have enough URLs for this host
  return/continue
else if (dlSpeed < 200 && hostCount > slowMaxPerHost)
  // we have enough URLs for this host
  return/continue
else if
  ....

perHostCounts.put(host, hostCount + 1)
emit URL

Do others agree that the above is the goal and that it would help with
fetching efficiency by balancing the fetchlists better?  I believe the above
would help us get away from fetchlists like:
slow.com/1
slow.com/2
fast.com/1
...
slow.com/N (N is big)
fast.com/2

And get fetchlists that are more like this (only a few URLs from slow sites
and more URLs from fast sites):

slow.com/1
slow.com/2
fast.com/1
fast.com/2
fast.com/N (N is big)


I made some HostDb progress last night, though I'm unsure what
to do with hostdb.get(host) other than to load all host data into memory
in a MapReduce job and do host lookups against that.  Andrzej
provided some pointers, but reading those at 1-2 AM doesn't work...

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



RE: Extracting Embedded Outlinks

Howie Wang
In reply to this post by Brian Ulicny

Never looked at this bit of code, but for the example
you provided with <embed src="...">, shouldn't the
code be:

   + linkParams.put("embed", new LinkParams("embed","src", 0));

not

   + linkParams.put("embed", new LinkParams("embed","source", 0));

Howie



RE: Extracting Embedded Outlinks

Brian Ulicny
You are right, although that doesn't fix the problem either.

It should be 'src', not 'source' below, presumably.

I fixed it, recompiled, and ran again: still no YouTube links.

Brian

On Wed, 23 Apr 2008 17:12:09 +0000, "Howie Wang"
<[hidden email]> said:

>
> Never looked at this bit of code, but for the example
> you provided with <embed src="...">, shouldn't the
> code be:
>
>    + linkParams.put("embed", new LinkParams("embed","src", 0));
>
> not
>
>    + linkParams.put("embed", new LinkParams("embed","source", 0));
>
> Howie
--
  Brian Ulicny
  bulicny at alum dot mit dot edu
  home: 781-721-5746
  fax: 360-361-5746



Re: Fetching inefficiency

Siddhartha Reddy
In reply to this post by Otis Gospodnetic-2-2
Hi Otis,


> Great.  Could you please let us know if using the recipe on
> http://wiki.apache.org/nutch/FetchCycleOverlap helped and how much,
> roughly?
>

I am trying a slightly different strategy: I am going to run the generate
jobs in parallel with the fetch job. As for running updatedb in parallel
with the fetch job, I am not so sure -- since updatedb can take a list of
segments, wouldn't it be better to update all of them together? In any case,
I will report on any improvements I get.

> On my first attempt, I could not apply the NUTCH-570 patch, so I left it
> for
> > later. Anyways, as long as I am using a small generate.max.per.host I
> doubt
> > that it would help much.
>
> I can send you my Generator.java, if you want, it has NUTCH-570 and a few
> other
> little changes.
>

Thanks, that would really help me; can you please send it to me?

> I am using NUTCH-629 but I am not sure how to measure if it is offering
> any
> > improvements.
>
> I think the same way you described in the first paragraph - by looking at
> the
> total time it took for the fetch job to complete, or perhaps simply by
> looking at
> pg/sec rates and eyeballing.  The idea there is that if requests to a host
> keep
> timing out, there is no point in wasting time requesting more pages from
> it.
> This really only pays off if hosts with lots of URLs in the fetchlists
> time out.
> There is no point in dropping hosts with only a few URLs, as even with
> time outs
> those will be processed quickly.  It is those with lots of pages and that
> keep
> timing out that are the problem.  So you should see the greatest benefit
> in
> those cases.
>

The problem is that the URLs from the hosts on the slow servers are all
already fetched or timed out and I do not wish to hit the same URLs again.
Perhaps I can just dump the crawldb and take a look at the metadata.

Thanks,
Siddhartha


--
http://sids.in
"If you are not having fun, you are not doing it right."