Nutch - Focused crawling

zzeran
Hi,

We've been using Nutch for focused crawling (right now we are crawling about
50 domains).

We've encountered the long-tail problem. We've set TopN to 100,000 and
generate.max.per.host to about 1500.

90% of all domains finish fetching after 30 minutes, and the other 10% take
an additional 2.5 hours, which makes the slowest domain the bottleneck of the
entire fetch process.
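
To illustrate why a single slow domain can hold up the whole round, here is a rough Python sketch (it is not Nutch code). It assumes each host is fetched one request at a time with a fixed politeness delay. The delay and response times are made-up values, chosen only so that the totals land near the 30-minute and 3-hour figures mentioned above.

```python
def host_fetch_minutes(url_count, delay_s, response_s=0.0):
    """Minutes needed to fetch url_count pages from one host, one at a time."""
    return url_count * (delay_s + response_s) / 60.0


def round_minutes(hosts):
    """A fetch round only ends when its slowest host queue is drained."""
    return max(host_fetch_minutes(*host) for host in hosts)


# Hypothetical mix (assumed values): 45 responsive hosts and 5 slow ones,
# each capped at 1500 URLs as with generate.max.per.host.
FAST = (1500, 1.0, 0.2)  # 1.0 s delay + 0.2 s response per page
SLOW = (1500, 5.0, 2.2)  # 5.0 s delay + 2.2 s response per page
HOSTS = [FAST] * 45 + [SLOW] * 5

if __name__ == "__main__":
    print("fast host minutes:", host_fetch_minutes(*FAST))
    print("slow host minutes:", host_fetch_minutes(*SLOW))
    print("whole round minutes:", round_minutes(HOSTS))
```

Under these assumptions the round length equals the slowest host's time, no matter how many fast hosts there are.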

I've read Ken Krugler's post, which describes the same problem:
http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/

Does anyone have a suggestion for the best way to tackle this issue?

I think Ken suggested limiting the fetch time, for example
"terminate after 1 hour, even if you are not done yet". Is that feature
available in Nutch?
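
The "terminate after 1 hour" idea can be sketched roughly as follows. This is a generic Python illustration, not Nutch's Fetcher. The function name, the `fetch` callback and the injectable `clock` parameter are all hypothetical and exist only to keep the example self-contained.

```python
import time


def fetch_with_time_limit(urls, fetch, limit_s, clock=time.monotonic):
    """Fetch urls in order until the list is exhausted or limit_s has elapsed.

    Returns (results, skipped): the pages fetched so far and the URLs that
    were abandoned and can be picked up again in a later round.
    """
    deadline = clock() + limit_s
    results = []
    for i, url in enumerate(urls):
        if clock() >= deadline:
            # Time is up: stop here instead of waiting for the slow tail.
            return results, list(urls[i:])
        results.append(fetch(url))
    return results, []
```

The skipped URLs are returned rather than discarded, on the assumption that a later generate/fetch cycle would pick them up again.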

I will be happy to try and contribute code if required!

Thanks,
Eran

Re: Nutch - Focused crawling

Julien Nioche-4
Hi Eran,

There is currently no time limit implemented in the Fetcher. We implemented
one ourselves, and it worked quite well in combination with another mechanism
that clears the URLs from a pool once more than x successive exceptions have
been encountered. This limits the impact of sites or domains that are not
responsive.
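
As a rough illustration of the second mechanism, the Python sketch below clears a host's pending URLs once more than a configured number of successive exceptions has occurred. It is not the code from our patches. The class, the method names and the `max_exceptions` parameter are hypothetical.

```python
from collections import deque


class HostQueue:
    """Pending URLs for one host plus a count of successive failures."""

    def __init__(self, urls, max_exceptions):
        self.pending = deque(urls)
        self.max_exceptions = max_exceptions
        self.consecutive_errors = 0

    def record_success(self):
        # Any successful fetch resets the failure streak.
        self.consecutive_errors = 0

    def record_exception(self):
        """Count a failure and clear the queue once the limit is exceeded.

        Returns the number of URLs dropped (0 while within the limit).
        """
        self.consecutive_errors += 1
        if self.consecutive_errors > self.max_exceptions:
            dropped = len(self.pending)
            self.pending.clear()
            return dropped
        return 0


def drain(queue, fetch):
    """Fetch until the queue is empty, either by completion or by a purge."""
    fetched = []
    while queue.pending:
        url = queue.pending.popleft()
        try:
            fetched.append(fetch(url))
            queue.record_success()
        except Exception:
            queue.record_exception()
    return fetched
```

Resetting the counter on every success means that only an unbroken run of failures triggers the purge, so a host with occasional errors keeps its queue.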

I might try to submit a patch if I find the time next week. Our code has
been heavily modified by earlier patches that have not yet been committed
to the trunk (NUTCH-753 / NUTCH-719 / NUTCH-658), so I'd need to spend a bit
of time extracting this specific functionality from the rest.

Best,

Julien
--
DigitalPebble Ltd
http://www.digitalpebble.com


2009/11/21 Eran Zinman <[hidden email]>

> I think that Ken suggested to limit the fetch time - for example say
> "terminate after 1 hour, even if you are not done yet", is that feature
> available in Nutch?

Re: Nutch - Focused crawling

Julien Nioche-4
Hi guys,

I've split the two features into separate patches on JIRA (NUTCH-769 /
NUTCH-770).

Julien
--
DigitalPebble Ltd
http://www.digitalpebble.com

2009/11/21 Julien Nioche <[hidden email]>

> I might try and submit a patch if I find the time next week, our code has
> been heavily modified with the previous patches which have not been
> committed to the trunk yet (NUTCH-753 / NUTCH-719 / NUTCH-658) so I'd need
> to spend a bit of time extracting this specific functionality from the rest.

Re: Nutch - Focused crawling

zzeran
Thanks Julien,

I can confirm that the patch works perfectly and does a good job of keeping
the crawl rate high.

We have doubled the rate of information retrieval by using a time limit on
the fetch queue.

Thanks,
Eran

On Mon, Nov 23, 2009 at 1:28 PM, Julien Nioche <[hidden email]> wrote:

> Hi guys,
>
> I've separated both functionalities into separate patches on JIRA
> (NUTCH-769 / NUTCH-770).