100 fetches per second?


100 fetches per second?

Mark Kerzner
Hi, guys,

my goal is to do my crawls at 100 fetches per second while, of course, observing polite crawling. But when the URLs are all on different domains, what, theoretically, would stop software from downloading from 100 domains at once and achieving the desired speed?
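
Concretely, here is the kind of thing I have in mind, as a minimal hypothetical Java sketch (not Nutch code; the class name and delay constant are made up): one lock per host serializes requests to the same host, while distinct hosts proceed in parallel across the thread pool.

import java.net.URI;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch, not Nutch code: politeness is enforced per host, so
// N worker threads can keep N distinct hosts busy at the same time.
public class PoliteParallelFetcher {
    private static final long CRAWL_DELAY_MS = 1000;  // made-up per-host delay
    private final Map<String, Long> nextAllowed = new ConcurrentHashMap<String, Long>();

    public void fetchAll(List<String> urls, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (final String url : urls) {
            pool.submit(new Runnable() {
                public void run() { fetchPolitely(url); }
            });
        }
        pool.shutdown();
    }

    private void fetchPolitely(String url) {
        String host = URI.create(url).getHost();
        if (host == null) return;
        synchronized (host.intern()) {            // serializes same-host requests only
            Long next = nextAllowed.get(host);
            long wait = (next == null ? 0L : next) - System.currentTimeMillis();
            if (wait > 0) {
                try { Thread.sleep(wait); } catch (InterruptedException e) { return; }
            }
            nextAllowed.put(host, System.currentTimeMillis() + CRAWL_DELAY_MS);
        }
        download(url);                            // actual HTTP fetch elided
    }

    private void download(String url) { /* ... */ }
}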

But whatever I do, I can't make Nutch crawl at that speed. Even if it starts at a few dozen URLs/second, it slows down towards the end (as discussed by many, including Krugler).

Should I write something of my own, or are there fast crawlers?

Thanks!

Mark

Re: 100 fetches per second?

Dennis Kubes-2
Hi Mark,

I just put this up on the wiki.  Hope it helps:

http://wiki.apache.org/nutch/OptimizingCrawls

Dennis



Re: 100 fetches per second?

Mark Kerzner
Dennis, that's awesomely interesting. Thank you,

Mark


Re: 100 fetches per second?

Julien Nioche-4
In reply to this post by Mark Kerzner
Hi Mark,

I've recently contributed two patches on JIRA (NUTCH-769 / NUTCH-770) which should have an impact on crawling speed and help with the fetch rate slowing down. There is also https://issues.apache.org/jira/browse/NUTCH-753, which should help to a lesser extent.

Julien

--
DigitalPebble Ltd
http://www.digitalpebble.com


Re: 100 fetches per second?

MilleBii
In reply to this post by Dennis Kubes-2
Why would local DNS caching work? It only helps if you crawl the same site often, in which case you are hit by politeness anyway.

If your segments contain only (or mainly) different sites, it is not really going to help.

So far I have not seen my quad-core + 100 Mb/s + pseudo-distributed Hadoop setup go faster than 10 fetches/s... Let me check the DNS and I will tell you.

I vote for 100 fetches/s, though I'm not sure how to get it.




--
Sent from my mobile

-MilleBii-

Re: 100 fetches per second?

Mark Kerzner
I may be awfully wrong on that, but below is my plan for super-fast
crawling. I have prepared it for a venture that does not need it anymore,
but it looks like fun to do anyway. What would you all say: is there a need,
and what's wrong with the plan?

Thank you,
Mark

Fast Crawl Plan
=========

The goal of Nutch is an exhaustive crawl. It works best for internal sites or intranets and has known problems with wide-web search. It is optimized for correctness, and since it is also an open-source engine for anyone to use, its polite crawling is hard to mess up; but it is not optimized for performance.

I also see another area that slows it down: it uses a database. This makes it easy to program, scale, and operate, but it does not make it a fast runner. Fast applications don't use databases.

Therefore, I would write my own crawler, optimized for performance. Here is
what my approach would be:

   - I would look at the Nutch code for snippets, for example Fetcher.java, so
   as not to reinvent the wheel;
   - Having made the individual in-thread performance reasonably fast, I would
   do the following optimization steps;
   - Use a fast mechanism for real-time thread coordination: not a database,
   but JavaSpaces (the free GigaSpaces implementation);
   - Prepare URLs for simultaneous fetching from different domains in
   different threads, and for more-or-less polite crawling within a domain;
   - Build in blocking detection (see the sketch after this list). Today we
   don't even know when and whether we are blocked, and blocking can cause
   time-outs;
   - Do it on one crawler for starters, but keep in mind that the code should
   later scale to a Hadoop cluster.
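
For the blocking-detection item, a rough probe could look like this (a hypothetical sketch, not Nutch code; the status-code heuristics are assumptions):

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Crude heuristic check for whether a host appears to be blocking us.
public class BlockProbe {
    public static boolean looksBlocked(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(10000);
            conn.setReadTimeout(10000);
            conn.setRequestMethod("HEAD");
            int code = conn.getResponseCode();
            // 403 (forbidden) and 503 (service unavailable) are common soft-block responses
            return code == 403 || code == 503;
        } catch (IOException e) {
            // repeated timeouts on a previously responsive host are another blocking signal
            return true;
        }
    }
}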

Mark


Re: 100 fetches per second?

MilleBii
In reply to this post by MilleBii
So I do indeed have a local DNS server running; it does not seem to help much.

I just finished a run of 80K URLs. At the beginning the speed can be around 15 fetches/s, and at the end, due to long-tail effects, I get below 1 fetch/s... The average over a 12h34 run: 1.7 fetches/s, pretty slow really.

I limit URLs per site to 1000 to limit the long-tail effect... However, I have half a dozen sites which I need to index and which have around 60k URLs each, so it means I will need 60 runs to get them indexed. So I would like to increase the limit, and with it the long-tail effect. Interesting dilemma.




--
-MilleBii-

Re: 100 fetches per second?

MilleBii
I looked at the bandwidth profile of my last two runs and they have the same shape: it starts at 5 MBytes/s and decreases to below 500 kBytes/s, with a 1/x kind of curve shape.

Fairly even distribution of URLs, and the local DNS is running...

I can't find a good explanation for this behavior... It looks to me like the fetcher is a lot less effective once the fetch queue is full??? I use 100 threads.


--
Sent from my mobile

-MilleBii-

Re: 100 fetches per second?

Dennis Kubes-2
In reply to this post by MilleBii
It is not about local DNS caching as much as having local DNS servers. Too many fetchers hitting a centralized DNS server can act as a DoS attack and slow down the entire fetching system.

For example, say I have a single centralized DNS server for my network, and say I have 2 map tasks per machine, 50 machines, and 20 threads per task. That would be 50 * 2 * 20 = 2000 fetchers, meaning a possibility of 2000 DNS requests/sec. Most local DNS servers for smaller networks can't handle that. If everything is hitting a centralized DNS, and that DNS takes 1-3 sec per request because of too many requests, the entire fetching system stalls.

Hitting a secondary larger cache, such as OpenDNS, can have an effect because you are making one hop to get the name versus multiple hops to the root servers and then the domain servers.

Working off of a single server these issues don't show up as much because there aren't enough fetchers.
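
One JVM-side knob worth knowing about (a small sketch; the values are arbitrary): the JVM also caches lookups itself, and the cache lifetimes can be tuned through the standard java.security properties, set before the first name lookup in the process:

import java.security.Security;

// JVM-side DNS cache tuning; these are standard java.security properties and
// must be set before the first name lookup in the process.
public class DnsCacheConfig {
    public static void main(String[] args) {
        Security.setProperty("networkaddress.cache.ttl", "3600");        // cache hits for 1 hour
        Security.setProperty("networkaddress.cache.negative.ttl", "10"); // don't cache failures long
    }
}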

Dennis Kubes


Re: 100 fetches per second?

MilleBii
I get your point... although I thought a high number of threads would do exactly the same. Maybe I'm missing something.

During my fetcher runs the used bandwidth gets low pretty quickly, disk I/O is low, the CPU is low... So it must be waiting for something, but what?

It could be that the DNS cache is full and any new request gets forwarded to my ISP's master DNS. Any idea how to check that? I'm not familiar with BIND myself... What is the typical rate you can get, how many DNS requests/s?
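
For a first sanity check of resolver latency as the crawler sees it, something like this throwaway sketch (hostnames are placeholders) would do:

import java.net.InetAddress;
import java.net.UnknownHostException;

// Throwaway resolver-latency check; hostnames are placeholders.
public class DnsTimer {
    public static void main(String[] args) {
        String[] hosts = { "example.com", "example.org", "example.net" };
        for (String h : hosts) {
            long t0 = System.nanoTime();
            try {
                InetAddress.getByName(h); // first lookup hits the resolver; repeats hit the JVM cache
            } catch (UnknownHostException e) {
                System.out.println(h + ": lookup failed");
                continue;
            }
            System.out.println(h + ": " + (System.nanoTime() - t0) / 1000000 + " ms");
        }
    }
}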




--
Sent from my mobile

-MilleBii-

Re: 100 fetches per second?

Dennis Kubes-2
If it is waiting and the box is idle, my first thought is not DNS. I just put that up as one of the things people will run into. Most likely it is an uneven distribution of URLs or something like that.

Dennis


Re: 100 fetches per second?

Julien Nioche-4
Or is it stuck on a couple of hosts which time out? The logs should have a trace with the number of active threads, which should give some indication of what's happening.

Julien




--
DigitalPebble Ltd
http://www.digitalpebble.com

Re: 100 fetches per second?

MilleBii
The logs show that my fetch queue is full and my 100 threads are mostly spin-waiting towards the end.

In the very last run (150k URLs) I can clearly see 4 phases:
+ very high speed: 3 MB/s for a few minutes
+ a sudden drop to around 1 MB/s, flat for several hours
+ another drop to around 400 kB/s for several hours
+ another drop to around 200 kB/s for a few hours too.

So probably it is just a consequence of the URL mix, which isn't that good. Note: I limit to 1000 URLs per host, and there are about 20-30 hosts in the mix which get limited that way.

Maybe a better mix of URLs is possible?




--
-MilleBii-

Re: 100 fetches per second?

Mark Kerzner
Judging by how this discussion is going, there may be a need for a URL-mix optimizer and for a fast crawler based on it. Is this something worth pursuing? MilleBii, what do you think?

Mark


Re: 100 fetches per second?

MilleBii
I have to say that I'm still puzzled. Here is the latest: I just restarted a run, and guess what?

I got ultra-high speed: 8 Mbit/s sustained for 1 hour, where I could only get 3 Mbit/s max before (note: bits, not bytes, as I said earlier). A few samples show that I was running at 50 fetches/sec... not bad. But why this run went so fast, I haven't the faintest idea.

Then it drops and I get this kind of log:

2009-11-25 23:28:28,584 INFO  fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=516
2009-11-25 23:28:29,227 INFO  fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=120
2009-11-25 23:28:29,584 INFO  fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=516
2009-11-25 23:28:30,227 INFO  fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=120
2009-11-25 23:28:30,585 INFO  fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=516

I don't fully understand why it is oscillating between two queue sizes, never mind... but it is likely the end of the run, since Hadoop shows 99.99% complete for the 2 maps it generated.

Would that be explained by a better URL mix????




--
-MilleBii-

Re: 100 fetches per second?

Andrzej Białecki-2

I suspect that you have a bunch of hosts that slowly trickle the content, i.e. requests don't time out, crawl-delay is low, but the download speed is very, very low due to limits at their end (either physical or artificial).

The solution in that case would be to track a minimum average speed per FetchQueue, and lock out the queue if this number drops below the threshold (similarly to what we do when we discover a crawl-delay that is too high).

In the meantime, you could add the number of FetchQueue-s to that
diagnostic output, to see how many unique hosts are in the current
working set.
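
The bookkeeping could be as simple as this rough sketch (hypothetical, not the actual Fetcher code; the threshold and warm-up values are arbitrary):

import java.util.concurrent.atomic.AtomicLong;

// Hypothetical per-queue bookkeeping; threshold and warm-up values are
// arbitrary, and this is not the actual Fetcher code.
public class QueueSpeedTracker {
    private static final long MIN_BYTES_PER_SEC = 2000;
    private static final long WARMUP_MS = 30000;   // don't judge a queue too early

    private final long startMs = System.currentTimeMillis();
    private final AtomicLong bytesFetched = new AtomicLong();

    public void recordFetch(long bytes) {
        bytesFetched.addAndGet(bytes);
    }

    // True if this host trickles content so slowly that its queue should be locked out.
    public boolean shouldLockOut() {
        long elapsedMs = System.currentTimeMillis() - startMs;
        if (elapsedMs < WARMUP_MS) return false;
        return bytesFetched.get() * 1000 / elapsedMs < MIN_BYTES_PER_SEC;
    }
}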

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: 100 fetches per second?

Dennis Kubes-2
One interesting thing we were seeing a while back on large crawls, where we fetched the best-scoring pages first, then the next best, and so on, is that lower-scoring pages typically had worse response times and worse timeout rates.

So while the best-scoring pages would respond very quickly and have a < 1% timeout rate, the worst-scoring pages would take x times as long (I don't remember the exact ratio, but it was multiples) and could have as high as a 50% timeout rate. Just something to think about.

Dennis Kubes


Re: 100 fetches per second?

MilleBii
In reply to this post by Andrzej Białecki-2
I did not think of that one... interesting.

How and where do you control the number of FetchQueues? I only use the default, so I assume there is only one.

How should I go about analyzing the content of a generated fetchlist?

Is it possible to increase the number of fetchers in a single-node configuration? If not, I may turn to a configuration with two low-spec servers instead of one mid-range one... I will get more for my bucks ;-)



--
-MilleBii-

Re: 100 fetches per second?

MilleBii
In reply to this post by Dennis Kubes-2
Dennis,

Interesting info. I don't use the standard OPIC scorer but a slightly modified version which boosts pages with content that I'm looking for... so it could be that my pages are generally on slow servers.

Now a heads-up: I just started a new run with 450k URLs and it looks like I'm back to the previous behaviour:
+ 4 Mb/s for a few minutes
+ a steady 1.9 Mb/s... probably for ages, since it really means around 10-15 fetches/s

Why did the previous run go so fast???? I'm still wondering.



--
-MilleBii-

Re: 100 fetches per second?

Otis Gospodnetic-2-2
In reply to this post by Andrzej Białecki-2
I think that, in the end, what Ken Krugler did with Bixo (limiting crawl time) and what Julien added in https://issues.apache.org/jira/browse/NUTCH-770 (plus https://issues.apache.org/jira/browse/NUTCH-769) are solutions to this problem, in addition to what Andrzej described earlier.

Can you try https://issues.apache.org/jira/browse/NUTCH-770 and https://issues.apache.org/jira/browse/NUTCH-769?

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR


