Crawl-Delay in robots.txt and fetcher.threads.per.queue config property.

S.L
Hello All

If I set the fetcher.threads.per.queue property to more than 1, I believe Nutch will run that many threads per host. In that case, would Nutch still respect the Crawl-Delay directive in robots.txt and not crawl at a faster pace than what is specified there?

In short, what I am asking is whether setting fetcher.threads.per.queue to 1 is required for being as polite as the Crawl-Delay in robots.txt expects.

Thx

Re: Crawl-Delay in robots.txt and fetcher.threads.per.queue config property.

Julien Nioche-4
> If I set fetcher.threads.per.queue property to more than 1, I believe the
> behavior would be to have that many threads per host from Nutch,
> in that case would Nutch still respect the Crawl-Delay directive in
> robots.txt and not crawl at a faster pace than what is specified in
> robots.txt.
>
> In short what I am trying to ask is if setting fetcher.threads.per.queue
> to 1 is required for being as polite as Crawl-Delay in robots.txt expects?

With more than 1 thread per queue, Nutch ignores any Crawl-Delay obtained
from robots.txt (see
https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java#L317)
and uses the fetcher.server.min.delay setting instead, which has a default
value of 0. So yes, setting fetcher.threads.per.queue to 1 is required for
being as polite as Crawl-Delay in robots.txt expects.
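
The delay selection described above can be illustrated with a small sketch. This is a simplified model under stated assumptions, not the Nutch source: the class QueueDelaySketch, the nested HostQueue type, and its field and method names are all hypothetical, and the real fetcher tracks considerably more state. It only shows the rule from this reply: with one thread per queue the per-queue crawl delay (which robots.txt may set) applies, and with more than one thread the minimum delay is used instead.

```java
public class QueueDelaySketch {

    /** Hypothetical, simplified stand-in for a per-host fetch queue. */
    public static final class HostQueue {
        private final int maxThreads;        // models fetcher.threads.per.queue
        private final long minCrawlDelayMs;  // models fetcher.server.min.delay
        private final long crawlDelayMs;     // per-queue delay, e.g. taken from robots.txt Crawl-Delay

        public HostQueue(int maxThreads, long minCrawlDelayMs, long crawlDelayMs) {
            this.maxThreads = maxThreads;
            this.minCrawlDelayMs = minCrawlDelayMs;
            this.crawlDelayMs = crawlDelayMs;
        }

        /** Earliest time (ms) at which the next fetch for this host may start. */
        public long nextFetchTime(long lastFetchEndMs) {
            // More than one thread per queue: the robots.txt delay is not consulted,
            // only the minimum delay is applied.
            long delay = maxThreads > 1 ? minCrawlDelayMs : crawlDelayMs;
            return lastFetchEndMs + delay;
        }
    }

    public static void main(String[] args) {
        long robotsCrawlDelayMs = 10_000L; // assumed "Crawl-Delay: 10" in robots.txt
        long minDelayMs = 0L;              // default of fetcher.server.min.delay per this thread

        HostQueue single = new HostQueue(1, minDelayMs, robotsCrawlDelayMs);
        HostQueue multi = new HostQueue(3, minDelayMs, robotsCrawlDelayMs);

        System.out.println(single.nextFetchTime(0L)); // prints 10000
        System.out.println(multi.nextFetchTime(0L));  // prints 0
    }
}
```

In this model, raising the thread count to 3 lets the next request start immediately after the previous one ends, which is why a value of 1 is needed to honor Crawl-Delay.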

HTH

Julien

--

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
Re: Crawl-Delay in robots.txt and fetcher.threads.per.queue config property.

mak
Perfect, thank you Julien!


On Thu, Jun 26, 2014 at 10:21 AM, Julien Nioche wrote:

> Using more than 1 thread per queue will ignore any crawl-delay obtained
> from robots.txt (see
> https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java#L317)
> and use the fetcher.server.min.delay configuration which has a default
> value of 0. So yes, setting fetcher.threads.per.queue to 1 is required for
> being as polite as Crawl-Delay in robots.txt expects.