[jira] [Commented] (NUTCH-2775) Fetcher to guarantee minimum delay even if robots.txt defines shorter Crawl-delay

Chris Mattmann (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-2775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066555#comment-17066555 ]

ASF GitHub Bot commented on NUTCH-2775:
---------------------------------------

sebastian-nagel commented on pull request #506: NUTCH-2775 Fetcher to guarantee minimum delay even if robots.txt defines shorter Crawl-delay
URL: https://github.com/apache/nutch/pull/506
 
 
   The guaranteed minimum delay is configured by `fetcher.min.crawl.delay`; by default it is set equal to `fetcher.server.delay`.
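
   To illustrate the intended behavior, here is a minimal sketch. It is not the code from the pull request: the class `CrawlDelaySketch` and the method `effectiveDelayMillis` are hypothetical names, and the 5-second server delay is an assumed example value. A Crawl-delay from robots.txt is honored only when it is at least the configured minimum; shorter values are raised to the minimum.

```java
public class CrawlDelaySketch {

    /**
     * Hypothetical helper (not part of Nutch): chooses the delay, in
     * milliseconds, between successive requests to the same server.
     *
     * @param robotsCrawlDelayMillis Crawl-delay from robots.txt, or a value <= 0 if none was given
     * @param serverDelayMillis      stands in for fetcher.server.delay
     * @param minCrawlDelayMillis    stands in for fetcher.min.crawl.delay
     */
    public static long effectiveDelayMillis(long robotsCrawlDelayMillis,
                                            long serverDelayMillis,
                                            long minCrawlDelayMillis) {
        if (robotsCrawlDelayMillis <= 0) {
            // robots.txt did not specify a Crawl-delay: use the configured default.
            return serverDelayMillis;
        }
        // Never go below the guaranteed minimum, even if the server asks for less.
        return Math.max(robotsCrawlDelayMillis, minCrawlDelayMillis);
    }

    public static void main(String[] args) {
        long serverDelay = 5000L;           // assumed example value for fetcher.server.delay
        long minCrawlDelay = serverDelay;   // minimum defaults to the server delay

        // "Crawl-Delay: 1" is shorter than the minimum, so it is raised.
        System.out.println(effectiveDelayMillis(1000L, serverDelay, minCrawlDelay));  // prints 5000
        // A longer Crawl-delay is respected as before.
        System.out.println(effectiveDelayMillis(10000L, serverDelay, minCrawlDelay)); // prints 10000
        // No Crawl-delay at all falls back to the server delay.
        System.out.println(effectiveDelayMillis(-1L, serverDelay, minCrawlDelay));    // prints 5000
    }
}
```

   With the minimum equal to the server delay, a robots.txt can only lengthen the delay, never shorten it; setting a lower minimum would let servers opt in to faster crawling down to that bound.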
 


> Fetcher to guarantee minimum delay even if robots.txt defines shorter Crawl-delay
> ---------------------------------------------------------------------------------
>
>                 Key: NUTCH-2775
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2775
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher, robots
>    Affects Versions: 1.16
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.17
>
>
> Fetcher uses the number of seconds defined by "fetcher.server.delay" as the delay between successive requests to the same server. The Crawl-Delay directive in robots.txt was intended to let servers request a longer delay. However, I've recently seen a server requesting "Crawl-Delay: 1". That delay is shorter than the default delay, so Nutch may indeed request one page per second. Later this server responds with "HTTP 429 Too Many Requests". Stupid. What about ignoring Crawl-Delay values shorter than the configured default delay or a configurable minimum delay?
> I've already seen the [same issue using a different crawler architecture|https://github.com/commoncrawl/news-crawl/issues/24].


