[jira] [Created] (NUTCH-2754) fetcher.max.crawl.delay ignored if exceeding 5 min. / 300 sec.

Luís Filipe Nassif (Jira)
Sebastian Nagel created NUTCH-2754:
--------------------------------------

             Summary: fetcher.max.crawl.delay ignored if exceeding 5 min. / 300 sec.
                 Key: NUTCH-2754
                 URL: https://issues.apache.org/jira/browse/NUTCH-2754
             Project: Nutch
          Issue Type: Bug
          Components: fetcher, robots
    Affects Versions: 1.16
            Reporter: Sebastian Nagel
             Fix For: 1.17


Sites specifying a Crawl-Delay of more than 5 minutes (301 seconds or more) are always ignored, even if fetcher.max.crawl.delay is set to a higher value.

We need to pass the configured value of fetcher.max.crawl.delay to [crawler-commons' robots.txt parser|https://github.com/crawler-commons/crawler-commons/blob/c9c0ac6eda91b13d534e69f6da3fd15065414fb0/src/main/java/crawlercommons/robots/SimpleRobotRulesParser.java#L78]; otherwise the parser falls back to its internal default of 300 sec. and disallows all sites that specify a longer Crawl-Delay in their robots.txt.
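
The effect can be illustrated with a small self-contained model. This is only a sketch: the class CrawlDelayDemo and its helpers isAllowed and parserLimitMs are hypothetical names, not part of Nutch or crawler-commons, and it assumes fetcher.max.crawl.delay is given in seconds while the parser limit is in milliseconds.

```java
public class CrawlDelayDemo {
    // Models the parser's internal default of 300 sec., in milliseconds.
    static final long DEFAULT_MAX_CRAWL_DELAY_MS = 300_000L;

    // Hypothetical helper: a site stays fetchable only if its Crawl-Delay
    // does not exceed the limit the parser was given.
    static boolean isAllowed(long crawlDelayMs, long parserMaxCrawlDelayMs) {
        return crawlDelayMs <= parserMaxCrawlDelayMs;
    }

    // Hypothetical helper: converts a fetcher.max.crawl.delay value
    // (assumed to be in seconds) into the parser limit in milliseconds.
    static long parserLimitMs(int fetcherMaxCrawlDelaySec) {
        return fetcherMaxCrawlDelaySec * 1000L;
    }

    public static void main(String[] args) {
        long siteDelayMs = 301_000L;            // robots.txt: Crawl-Delay: 301
        long configuredMs = parserLimitMs(600); // fetcher.max.crawl.delay = 600

        // Limit not passed on: the parser's default applies, site is disallowed.
        System.out.println(isAllowed(siteDelayMs, DEFAULT_MAX_CRAWL_DELAY_MS));
        // Limit passed on: the configured value applies, site stays fetchable.
        System.out.println(isAllowed(siteDelayMs, configuredMs));
    }
}
```

In this model, a Crawl-Delay of 301 sec. is rejected under the default limit and accepted once the configured 600 sec. limit reaches the parser, which is the behavior the fix aims for.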





--
This message was sent by Atlassian Jira
(v8.3.4#803005)