[jira] [Commented] (NUTCH-2630) Fetcher to log skipped records by robots.txt

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16641806#comment-16641806 ]

ASF GitHub Bot commented on NUTCH-2630:
---------------------------------------

sebastian-nagel opened a new pull request #387: NUTCH-2630 Fetcher to log skipped records by robots.txt
URL: https://github.com/apache/nutch/pull/387
 
 
   Change the log level required for messages that report skipped URLs to INFO (the default). This covers URLs skipped because of robots.txt rules: either the URL is disallowed, or the crawl delay is larger than fetcher.max.crawl.delay.
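
   The two skip conditions named above can be sketched roughly as follows. This is a minimal illustration, not the code from the pull request: the class RobotsSkipSketch, its method names, and the message wording are hypothetical. It also assumes that delays are given in seconds, that a negative maximum means "no limit", and it uses java.util.logging only to stay self-contained.

```java
import java.util.Optional;
import java.util.logging.Logger;

// Hypothetical sketch: not the actual Nutch fetcher code.
final class RobotsSkipSketch {
    private static final Logger LOG =
        Logger.getLogger(RobotsSkipSketch.class.getName());

    /**
     * Returns the reason a URL would be skipped, or an empty Optional if it
     * may be fetched. Delays are in seconds (assumption); a negative
     * maxCrawlDelay is treated as "no limit" (assumption).
     */
    static Optional<String> skipReason(boolean allowedByRobots,
                                       long crawlDelay,
                                       long maxCrawlDelay) {
        if (!allowedByRobots) {
            return Optional.of("disallowed by robots.txt");
        }
        if (maxCrawlDelay >= 0 && crawlDelay > maxCrawlDelay) {
            return Optional.of("robots.txt crawl delay " + crawlDelay
                + " s exceeds maximum " + maxCrawlDelay + " s");
        }
        return Optional.empty();
    }

    /** Logs skipped URLs at INFO so they show up with default logging. */
    static boolean shouldFetch(String url, boolean allowedByRobots,
                               long crawlDelay, long maxCrawlDelay) {
        Optional<String> reason =
            skipReason(allowedByRobots, crawlDelay, maxCrawlDelay);
        reason.ifPresent(r -> LOG.info("Skipping " + url + ": " + r));
        return !reason.isPresent();
    }

    public static void main(String[] args) {
        // Example URLs are placeholders.
        shouldFetch("http://example.com/private", false, 0, 30);
        shouldFetch("http://example.com/slow", true, 120, 30);
        shouldFetch("http://example.com/ok", true, 5, 30);
    }
}
```

   Logging at INFO rather than at a more verbose level means the skipped URLs appear in the log without any change to the logging configuration.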
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


> Fetcher to log skipped records by robots.txt
> --------------------------------------------
>
>                 Key: NUTCH-2630
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2630
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.15
>            Reporter: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.16
>
>
> To analyze problems, it would be helpful if the fetcher logged URLs that are disallowed by robots.txt - see [discussion on user mailing list|https://lists.apache.org/thread.html/7fe5b02104ea866aba183d009a5fad59ad4e4daf8954593ef0123dd6@%3Cuser.nutch.apache.org%3E].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)