[jira] Closed: (NUTCH-56) Crawling sites with 403 Forbidden robots.txt

Ayush Saxena (Jira)
     [ http://issues.apache.org/jira/browse/NUTCH-56?page=all ]
Andrzej Bialecki closed NUTCH-56:

    Resolution: Fixed

Applied. I renamed the property to follow the existing "http.robots.*" hierarchy.


> Crawling sites with 403 Forbidden robots.txt
> --------------------------------------------
>          Key: NUTCH-56
>          URL: http://issues.apache.org/jira/browse/NUTCH-56
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Reporter: Andy Liu
>     Priority: Minor
>  Attachments: robots_403.patch
> If a 403 error is encountered when trying to access the robots.txt file, Nutch does not crawl any pages from that site.  This behavior is consistent with the RFC recommendation for the robot exclusion protocol.  
> However, Google does crawl sites that exhibit this type of behavior, because most webmasters of these sites are unaware of robots.txt conventions and do want their site to be crawled.
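The behavior described above can be sketched as a small decision function: a 404 on robots.txt conventionally means the site may be crawled, while a 403 traditionally blocks it unless a configuration switch (here named `allowForbidden`, an illustrative stand-in for the "http.robots.*" property mentioned in the resolution; the actual property name is not given in this message) overrides that. This is a minimal sketch of the decision logic, not the actual Nutch implementation.

```java
/**
 * Hedged sketch of robots.txt access-decision logic for the case
 * discussed in NUTCH-56. Method and field names are illustrative,
 * not taken from the Nutch codebase.
 */
public class RobotsDecisionSketch {

    /**
     * Decide whether a site may be crawled, given the HTTP status code
     * returned when fetching its robots.txt.
     *
     * @param robotsTxtStatus HTTP status from the robots.txt request
     * @param allowForbidden  config switch: treat 403 as "crawl allowed"
     *                        (the behavior this issue's patch enables)
     */
    public static boolean isCrawlAllowed(int robotsTxtStatus, boolean allowForbidden) {
        switch (robotsTxtStatus) {
            case 200:
                // robots.txt fetched; per-path rules would be parsed elsewhere
                return true;
            case 404:
                // no robots.txt: conventionally, crawling is allowed
                return true;
            case 403:
                // forbidden: historically treated as "do not crawl";
                // the patch makes this configurable
                return allowForbidden;
            default:
                // other errors (e.g. 5xx): be conservative and skip the site
                return false;
        }
    }
}
```

With the switch off, a 403 keeps the pre-patch behavior (site skipped); with it on, the site is crawled, matching the Google-like behavior the reporter describes.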

This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators:
For more information on JIRA, see: