[jira] Created: (NUTCH-98) RobotRulesParser interprets robots.txt incorrectly

RobotRulesParser interprets robots.txt incorrectly
--------------------------------------------------

         Key: NUTCH-98
         URL: http://issues.apache.org/jira/browse/NUTCH-98
     Project: Nutch
        Type: Bug
  Components: fetcher  
    Versions: 0.7    
    Reporter: Jeff Bowden
    Priority: Minor


Here's a simple example that the current RobotRulesParser gets wrong:

User-agent: *
Disallow: /
Allow: /rss


The problem is that the isAllowed function takes the first rule that matches and incorrectly decides that URLs starting with "/rss" are Disallowed.  The correct algorithm is to take the *longest* rule that matches.  I will attach a patch that fixes this.
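For what it's worth, the longest-match strategy can be sketched as follows (hypothetical names, not the actual patch — the real RobotRulesParser stores its rules differently):

```java
import java.util.List;

// Hypothetical rule: a path prefix plus whether it allows or disallows.
record Rule(String prefix, boolean allow) {}

class LongestMatch {
    // Among all rules whose prefix matches the path, obey the one with
    // the longest prefix. If no rule matches, the path is allowed.
    static boolean isAllowed(List<Rule> rules, String path) {
        Rule best = null;
        for (Rule r : rules) {
            if (path.startsWith(r.prefix())
                    && (best == null || r.prefix().length() > best.prefix().length())) {
                best = r;
            }
        }
        return best == null || best.allow();
    }
}
```

With the rules above (Disallow: / and Allow: /rss), a path like /rss/feed.xml matches both prefixes; /rss is the longer one, so the URL is allowed.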

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Updated: (NUTCH-98) RobotRulesParser interprets robots.txt incorrectly

     [ http://issues.apache.org/jira/browse/NUTCH-98?page=all ]

Jeff Bowden updated NUTCH-98:
-----------------------------

    Attachment: RobotRulesParser.java.diff

Patch to fix interpretation of robots.txt



[jira] Commented: (NUTCH-98) RobotRulesParser interprets robots.txt incorrectly

    [ http://issues.apache.org/jira/browse/NUTCH-98?page=comments#action_12330858 ]

Doug Cutting commented on NUTCH-98:
-----------------------------------

Where is there a specification of robots.txt that defines how 'allow' and 'disallow' lines interact?  I can't even find anything that specifies the semantics of 'allow' lines at all!



[jira] Commented: (NUTCH-98) RobotRulesParser interprets robots.txt incorrectly

    [ http://issues.apache.org/jira/browse/NUTCH-98?page=comments#action_12330867 ]

Jeff Bowden commented on NUTCH-98:
----------------------------------

OK, so actually I'm wrong on two counts.

1. The current accepted standard does not have Allow lines.

2. The draft standard does (http://www.robotstxt.org/wc/norobots-rfc.html), but it specifies that the robot should take the first match found, which is Nutch's current implementation.

Under first-match semantics, any rule whose path extends an earlier rule's prefix is rendered completely ineffective, since the earlier rule always matches first. My patch was motivated by what I thought was the obvious interpretation, given examples I've seen in the field. The initial example I gave is from http://del.icio.us/robots.txt


[jira] Commented: (NUTCH-98) RobotRulesParser interprets robots.txt incorrectly

    [ http://issues.apache.org/jira/browse/NUTCH-98?page=comments#action_12359237 ]

Rod Taylor commented on NUTCH-98:
---------------------------------

According to the Googlebot FAQ, their implementation obeys the longest matching rule.

See point 7 of http://www.google.com/webmasters/bot.html, quoted here:

Also, there's a small difference between the way Googlebot handles the robots.txt file and the way the robots.txt standard says we should (keeping in mind the distinction between "should" and "must"). The standard says we should obey the first applicable rule, whereas Googlebot obeys the longest (that is, the most specific) applicable rule. This more intuitive practice matches what people actually do, and what they expect us to do. For example, consider the following robots.txt file:

User-Agent: *
Allow: /
Disallow: /cgi-bin
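On that example the two strategies diverge: first-match hits Allow: / immediately and never reaches the Disallow line, while longest-match prefers /cgi-bin for paths underneath it. A self-contained comparison (hypothetical names, assuming simple prefix rules):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.List;

// Each entry maps a path prefix to true (Allow) or false (Disallow).
class RuleCompare {
    // Draft-standard behavior: obey the first rule whose prefix matches.
    static boolean firstMatch(List<SimpleEntry<String, Boolean>> rules, String path) {
        for (SimpleEntry<String, Boolean> r : rules) {
            if (path.startsWith(r.getKey())) return r.getValue();
        }
        return true; // no rule matched: allowed by default
    }

    // Googlebot behavior: obey the rule with the longest matching prefix.
    static boolean longestMatch(List<SimpleEntry<String, Boolean>> rules, String path) {
        SimpleEntry<String, Boolean> best = null;
        for (SimpleEntry<String, Boolean> r : rules) {
            if (path.startsWith(r.getKey())
                    && (best == null || r.getKey().length() > best.getKey().length())) {
                best = r;
            }
        }
        return best == null || best.getValue();
    }
}
```

For a path like /cgi-bin/search.cgi, firstMatch answers allowed (the Allow: / line wins), while longestMatch answers disallowed.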
