[jira] Created: (NUTCH-101) RobotRulesParser

Soren Daugaard (Jira)
RobotRulesParser
----------------

         Key: NUTCH-101
         URL: http://issues.apache.org/jira/browse/NUTCH-101
     Project: Nutch
        Type: Bug
  Components: fetcher  
    Versions: 0.7, 0.8-dev    
    Reporter: Fuad Efendi


I noticed this code in the protocol-http and protocol-httpclient plugins:

      } else if ( (line.length() >= 6)
                  && (line.substring(0, 6).equalsIgnoreCase("Allow:")) ) {


However, according to the original 1994 protocol description, there is NO "Allow:" field. To allow everything, simply use an empty "Disallow:" line. http://www.robotstxt.org/wc/norobots.html

Please, try to test with www.newegg.com/robots.txt
- their site has this:
User-agent: *
Disallow:

And Nutch does not work with New Egg, but it should!
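
For illustration, here is a minimal sketch of how an empty "Disallow:" value could be handled so that a file like New Egg's permits everything. The class and method names are hypothetical and are not taken from the Nutch source; the sketch assumes all lines belong to a single record and uses plain prefix matching.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not the Nutch RobotRulesParser. */
final class EmptyDisallowSketch {

    /** Collects the path prefixes disallowed by the lines of one record. */
    static List<String> disallowedPrefixes(String[] lines) {
        List<String> prefixes = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            // Case-insensitive check for the "Disallow:" field name (9 characters).
            if (line.regionMatches(true, 0, "Disallow:", 0, 9)) {
                String path = line.substring(9).trim();
                if (!path.isEmpty()) {
                    prefixes.add(path);
                }
                // An empty value disallows nothing, so no prefix is recorded.
            }
        }
        return prefixes;
    }

    /** A path is allowed unless it starts with one of the disallowed prefixes. */
    static boolean isAllowed(String path, List<String> prefixes) {
        for (String prefix : prefixes) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> rules = disallowedPrefixes(new String[] {"User-agent: *", "Disallow:"});
        System.out.println(isAllowed("/Product/Product.aspx", rules));
    }
}
```

With the two-line file quoted above, no prefix is recorded, so every path is treated as allowed.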

Sorry guys, I don't have enough time to double-check; could you please verify all this...

I noticed a strange discussion at nutch-agent:lucene.apache.org; it seems that we need to test ......./robots.txt

User-agent: ia_archiver
Disallow: /

User-agent: Googlebot-Image
Disallow: /

User-agent: Nutch
Disallow: /

User-agent: TurnitinBot
Disallow: /    


- everything is according to the standard protocol. Could you please retest whether it works with a multiline file like this? It's a standard!

I see this in code:
   StringTokenizer tok = new StringTokenizer(agentNames, ",");
 
Comma separated? It's not an accepted standard yet...
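
As a rough sketch of the record handling discussed above: the example below groups rules per record, accepts one agent name per "User-agent:" line (several such lines may open one record), and does not split the value on commas. It is an assumption-laden illustration, not the Nutch code: the class name is hypothetical, comment lines are not handled, and agent names are simply lower-cased and matched exactly.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch for illustration only; not the Nutch RobotRulesParser. */
final class AgentRecordsSketch {

    /** Maps each lower-cased agent name to the Disallow prefixes of its record(s). */
    static Map<String, List<String>> parse(String content) {
        Map<String, List<String>> rules = new LinkedHashMap<>();
        List<String> currentAgents = new ArrayList<>();
        boolean sawRule = false;
        for (String raw : content.split("\r\n|\r|\n")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                // A blank line ends the current record.
                currentAgents = new ArrayList<>();
                sawRule = false;
                continue;
            }
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a "field: value" line; ignored in this sketch
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                if (sawRule) {
                    // A User-agent line after rules starts a new record.
                    currentAgents = new ArrayList<>();
                    sawRule = false;
                }
                String agent = value.toLowerCase(Locale.ROOT); // whole value, no comma splitting
                currentAgents.add(agent);
                rules.putIfAbsent(agent, new ArrayList<>());
            } else if (field.equals("disallow")) {
                sawRule = true;
                if (!value.isEmpty()) {
                    for (String agent : currentAgents) {
                        rules.get(agent).add(value);
                    }
                }
            }
        }
        return rules;
    }

    public static void main(String[] args) {
        String robots = "User-agent: ia_archiver\nDisallow: /\n\nUser-agent: Nutch\nDisallow: /\n";
        System.out.println(parse(robots));
    }
}
```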

Sorry WebExpertsAmerica, I really didn't have any time to run any tests...

Please do not execute tests against production sites.
Thanks!




--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

[jira] Commented: (NUTCH-101) RobotRulesParser

Soren Daugaard (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-101?page=comments#action_12331658 ]

Fuad Efendi commented on NUTCH-101:
-----------------------------------

1. There is a bug in the method parseRules(byte[] robotContent):
...
    StringTokenizer lineParser= new StringTokenizer(content, "\n\r");
...

Should be:
...
    content = content.replaceAll("\\r+", "\n");
    content = content.replaceAll("\\n+", "\n");
    StringTokenizer lineParser = new StringTokenizer(content, "\n");
...
(or something better)

Even more line terminators should be recognized:
- newline (line feed) character ('\n'),
- carriage-return character followed immediately by a newline character ("\r\n"),
- standalone carriage-return character ('\r'),
- next-line character ('\u0085'),
- line-separator character ('\u2028'),
- paragraph-separator character ('\u2029').
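
The terminators listed above are the same ones that java.util.regex.Pattern documents as line terminators. One possible way to split on all of them is sketched below; the class and method names are hypothetical, and this is not the code from parseRules. Empty lines are kept on purpose, since blank lines separate records in robots.txt.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch for illustration only; not the Nutch RobotRulesParser. */
final class LineSplitSketch {

    // CRLF is listed first so that it counts as one terminator, not two.
    private static final Pattern LINE_BREAK =
        Pattern.compile("\r\n|[\n\r\\u0085\\u2028\\u2029]");

    /** Splits content into lines, keeping empty lines (limit -1 keeps trailing ones too). */
    static List<String> splitLines(String content) {
        List<String> lines = new ArrayList<>();
        for (String line : LINE_BREAK.split(content, -1)) {
            lines.add(line);
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(splitLines("User-agent: *\r\nDisallow:\r\n"));
    }
}
```

Unlike StringTokenizer, which silently skips empty tokens, this keeps blank lines visible to the caller.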

2. The code checks for "Allow:"; however, it works fine with the standard empty "Disallow:", which means "allow everything".

3. There is a minor bug in main():
...
      String[] robotNames= new String[argv.length - 1];
...

Must be:
...
      String[] robotNames= new String[argv.length - 2];
...
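
A small sketch of the sizing issue, under the assumption (not confirmed by this thread) that main() takes two leading arguments before the robot names. Allocating argv.length - 1 slots would then leave one trailing null entry; copying the tail of the array avoids the manual size calculation. The class and method names are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical sketch for illustration only; the argument layout is an assumption. */
final class RobotNamesArgsSketch {

    /** Returns every argument after the first two, assumed here to be the robot names. */
    static String[] robotNames(String[] argv) {
        if (argv.length < 3) {
            throw new IllegalArgumentException(
                "expected two leading arguments and at least one robot name");
        }
        // Same length as new String[argv.length - 2], with the elements already copied.
        return Arrays.copyOfRange(argv, 2, argv.length);
    }

    public static void main(String[] args) {
        String[] names = robotNames(new String[] {"robots.txt", "urls.txt", "Nutch"});
        System.out.println(Arrays.toString(names));
    }
}
```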




[jira] Updated: (NUTCH-101) RobotRulesParser

Soren Daugaard (Jira)
In reply to this post by Soren Daugaard (Jira)
     [ http://issues.apache.org/jira/browse/NUTCH-101?page=all ]

Fuad Efendi updated NUTCH-101:
------------------------------

    Version: 0.6
             0.7.1

> RobotRulesParser
> ----------------
>
>          Key: NUTCH-101
>          URL: http://issues.apache.org/jira/browse/NUTCH-101
>      Project: Nutch
>         Type: Bug
>   Components: fetcher
>     Versions: 0.7, 0.8-dev, 0.6, 0.7.1
>     Reporter: Fuad Efendi

