lib-http crawl-delay problem

lib-http crawl-delay problem

Doğacan Güney-2
Hi,

There seem to be two small bugs in lib-http's RobotRulesParser.

The first is about reading crawl-delay. The code doesn't check addRules,
so the Nutch bot will pick up another robot's crawl-delay value from
robots.txt. Let me try to be clearer:

User-agent: foobot
Crawl-delay: 3600

User-agent: *
Disallow:


Given such a robots.txt file, the Nutch bot will get 3600 as its crawl-delay
value, no matter what its name actually is.
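
For what it's worth, here is a simplified, standalone sketch of the intended
logic (the class and method names and the "NutchBot" agent are made up for
illustration, and it skips the block-boundary handling the real parser does).
The point is that Crawl-delay should only be recorded while addRules is true,
i.e. while we are inside a User-agent block that matches one of our names,
which is what the attached patch adds:

import java.util.List;

public class CrawlDelaySketch {
  // Returns the crawl delay (in ms) that applies to ourAgent, or -1 if none.
  static long crawlDelayFor(List<String> robotsTxtLines, String ourAgent) {
    boolean addRules = false;  // true only inside a matching User-agent block
    long crawlDelay = -1;
    for (String line : robotsTxtLines) {
      if (line.regionMatches(true, 0, "User-agent:", 0, 11)) {
        String agent = line.substring(11).trim();
        // A block for our own name or for "*" applies to us.
        addRules = agent.equals("*") || agent.equalsIgnoreCase(ourAgent);
      } else if (line.regionMatches(true, 0, "Crawl-delay:", 0, 12) && addRules) {
        try {
          crawlDelay = Long.parseLong(line.substring(12).trim()) * 1000; // sec -> ms
        } catch (NumberFormatException e) {
          // ignore unparsable values, like the real parser does
        }
      }
    }
    return crawlDelay;
  }
}

With the example robots.txt above and ourAgent = "NutchBot", this returns -1
(no crawl-delay) instead of foobot's 3600 seconds.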

The second is about the main method. RobotRulesParser.main advertises its
usage as "<robots-file> <url-file> <agent-name>+", but if you give it more
than one agent name it refuses them.
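
As a hypothetical illustration of the argument layout the corrected main has
to handle (the class name below and the agent names are made up, and the real
main also opens the files and runs the parser): argv[0] and argv[1] are the
two file names and everything from argv[2] on is an agent name, so the length
check has to be argv.length < 3 and the name array needs argv.length - 2 slots:

import java.util.Arrays;

public class UsageSketch {
  public static void main(String[] argv) {
    if (argv.length < 3) {                             // was: argv.length != 3
      System.out.println("Usage:");
      System.out.println("   java <robots-file> <url-file> <agent-name>+");
      return;
    }
    // argv[0] = robots file, argv[1] = URL file, argv[2..] = agent names
    String[] robotNames = new String[argv.length - 2]; // was: argv.length - 1
    for (int i = 0; i < argv.length - 2; i++)
      robotNames[i] = argv[i + 2];
    System.out.println("agents: " + Arrays.toString(robotNames));
  }
}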

Trivial patch attached.

--
Doğacan Güney

Index: src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/RobotRulesParser.java
===================================================================
--- src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/RobotRulesParser.java (revision 507852)
+++ src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/RobotRulesParser.java (working copy)
@@ -389,15 +389,17 @@
       } else if ( (line.length() >= 12)
                   && (line.substring(0, 12).equalsIgnoreCase("Crawl-Delay:"))) {
         doneAgents = true;
-        long crawlDelay = -1;
-        String delay = line.substring("Crawl-Delay:".length(), line.length()).trim();
-        if (delay.length() > 0) {
-          try {
-            crawlDelay = Long.parseLong(delay) * 1000; // sec to millisec
-          } catch (Exception e) {
-            LOG.info("can not parse Crawl-Delay:" + e.toString());
+        if (addRules) {
+          long crawlDelay = -1;
+          String delay = line.substring("Crawl-Delay:".length(), line.length()).trim();
+          if (delay.length() > 0) {
+            try {
+              crawlDelay = Long.parseLong(delay) * 1000; // sec to millisec
+            } catch (Exception e) {
+              LOG.info("can not parse Crawl-Delay:" + e.toString());
+            }
+            currentRules.setCrawlDelay(crawlDelay);
           }
-          currentRules.setCrawlDelay(crawlDelay);
         }
       }
     }
@@ -500,7 +502,7 @@
 
   /** command-line main for testing */
   public static void main(String[] argv) {
-    if (argv.length != 3) {
+    if (argv.length < 3) {
       System.out.println("Usage:");
       System.out.println("   java <robots-file> <url-file> <agent-name>+");
       System.out.println("");
@@ -513,7 +515,7 @@
     try {
       FileInputStream robotsIn= new FileInputStream(argv[0]);
       LineNumberReader testsIn= new LineNumberReader(new FileReader(argv[1]));
-      String[] robotNames= new String[argv.length - 1];
+      String[] robotNames= new String[argv.length - 2];
 
       for (int i= 0; i < argv.length - 2; i++)
         robotNames[i]= argv[i+2];

Re: lib-http crawl-delay problem

rubdabadub
Hi:

I am unable to get the attached patch via mail. It's better if you
create a JIRA issue and attach the patch there.

Thank you.

On 2/15/07, Doğacan Güney <[hidden email]> wrote:

> Hi,
>
> There seem to be two small bugs in lib-http's RobotRulesParser.
>
> The first is about reading crawl-delay. The code doesn't check addRules,
> so the Nutch bot will pick up another robot's crawl-delay value from
> robots.txt. Let me try to be clearer:
>
> User-agent: foobot
> Crawl-delay: 3600
>
> User-agent: *
> Disallow:
>
>
> Given such a robots.txt file, the Nutch bot will get 3600 as its crawl-delay
> value, no matter what its name actually is.
>
> The second is about the main method. RobotRulesParser.main advertises its
> usage as "<robots-file> <url-file> <agent-name>+", but if you give it more
> than one agent name it refuses them.
>
> Trivial patch attached.
>
> --
> Doğacan Güney
>
>

Re: lib-http crawl-delay problem

Doğacan Güney-2
rubdabadub wrote:
> Hi:
>
> I am unable to get the attached patch via mail. It's better if you
> create a JIRA issue and attach the patch there.
>
> Thank you.
>

I don't know, this bug seems too minor to require its own JIRA issue,
so I put the patch at
http://www.ceng.metu.edu.tr/~e1345172/crawl-delay.patch


Re: lib-http crawl-delay problem

rubdabadub
Thanks for the link!



On 2/15/07, Doğacan Güney <[hidden email]> wrote:

> rubdabadub wrote:
> > Hi:
> >
> > I am unable to get the attached patch via mail. It's better if you
> > create a JIRA issue and attach the patch there.
> >
> > Thank you.
> >
>
> I don't know, this bug seems too minor to require its own JIRA issue,
> so I put the patch at
> http://www.ceng.metu.edu.tr/~e1345172/crawl-delay.patch
>
>

Re: lib-http crawl-delay problem

Otis Gospodnetic-2-2
In reply to this post by Doğacan Güney-2
Hi,

I think the robots.txt example you used was invalid (no path for that last Disallow rule).
Small patch indeed, but sticking it in JIRA would still make sense because:
- it leaves a good record of the bug + fix
- it could be used for release notes/changelog

Not trying to be picky, just pointing this out.

Otis

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

----- Original Message ----
From: Doğacan Güney <[hidden email]>
To: [hidden email]
Sent: Thursday, February 15, 2007 9:12:28 PM
Subject: Re: lib-http crawl-delay problem

rubdabadub wrote:
> Hi:
>
> I am unable to get the attached patch via mail. It's better if you
> create a JIRA issue and attach the patch there.
>
> Thank you.
>

I don't know, this bug seems too minor to require its own JIRA issue,
so I put the patch at
http://www.ceng.metu.edu.tr/~e1345172/crawl-delay.patch