nutch inject bug(fix)


Jochen Frey-2
Hi,

I believe there is a bug in 'nutch inject'. When the db contains a URL
with status DB_fetched and the same URL is injected again, the status is
(sometimes) reset to DB_unfetched. I believe it depends on the order in
which the entries for that URL make it into the reduce-set.

If this is indeed a bug and can be confirmed by someone else, then the
patch below should fix it.

Please comment / advise.

Thanks!
Jochen


Patch:

Index: C:/eclipse/workspace/nutch/src/java/org/apache/nutch/crawl/CrawlDbReducer.java
===================================================================
--- C:/eclipse/workspace/nutch/src/java/org/apache/nutch/crawl/CrawlDbReducer.java    (revision 398634)
+++ C:/eclipse/workspace/nutch/src/java/org/apache/nutch/crawl/CrawlDbReducer.java    (working copy)
@@ -58,7 +58,9 @@
       case CrawlDatum.STATUS_DB_UNFETCHED:
       case CrawlDatum.STATUS_DB_FETCHED:
       case CrawlDatum.STATUS_DB_GONE:
-        old = datum;
+        if (old == null || old.getStatus() < datum.getStatus()) {
+          old = datum;
+        }
         break;
       case CrawlDatum.STATUS_LINKED:
         scoreIncrement += datum.getScore();
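
A minimal, standalone sketch of the order dependence described above and of the comparison the patch introduces. It does not use any Nutch classes: the class name `InjectOrderSketch`, the methods `lastWins` and `highestWins`, and the numeric status values are all hypothetical stand-ins. The only assumption taken from the patch is that the unfetched status compares as smaller than the fetched one.

```java
import java.util.Arrays;
import java.util.List;

public class InjectOrderSketch {
  // Hypothetical stand-ins for the CrawlDatum status constants. The numeric
  // values are assumptions, chosen so that UNFETCHED < FETCHED < GONE.
  static final int DB_UNFETCHED = 1;
  static final int DB_FETCHED = 2;
  static final int DB_GONE = 3;

  // Pre-patch behaviour: every db status overwrites the previous one, so the
  // result is whichever value happens to be seen last. Expects a non-empty list.
  static int lastWins(List<Integer> statuses) {
    Integer old = null;
    for (int status : statuses) {
      old = status;
    }
    return old;
  }

  // Patched behaviour: replace the kept value only when the incoming status
  // is numerically greater, so the result no longer depends on arrival order.
  static int highestWins(List<Integer> statuses) {
    Integer old = null;
    for (int status : statuses) {
      if (old == null || old < status) {
        old = status;
      }
    }
    return old;
  }

  public static void main(String[] args) {
    // The same two entries for one URL, arriving in either order.
    List<Integer> injectedLast = Arrays.asList(DB_FETCHED, DB_UNFETCHED);
    List<Integer> injectedFirst = Arrays.asList(DB_UNFETCHED, DB_FETCHED);

    System.out.println(lastWins(injectedLast));     // 1: fetched status is lost
    System.out.println(lastWins(injectedFirst));    // 2: fetched status survives
    System.out.println(highestWins(injectedLast));  // 2
    System.out.println(highestWins(injectedFirst)); // 2
  }
}
```

Under these assumed values a gone entry would also win over a fetched one, so the comparison is only as meaningful as the numeric ordering of the real constants.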