Re: CrawlDbReducer and the lone STATUS_SIGNATURE record

Andrzej Białecki-2
(redirected to nutch-dev)

[hidden email] wrote:

> CrawlDbReducer#reduce doesn't have a switch case for
> CrawlDatum.STATUS_SIGNATURE, so we fall into the default block
> (line #121), which throws a RuntimeException. As a result, my update db
> job never succeeds.
>
> This has just recently started happening.
>
> Enabling logging I see that what usually happens is that a CrawlDatum
> with a STATUS_SIGNATURE status comes through first and is set to be
> 'highest' (line #49) but then the next record through takes over the
> 'highest' role because its status is higher, usually 'fetch_success'
> or 'linked' in my case.
>
> But for reasons not clear to me, I'll sometimes have a lone CrawlDatum
> with a status of STATUS_SIGNATURE (did a map output lose a record?) and
> no following 'fetch_success' or 'linked' CrawlDatum.
> This probably shouldn't fail the job.
>
> Attached is a patch that logs a warning and keeps going, but it is
> probably not the right solution.

How weird, This Should Never Happen(tm) ... ;) Lost map output should
show up in the logs, or perhaps should even have killed the job, isn't
that so? I'll apply your patch for now, but we need to keep an eye on this.
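
For anyone following along without the source open, here is a rough, self-contained sketch of the behaviour described above. It is not the actual CrawlDbReducer code: the class name, the numeric status values and the returned strings are all made up for illustration, and it assumes only that the record with the numerically highest status wins and that a lone signature record is then handled like a temporary failure, as the patch does.

```java
import java.util.Arrays;
import java.util.List;

public class LoneSignatureSketch {
    // Hypothetical status codes, ordered so that a larger value "wins".
    // These are illustrative values, not necessarily Nutch's constants.
    static final int STATUS_SIGNATURE = 0;
    static final int STATUS_LINKED = 4;
    static final int STATUS_FETCH_SUCCESS = 5;
    static final int STATUS_FETCH_RETRY = 6;

    /** Picks the highest status among the (non-empty) values seen for one key. */
    static int highest(List<Integer> statuses) {
        int highest = statuses.get(0);
        for (int s : statuses) {
            if (s > highest) {
                highest = s;   // a later, higher status takes over
            }
        }
        return highest;
    }

    /** Returns a label for the branch taken; stands in for updating the result datum. */
    static String reduce(String key, List<Integer> statuses) {
        int highest = highest(statuses);
        switch (highest) {
            case STATUS_LINKED:
                return "linked";
            case STATUS_FETCH_SUCCESS:
                return "fetch_success";
            case STATUS_SIGNATURE:
                // Only reachable when no other record arrived for this key.
                System.err.println("Lone STATUS_SIGNATURE: " + key);
                // no break: handled like a temporary failure
            case STATUS_FETCH_RETRY:
                return "retry";
            default:
                throw new RuntimeException("Unknown status: " + highest + " " + key);
        }
    }

    public static void main(String[] args) {
        // Usual case: the signature record is overtaken by fetch_success.
        System.out.println(reduce("http://example.com/a",
            Arrays.asList(STATUS_SIGNATURE, STATUS_FETCH_SUCCESS)));
        // Lone signature: warns and carries on instead of throwing.
        System.out.println(reduce("http://example.com/b",
            Arrays.asList(STATUS_SIGNATURE)));
    }
}
```

Without the STATUS_SIGNATURE case, the second call would land in the default branch and throw, which is the failure reported above.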

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Index: src/java/org/apache/nutch/crawl/CrawlDbReducer.java
===================================================================
--- src/java/org/apache/nutch/crawl/CrawlDbReducer.java (revision 397664)
+++ src/java/org/apache/nutch/crawl/CrawlDbReducer.java (working copy)
@@ -19,11 +19,16 @@
 import java.util.Iterator;
 import java.io.IOException;
 
+import java.util.logging.*;
+
 import org.apache.hadoop.io.*;
 import org.apache.hadoop.mapred.*;
+import org.apache.hadoop.util.LogFormatter;
 
 /** Merge new page entries with existing entries. */
 public class CrawlDbReducer implements Reducer {
+  public static final Logger LOG =
+    LogFormatter.getLogger("org.apache.nutch.crawl.CrawlDbReducer");
   private int retryMax;
   private CrawlDatum result = new CrawlDatum();
 
@@ -102,6 +107,8 @@
       result.setNextFetchTime();
       break;
 
+    case CrawlDatum.STATUS_SIGNATURE:
+      LOG.warning("Lone CrawlDatum.STATUS_SIGNATURE: " + key); // no break: falls through
     case CrawlDatum.STATUS_FETCH_RETRY:           // temporary failure
       if (old != null)
         result.setSignature(old.getSignature());  // use old signature
@@ -119,7 +126,7 @@
       break;
 
     default:
-      throw new RuntimeException("Unknown status: "+highest.getStatus());
+      throw new RuntimeException("Unknown status: "+highest.getStatus() + " " + key);
     }
     
     result.setScore(result.getScore() + scoreIncrement);

Stack-6
Andrzej Bialecki wrote:
> (redirected to nutch-dev)
Pardon me. I intended to send this to nutch-dev, not hadoop-dev.
> ...
> How weird, This Should Never Happen(tm) ... ;) Lost map output should
> show up in logs, or perhaps even should've killed the job, isn't that so?
Yes. I'd have thought so.

> I'll apply your patch for now, but we need to keep an eye on this.
Grand.
St.Ack

Andrzej Białecki-2
Michael Stack wrote:
> Andrzej Bialecki wrote:
>> (redirected to nutch-dev)
> Pardon me. I intended to send this to nutch-dev, not hadoop-dev.
>> ...
>> How weird, This Should Never Happen(tm) ... ;) Lost map output should
>> show up in logs, or perhaps even should've killed the job, isn't that
>> so?
> Yes. I'd have thought so.

Patch applied. Please keep an eye on the log messages; if they reappear,
we should try to determine their cause.
