map/reduce problem

Doğacan Güney-2
Hi,

There seems to be a problem with the current Nutch svn. If you fetch with
the -noParsing option and then parse the segment, all URLs end up with the
same parse_text (the parse_text of the first URL).

In ParseSegment's map function:
Content content = (Content) value;

If you check the content after this line, it seems to be the same for all keys.

Does anyone else have this problem?

Doğacan Güney


Re: map/reduce problem

Sami Siren-2
Doğacan Güney wrote:

> Hi,
>
> There seems to be a problem with the current Nutch svn. If you fetch with
> the -noParsing option and then parse the segment, all URLs end up with the
> same parse_text (the parse_text of the first URL).
>
> In ParseSegment's map function:
> Content content = (Content) value;
>
> If you check the content after this line, it seems to be the same for all keys.
>
> Does anyone else have this problem?
Yes, this was an unfortunate side effect of my optimization efforts.
Please try the attached patch and let me know if it works for you.

--
  Sami Siren

Index: src/java/org/apache/nutch/protocol/Content.java
===================================================================
--- src/java/org/apache/nutch/protocol/Content.java (revision 475295)
+++ src/java/org/apache/nutch/protocol/Content.java (working copy)
@@ -298,4 +298,12 @@
     return typeName;
   }
 
+  /**
+   * By calling this method one ensures that on next read/write to any property
+   * parent object is consulted to check if decompressing of data is required.
+   */
+  public void forceInflate() {
+    inflated = false;
+  }
+
 }
Index: src/java/org/apache/nutch/parse/ParseSegment.java
===================================================================
--- src/java/org/apache/nutch/parse/ParseSegment.java (revision 475295)
+++ src/java/org/apache/nutch/parse/ParseSegment.java (working copy)
@@ -66,8 +66,9 @@
       newKey.set(key.toString());
       key = newKey;
     }
-    Content content = (Content)value;
-
+    Content content = (Content) value;
+    content.forceInflate();
+    
     Parse parse = null;
     ParseStatus status;
     try {
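
For readers following the patch, here is a minimal sketch of the kind of stale-cache problem it appears to address. It rests on two assumptions that the thread does not spell out: that the map/reduce framework refills one value object for every record instead of allocating a new one, and that the `inflated` flag in Content short-circuits decoding once it has run. `LazyRecord`, `readFields(byte[])`, `getText()` and `readAll` are hypothetical names made up for this illustration; this is not the real Nutch Content class.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ReuseDemo {

    /** Hypothetical stand-in for a lazily decoded record (not Nutch's Content). */
    static class LazyRecord {
        private byte[] raw;        // undecoded payload from the latest read
        private String text;       // decoded value, cached after first access
        private boolean inflated;  // shortcut flag: true once text was decoded

        /** Assumed framework behaviour: the same instance is refilled per record. */
        void readFields(byte[] bytes) {
            raw = bytes;
            // The flag is deliberately NOT reset here; that is the bug being modelled.
        }

        /** Clears the shortcut flag so the next getter decodes the fresh payload. */
        void forceInflate() {
            inflated = false;
        }

        String getText() {
            if (!inflated) {
                text = new String(raw, StandardCharsets.UTF_8);
                inflated = true;
            }
            return text;
        }
    }

    /** Feeds three payloads through ONE reused record, as a map task might. */
    public static List<String> readAll(boolean resetBeforeUse) {
        LazyRecord value = new LazyRecord();
        List<String> seen = new ArrayList<>();
        for (String payload : new String[] {"first", "second", "third"}) {
            value.readFields(payload.getBytes(StandardCharsets.UTF_8));
            if (resetBeforeUse) {
                value.forceInflate();
            }
            seen.add(value.getText());
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(readAll(false)); // [first, first, first]
        System.out.println(readAll(true));  // [first, second, third]
    }
}
```

Without the reset, every record reports the text of the first one, which matches the symptom described above. With the reset at the top of each iteration, each record is decoded again, which is what the `content.forceInflate()` call in the patch seems intended to achieve.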
Re: map/reduce problem

Doğacan Güney-2
Sami Siren wrote:

> Yes, this was an unfortunate side effect of my optimization efforts.
> Please try the attached patch and let me know if it works for you.
That works just fine. Thanks!
