[jira] Created: (NUTCH-634) Patch - Nutch - Hadoop 0.17.0


[jira] Created: (NUTCH-634) Patch - Nutch - Hadoop 0.17.0

Hudson (Jira)
Patch - Nutch - Hadoop 0.17.0
-----------------------------

                 Key: NUTCH-634
                 URL: https://issues.apache.org/jira/browse/NUTCH-634
             Project: Nutch
          Issue Type: Improvement
    Affects Versions: 0.9.0
            Reporter: Michael Gottesman
             Fix For: 0.9.0


This is a patch so that Nutch can be used with Hadoop 0.17.0. The patch is located at http://pastie.org/212001

The patch compiles and passes all current Nutch unit tests.

I have tested that the crawler side of Nutch (i.e. inject, generate, fetch, parse, merge w/crawldb) definitely works, but I have not tested the Lucene indexing part. It might work, but it might not.

*NOTE* - the two main bugs that had to be overcome were not noticed by any of the unit tests. The bugs only came up during actual testing. The bugs were:

1. Changes to the Hadoop Iterator
2. Addition of Serialization to MapReduce Framework


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-634) Patch - Nutch - Hadoop 0.17.0

Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12603771#action_12603771 ]

Michael Gottesman commented on NUTCH-634:
-----------------------------------------

I apologize: when I wrote this up, I made a mistake. The second bug was definitely caught, because the code would not have compiled otherwise... Lack of sleep => stupid things...


[jira] Assigned: (NUTCH-634) Patch - Nutch - Hadoop 0.17.0

Hudson (Jira)
In reply to this post by Hudson (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  reassigned NUTCH-634:
---------------------------------------

    Assignee: Andrzej Bialecki


[jira] Commented: (NUTCH-634) Patch - Nutch - Hadoop 0.17.0

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12604064#action_12604064 ]

Andrzej Bialecki  commented on NUTCH-634:
-----------------------------------------

Please attach the correct patch to this issue - remember to mark the checkbox that grants ASL license.


[jira] Updated: (NUTCH-634) Patch - Nutch - Hadoop 0.17.0

Hudson (Jira)
In reply to this post by Hudson (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Gottesman updated NUTCH-634:
------------------------------------

    Attachment: diff


[jira] Commented: (NUTCH-634) Patch - Nutch - Hadoop 0.17.0

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12604664#action_12604664 ]

Lincoln Ritter commented on NUTCH-634:
--------------------------------------

There's a bug in Michael's patch where segments are being passed to the index merger rather than indexes.  I've attached an updated patch.


[jira] Updated: (NUTCH-634) Patch - Nutch - Hadoop 0.17.0

Hudson (Jira)
In reply to this post by Hudson (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lincoln Ritter updated NUTCH-634:
---------------------------------

    Attachment: hadoop-0.17.patch

Fixes a small problem in the previous patch where segments get passed to the index merger instead of indexes from within Crawl.main.


[jira] Commented: (NUTCH-634) Patch - Nutch - Hadoop 0.17.0

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12604667#action_12604667 ]

Andrzej Bialecki  commented on NUTCH-634:
-----------------------------------------

The attached diff is not a valid patch created with 'svn diff'. Please create a patch using 'svn diff', from the top of the source tree of Nutch trunk/.

I'm not sure whether the FileOnlySequenceFileOutputFormat is the right answer to the problem of _logs directories ... I think the existence of these directories is caused by a setting in Hadoop configuration, hadoop.job.history.user.location, which defaults to the output directory (which sounds awfully strange to me to use this as a default!). Further investigation is needed before we mess up things on our side. ;)

The code formatting on these two new files and in some other places doesn't conform to the Nutch formatting, which is basically the Sun style with 2 space indents. Please note also that you use different curly brace placement than the Sun style advises.

Generics on the CrawlDbReducer are too general, instead of

bq. implements Reducer<WritableComparable,Writable,WritableComparable,Writable>

it should be

bq. implements Reducer<Text, CrawlDatum, Text, CrawlDatum>

Similar tightening should be done in other places where you added generics.
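
A minimal sketch of why the tighter type parameters help. SketchReducer, SketchCollector and SketchDatum are hypothetical stand-ins written for this illustration; they are not the Hadoop Reducer API or Nutch's CrawlDatum, and String stands in for Text. The loosely typed version needs a cast that can only fail at run time, while the tightened version lets the compiler check the value type.

```java
import java.util.Iterator;

// Hypothetical stand-ins for illustration; NOT the Hadoop Reducer/OutputCollector API.
interface SketchCollector<K, V> {
  void collect(K key, V value);
}

interface SketchReducer<K2, V2, K3, V3> {
  void reduce(K2 key, Iterator<V2> values, SketchCollector<K3, V3> output);
}

// Hypothetical value type standing in for a crawl record.
final class SketchDatum {
  final long fetchTime;

  SketchDatum(long fetchTime) {
    this.fetchTime = fetchTime;
  }
}

// Too general: the compiler cannot check the value type, so a cast is required.
final class LooseReducer implements SketchReducer<Object, Object, Object, Object> {
  public void reduce(Object key, Iterator<Object> values,
      SketchCollector<Object, Object> output) {
    SketchDatum newest = null;
    while (values.hasNext()) {
      SketchDatum d = (SketchDatum) values.next(); // a wrong type fails only at run time
      if (newest == null || d.fetchTime > newest.fetchTime) {
        newest = d;
      }
    }
    if (newest != null) {
      output.collect(key, newest);
    }
  }
}

// Tightened: key and value types are fixed, so no cast is needed.
final class TightReducer
    implements SketchReducer<String, SketchDatum, String, SketchDatum> {
  public void reduce(String key, Iterator<SketchDatum> values,
      SketchCollector<String, SketchDatum> output) {
    SketchDatum newest = null;
    while (values.hasNext()) {
      SketchDatum d = values.next();
      if (newest == null || d.fetchTime > newest.fetchTime) {
        newest = d;
      }
    }
    if (newest != null) {
      output.collect(key, newest);
    }
  }
}
```

Both reducers emit the record with the largest fetchTime; only the second one would be rejected at compile time if it were handed values of the wrong type.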

The CrawlDatum.shallowCopy() method is dangerous IMHO - newly created copies still contain references to the same metaData instance, which may be modified any time by the framework as you iterate through the input items. We should do a deep clone using WritableUtils.clone().

IndexDoc.copyConstructor() should be replaced by a deep clone().
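
A minimal, self-contained sketch of the aliasing problem described above, assuming a framework iterator that refills one shared value object on each call to next(). ReusedRecord, ReusingIterator and CopyDemo are hypothetical stand-ins, not Nutch or Hadoop classes, and the deep copy here is a plain map copy rather than WritableUtils.clone().

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical record with mutable metadata; a stand-in, not Nutch's CrawlDatum.
final class ReusedRecord {
  int status;
  Map<String, String> metaData = new HashMap<>();

  ReusedRecord shallowCopy() {
    ReusedRecord c = new ReusedRecord();
    c.status = status;
    c.metaData = metaData; // still points at the same map instance
    return c;
  }

  ReusedRecord deepCopy() {
    ReusedRecord c = new ReusedRecord();
    c.status = status;
    c.metaData = new HashMap<>(metaData); // independent map
    return c;
  }
}

// Simulates a framework that reuses a single value object across next() calls.
final class ReusingIterator implements Iterator<ReusedRecord> {
  private final String[] urls;
  private final ReusedRecord shared = new ReusedRecord();
  private int pos = 0;

  ReusingIterator(String... urls) {
    this.urls = urls;
  }

  public boolean hasNext() {
    return pos < urls.length;
  }

  public ReusedRecord next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    shared.status = pos;
    shared.metaData.clear(); // mutates the shared map in place
    shared.metaData.put("url", urls[pos]);
    pos++;
    return shared;
  }
}

final class CopyDemo {
  // Keeps a copy of every record, then reports "status:url" for each kept copy.
  static List<String> collect(boolean deep, String... urls) {
    List<ReusedRecord> kept = new ArrayList<>();
    Iterator<ReusedRecord> it = new ReusingIterator(urls);
    while (it.hasNext()) {
      ReusedRecord r = it.next();
      kept.add(deep ? r.deepCopy() : r.shallowCopy());
    }
    List<String> out = new ArrayList<>();
    for (ReusedRecord r : kept) {
      out.add(r.status + ":" + r.metaData.get("url"));
    }
    return out;
  }
}
```

With the inputs "a" and "b", the shallow copies both report the url "b" because they share the map that the iterator keeps overwriting, while the deep copies keep "a" and "b".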






Re: [jira] Commented: (NUTCH-634) Patch - Nutch - Hadoop 0.17.0

Lincoln Ritter
Which patch are you referring to?  The patch I just added *only*
addressed the index/segments confusion and was created by executing
'svn diff' from the trunk root.

-lincoln

--
lincolnritter.com



On Thu, Jun 12, 2008 at 3:32 PM, Andrzej Bialecki  (JIRA)
<[hidden email]> wrote:

>
>    [ https://issues.apache.org/jira/browse/NUTCH-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12604667#action_12604667 ]
>
> Andrzej Bialecki  commented on NUTCH-634:
> -----------------------------------------
>
> The attached diff is not a valid patch created with 'svn diff'. Please create a patch using 'svn diff', from the top of the source tree of Nutch trunk/.
>
> I'm not sure whether the FileOnlySequenceFileOutputFormat is the right answer to the problem of _logs directories ... I think the existence of these directories is caused by a setting in Hadoop configuration, hadoop.job.history.user.location, which defaults to the output directory (which sounds awfully strange to me to use this as a default!). Further investigation is needed before we mess up things on our side. ;)
>
> The code formatting on these two new files and in some other places doesn't conform to the Nutch formatting, which is basically the Sun style with 2 space indents. Please note also that you use different curly brace placement than the Sun style advises.
>
> Generics on the CrawlDbReducer are too general, instead of
>
> bq. implements Reducer<WritableComparable,Writable,WritableComparable,Writable>
>
> it should be
>
> bq. implements Reducer<Text, CrawlDatum, Text, CrawlDatum>
>
> Similar tightening should be done in other places where you added generics.
>
> The CrawlDatum.shallowCopy() method is dangerous IMHO - newly created copies still contain references to the same metaData instance, which may be modified any time by the framework as you iterate through the input items. We should do a deep clone using WritableUtils.clone().
>
> IndexDoc.copyConstructor() should be replaced by a deep clone().

Re: [jira] Commented: (NUTCH-634) Patch - Nutch - Hadoop 0.17.0

Andrzej Białecki-2
Lincoln Ritter wrote:
> Which patch are you referring to?  The patch I just added *only*
> addressed the index/segments confusion and was created by executing
> 'svn diff' from the trunk root.

Right :) You added your patch while I was editing the JIRA comment, so
my comments refer to Michael's patch.


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


[jira] Commented: (NUTCH-634) Patch - Nutch - Hadoop 0.17.0

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12604682#action_12604682 ]

Michael Gottesman commented on NUTCH-634:
-----------------------------------------

So actually, I remembered to make it have an ASF license, but forgot to redo the diff =p. Sorry. But it looks like Lincoln's patch suffices. Also, here is a quick rundown on your comments.

1. I just put in FileOnlySequenceFileOutputFormat because it was the last bug I was getting. I was a little annoyed at the time so I just stuck it in. There is actually a native Hadoop way of doing this via a static class. I have seen it before in the code, I just don't remember exactly where.

2. About the code indenting. I was screwing with my emacs trying to get it to do that. But I figured you were more interested in the code and I could deal with that later =p.

3. Generics easy fix =).

4. The reason I did the shallow copy even with the metadata: since it is of type byte[], it was not clear to me at the time (I remember being distinctly very tired) whether it would be considered a native type or an object. Now, of course, I realize that I was really smoking something there... but that's beside the point =p.

5. The IndexDoc.copyConstructor() was just put in because I was not sure if a deep clone would be needed or not.

So in sum, all of what you suggest should be easy changes =). I will re-download the trunk, run svn diff from the trunk root, and correct those points.


[jira] Updated: (NUTCH-634) Patch - Nutch - Hadoop 0.17.0

Hudson (Jira)
In reply to this post by Hudson (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lincoln Ritter updated NUTCH-634:
---------------------------------

    Attachment: hadoop-0.17.patch

My previous patch left out FileOnlySequenceFileOutputFormat.java and FileOnlyPathFilter.java.  This patch includes them.

However, I don't think these are long-term solutions.  Michael can clarify, but it seems that this code is meant to deal with things like '_logs' directories in Hadoop.  What we really need is some way to ask Hadoop whether a directory or file is somehow 'special'.  I'm not sure what the definition of 'special' is here though...

Michael suggested there was a way to do this and I will be looking into it as well, but it would be nice if someone more familiar with Hadoop than I am could chime in to suggest a course of action here.


[jira] Commented: (NUTCH-634) Patch - Nutch - Hadoop 0.17.0

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12609295#action_12609295 ]

Andrzej Bialecki  commented on NUTCH-634:
-----------------------------------------

This issue will likely be fixed in Hadoop 0.19; until then, we can work around it in Nutch by overriding the Hadoop property hadoop.job.history.user.location and setting it to e.g. ${hadoop.log.dir}/history/user . IMHO using a special OutputFormat introduces more confusion and complicates future upgrades ... Either way, this would have to be documented in the release notes.

I'd like to move forward on this issue in the next few days, if the solution I propose above seems acceptable - that is, to remove the use of special OutputFormats and add an override for that Hadoop property in nutch-default.xml


[jira] Commented: (NUTCH-634) Patch - Nutch - Hadoop 0.17.0

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12609308#action_12609308 ]

Michael Gottesman commented on NUTCH-634:
-----------------------------------------

There is actually a special thing in Hadoop called the HiddenFileFilter in FileInputFormat (or filter, I don't remember which). I recently emailed the Hadoop dev list and asked whether it could be made public instead of private (it resolves the issue by filtering out all files that begin with _, e.g. _logs). The list said to submit a patch and it would be integrated into Hadoop 0.19.

I am going to submit the Hadoop patch in a few minutes. In the meantime your idea seems absolutely lovely.

So yes, your suggestion is perfect =).
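
A minimal sketch of the filtering idea mentioned above, assuming the only rule is "skip names that begin with an underscore". HiddenNameFilterSketch and its members are hypothetical names for this illustration; this is not Hadoop's own filter class and it works on plain strings rather than Hadoop Path objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration; not the Hadoop filter discussed in the thread.
final class HiddenNameFilterSketch {
  // Accepts only names that do not start with "_" (so "_logs" is rejected).
  static final Predicate<String> VISIBLE = name -> !name.startsWith("_");

  // Returns the entries of an output directory listing that should be read as data.
  static List<String> visibleEntries(List<String> names) {
    List<String> out = new ArrayList<>();
    for (String name : names) {
      if (VISIBLE.test(name)) {
        out.add(name);
      }
    }
    return out;
  }
}
```

Applied to a listing such as part-00000, _logs, part-00001, the sketch keeps only the two part files.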



Re: [jira] Commented: (NUTCH-634) Patch - Nutch - Hadoop 0.17.0

Lincoln Ritter
Just to clarify: Andrzej, the resolution you speak of in 0.19 - is
that resolution independent of Michael's patch?

I think any solution with less code is preferable, so a configuration
change seems like a great way to go.  (I didn't realize one could
change hadoop parameters from the nutch config!) That being said, well
defined Hadoop behavior shouldn't break Nutch, so exposing a public
interface for "special" files (like hidden files) I think is a good
idea.  Nutch mysteriously breaking because it can't determine its
input properly seems much more confusing (to a user anyway) than an
additional few lines of code.

-lincoln

--
lincolnritter.com



On Mon, Jun 30, 2008 at 10:51 AM, Michael Gottesman (JIRA)
<[hidden email]> wrote:

>
>    [ https://issues.apache.org/jira/browse/NUTCH-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12609308#action_12609308 ]
>
> Michael Gottesman commented on NUTCH-634:
> -----------------------------------------
>
> There is actually a special thing in Hadoop called the HiddenFileFilter in FileInputFormat (or filter, I don't remember which). I recently emailed the Hadoop dev list and asked whether it could be made public instead of private (it resolves the issue by filtering out all files that begin with _, e.g. _logs). The list said to submit a patch and it would be integrated into Hadoop 0.19.
>
> I am going to submit the Hadoop patch in a few minutes. In the meantime your idea seems absolutely lovely.
>
> So yes, your suggestion is perfect =).

Re: [jira] Commented: (NUTCH-634) Patch - Nutch - Hadoop 0.17.0

Andrzej Białecki-2
Lincoln Ritter wrote:
> Just to clarify: Andrzej, the resolution you speak of in 0.19 - is
> that resolution independent of Michael's patch?

Yes, this is something that will be submitted in a separate Hadoop JIRA
issue.

>
> I think any solution with less code is preferable, so a configuration
> change seems like a great way to go.  (I didn't realize one could
> change hadoop parameters from the nutch config!)

Nutch configuration files are loaded later than Hadoop config files, and
any properties defined there, which are not already declared "final" in
Hadoop, can be overridden. Usually you don't notice this, because Nutch
uses property names that don't collide with Hadoop property names. Also,
this mechanism was a bit different in older versions of Hadoop, where
whole resources were declared "final" instead of individual properties.
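
A minimal sketch of the override rule described above: resources are loaded in order, a later resource overrides earlier values, and a property marked "final" can no longer be changed. LayeredConfigSketch is a hypothetical class for illustration only, not Hadoop's Configuration implementation, and it does not expand ${...} variables.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of layered configuration; not Hadoop's Configuration.
final class LayeredConfigSketch {
  private final Map<String, String> values = new LinkedHashMap<>();
  private final Set<String> finals = new HashSet<>();

  // Loads one resource. Later calls override earlier values, except for
  // properties that an earlier resource declared final.
  void load(Map<String, String> resource, Set<String> finalKeys) {
    for (Map.Entry<String, String> e : resource.entrySet()) {
      if (finals.contains(e.getKey())) {
        continue; // locked by an earlier resource
      }
      values.put(e.getKey(), e.getValue());
    }
    finals.addAll(finalKeys); // lock this resource's final properties afterwards
  }

  String get(String key) {
    return values.get(key);
  }
}
```

Loading a Hadoop-side map first and a Nutch-side map second would let the Nutch value for hadoop.job.history.user.location win, as long as the Hadoop side did not mark that property final.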

> That being said, well
> defined Hadoop behavior shouldn't break Nutch,

But that's the problem - this Hadoop feature is ill-defined, and it even
breaks internal Hadoop classes such as MapFileOutputFormat.getReaders().

>  so exposing a public
> interface for "special" files (like hidden files) I think is a good
> idea.  Nutch mysteriously breaking because it can't determine its
> input properly seems much more confusing (to a user anyway) than an
> additional few lines of code.

Well, generally speaking I agree - but in this particular case it's a
Hadoop mis-feature that needs to be avoided for the time being. We can't
fix this bug in Hadoop 0.17 or 0.18, only in 0.19 (and then perhaps it
can be backported to 0.17.1 or 0.18.1).


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


[jira] Commented: (NUTCH-634) Patch - Nutch - Hadoop 0.17.0

Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12609415#action_12609415 ]

Andrzej Bialecki  commented on NUTCH-634:
-----------------------------------------

I ran a test crawl using the Hadoop 0.17.1 release, after applying this patch minus the OutputFormat changes and setting the property as above. The crawl succeeded with no problems.

If there are no further objections, I'd like to commit this patch with these changes within a day or two.



[jira] Commented: (NUTCH-634) Patch - Nutch - Hadoop 0.17.0

Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12613972#action_12613972 ]

Roman Valls commented on NUTCH-634:
-----------------------------------

Is there any blocking/pending code on this? I cannot see it on trunk... I think I am running into this Hadoop 0.16 related bug:

http://issues.apache.org/jira/browse/HADOOP-3007

http://article.gmane.org/gmane.comp.jakarta.lucene.hadoop.user/6947/match=dfs+connection+reset

This issue is slowing down/blocking my DFS operations with Nutch... I'm offering to beta-test Hadoop 0.17 (this patch) + Nutch trunk on my modest 3-node cluster :)



[jira] Commented: (NUTCH-634) Patch - Nutch - Hadoop 0.17.0

Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12614005#action_12614005 ]

Andrzej Bialecki  commented on NUTCH-634:
-----------------------------------------

Sorry, I was away ... I don't think there are any pending issues, I'm going to commit this in a few days.



[jira] Updated: (NUTCH-634) Patch - Nutch - Hadoop 0.17.1

Hudson (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  updated NUTCH-634:
------------------------------------

    Affects Version/s:     (was: 0.9.0)
                       1.0.0
        Fix Version/s:     (was: 0.9.0)
                       1.0.0
              Summary: Patch - Nutch - Hadoop 0.17.1  (was: Patch - Nutch - Hadoop 0.17.0)

In the meantime Hadoop has released version 0.17.1, so it makes sense to upgrade to this version instead of 0.17.0.

