[jira] Created: (NUTCH-634) Patch - Nutch - Hadoop 0.17.0


[jira] Updated: (NUTCH-634) Patch - Nutch - Hadoop 0.17.1

Jorge Spinsanti (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  updated NUTCH-634:
------------------------------------

    Attachment: hadoop-0.17.patch

Patch to upgrade to Hadoop 0.17.1. This builds upon the previous patches, but it also replaces many deprecated API uses. It also uses the workaround discussed previously, instead of using specialized InputFormat-s.

> Patch - Nutch - Hadoop 0.17.1
> -----------------------------
>
>                 Key: NUTCH-634
>                 URL: https://issues.apache.org/jira/browse/NUTCH-634
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.0.0
>            Reporter: Michael Gottesman
>            Assignee: Andrzej Bialecki
>             Fix For: 1.0.0
>
>         Attachments: diff, hadoop-0.17.patch, hadoop-0.17.patch, hadoop-0.17.patch
>
>
> This is a patch so that Nutch can be used with Hadoop 0.17.0. The patch is located at http://pastie.org/212001
> The patch compiles and passes all current Nutch unit tests.
> I have tested that the crawler side of Nutch (i.e. inject, generate, fetch, parse, merge w/crawldb) definitely works, but have not tested the Lucene indexing part. It might work, but it might not.
> *NOTE* - the two main bugs that had to be overcome were not noticed by any of the unit tests. The bugs only came up during actual testing. The bugs were:
> 1. Changes to the Hadoop Iterator
> 2. Addition of Serialization to MapReduce Framework

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-634) Patch - Nutch - Hadoop 0.17.1

Jorge Spinsanti (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12615503#action_12615503 ]

Hudson commented on NUTCH-634:
------------------------------

Integrated in Nutch-trunk #516 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/516/])


[jira] Commented: (NUTCH-634) Patch - Nutch - Hadoop 0.17.1

Jorge Spinsanti (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12615815#action_12615815 ]

Roman Valls commented on NUTCH-634:
-----------------------------------

As promised, I've tested this patch in production (7-node cluster)... the crawl gets halted after these exceptions:

java.lang.AbstractMethodError: org.apache.nutch.crawl.PartitionUrlByHost.getPartition(Ljava/lang/Object;Ljava/lang/Object;I)I
        at org.apache.nutch.crawl.Generator$Selector.getPartition(Generator.java:171)
        at org.apache.nutch.crawl.Generator$Selector.getPartition(Generator.java:83)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:464)
        at org.apache.nutch.crawl.Generator$Selector.map(Generator.java:165)
        at org.apache.nutch.crawl.Generator$Selector.map(Generator.java:83)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)

java.lang.AbstractMethodError: org.apache.nutch.crawl.PartitionUrlByHost.getPartition(Ljava/lang/Object;Ljava/lang/Object;I)I
        at org.apache.nutch.crawl.Generator$Selector.getPartition(Generator.java:171)
        at org.apache.nutch.crawl.Generator$Selector.getPartition(Generator.java:83)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:464)
        at org.apache.nutch.crawl.Generator$Selector.map(Generator.java:165)
        at org.apache.nutch.crawl.Generator$Selector.map(Generator.java:83)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)

java.lang.AbstractMethodError: org.apache.nutch.crawl.PartitionUrlByHost.getPartition(Ljava/lang/Object;Ljava/lang/Object;I)I
        at org.apache.nutch.crawl.Generator$Selector.getPartition(Generator.java:171)
        at org.apache.nutch.crawl.Generator$Selector.getPartition(Generator.java:83)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:464)
        at org.apache.nutch.crawl.Generator$Selector.map(Generator.java:165)
        at org.apache.nutch.crawl.Generator$Selector.map(Generator.java:83)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)

Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
        at org.apache.nutch.crawl.Generator.generate(Generator.java:457)
        at org.apache.nutch.crawl.Generator.generate(Generator.java:394)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:116)



[jira] Commented: (NUTCH-634) Patch - Nutch - Hadoop 0.17.1

Jorge Spinsanti (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12615933#action_12615933 ]

Andrzej Bialecki  commented on NUTCH-634:
-----------------------------------------

Please make sure your environment is clean - i.e. there are no leftover classes from previous versions of Hadoop or Nutch. I tested Generator again, and I can't reproduce this error.
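
For context on why stale classes can produce this error: an AbstractMethodError generally means a class was compiled against a different version of an interface than the one loaded at run time. The descriptor in the trace, (Ljava/lang/Object;Ljava/lang/Object;I)I, names a method taking two Objects and an int and returning int. The sketch below is only an illustration under assumptions: the Partitioner interface and ByHashPartitioner class are hypothetical stand-ins, not the Hadoop or Nutch classes. It shows that a class compiled against an unbounded generic interface carries a compiler-generated bridge method with exactly that erased signature, which a class file left over from an older build might lack.

```java
import java.lang.reflect.Method;

public class PartitionSignatureCheck {
    // Hypothetical stand-in for an unbounded generic partitioner interface;
    // after type erasure its method is getPartition(Object, Object, int).
    public interface Partitioner<K, V> {
        int getPartition(K key, V value, int numPartitions);
    }

    // Hypothetical implementation compiled against the interface above.
    // javac adds a synthetic bridge method getPartition(Object, Object, int).
    public static class ByHashPartitioner implements Partitioner<String, String> {
        @Override
        public int getPartition(String key, String value, int numPartitions) {
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    // Returns true if cls itself declares getPartition(Object, Object, int),
    // i.e. the method the JVM looks up for descriptor
    // (Ljava/lang/Object;Ljava/lang/Object;I)I.
    public static boolean hasErasedGetPartition(Class<?> cls) {
        for (Method m : cls.getDeclaredMethods()) {
            Class<?>[] p = m.getParameterTypes();
            if (m.getName().equals("getPartition") && p.length == 3
                    && p[0] == Object.class && p[1] == Object.class
                    && p[2] == int.class) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasErasedGetPartition(ByHashPartitioner.class));
    }
}
```

A similar reflective check could be pointed at a suspect class on the cluster classpath; if the erased method is missing, the class file most likely predates the interface change and a clean rebuild is the first thing to try.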


[jira] Commented: (NUTCH-634) Patch - Nutch - Hadoop 0.17.1

Jorge Spinsanti (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12615975#action_12615975 ]

Roman Valls commented on NUTCH-634:
-----------------------------------

Sure, it was my fault :/

ant clean && ant solved the problem; the crawl is now progressing as it should.

Thanks !

PS: I've also run the test suite, and there are errors after cleaning the environment:

hadoop@braintop:~/nutch$ ant test | grep -i failed
    [junit] Test org.apache.nutch.crawl.TestCrawlDbMerger FAILED
    [junit] Test org.apache.nutch.crawl.TestGenerator FAILED
    [junit] Test org.apache.nutch.crawl.TestInjector FAILED
    [junit] Test org.apache.nutch.crawl.TestLinkDbMerger FAILED
    [junit] Test org.apache.nutch.crawl.TestMapWritable FAILED
    [junit] Test org.apache.nutch.fetcher.TestFetcher FAILED
    [junit] Test org.apache.nutch.indexer.TestDeleteDuplicates FAILED
    [junit] Test org.apache.nutch.searcher.TestDistributedSearch FAILED



[jira] Commented: (NUTCH-634) Patch - Nutch - Hadoop 0.17.1

Jorge Spinsanti (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12619966#action_12619966 ]

Andrzej Bialecki  commented on NUTCH-634:
-----------------------------------------

Let's handle these in a separate issue - could you please create it? I'm closing this one.


[jira] Closed: (NUTCH-634) Patch - Nutch - Hadoop 0.17.1

Jorge Spinsanti (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  closed NUTCH-634.
-----------------------------------

    Resolution: Fixed


Re: [jira] Commented: (NUTCH-634) Patch - Nutch - Hadoop 0.17.1

brainstorm-2-2
Sure, I was about to create a new ticket for this, but I stopped,
thinking that it is my fault (again); otherwise Hudson would have
complained already, right? Is your test run OK?

On Tue, Aug 5, 2008 at 7:30 PM, Andrzej Bialecki  (JIRA)
<[hidden email]> wrote:

>
>    [ https://issues.apache.org/jira/browse/NUTCH-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12619966#action_12619966 ]
>
> Andrzej Bialecki  commented on NUTCH-634:
> -----------------------------------------
>
> Let's handle these in a separate issue - could you please create it? I'm closing this one.
>

Re: [jira] Commented: (NUTCH-634) Patch - Nutch - Hadoop 0.17.1

Andrzej Białecki-2
brainstorm wrote:
> Sure, I was about to create a new ticket for this, but I stopped,
> thinking that it is my fault (again); otherwise Hudson would have
> complained already, right? Is your test run OK?

It does, but only when I run it locally - when run on a Hadoop cluster
it fails. I'll create a separate issue for that.


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
