Re: nutch vs hadoop versions


Re: nutch vs hadoop versions

Otis Gospodnetic
Dennis & Co.

Is the 0.15.* -> 0.16 upgrade seamless?  That is, a jar replacement and that's it, or is there an explicit HDFS upgrade step involved?

Thanks,
Otis

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----

> From: Dennis Kubes <[hidden email]>
> To: [hidden email]
> Sent: Saturday, February 9, 2008 11:44:34 PM
> Subject: Re: nutch vs hadoop versions
>
> Ah, time to upgrade to 0.16 :)  I will put it through the tests and some
> fetch cycles.  I was seeing some weird errors with 0.17, though, when
> injecting with no URLs passing filtering; I want to make sure these
> aren't there in 0.16.
>
> Dennis
>
> Andrzej Bialecki wrote:
> > Kenji Kawai wrote:
> >> Does anybody know when/how Nutch will catch up with the Hadoop
> >> versions?  Currently the Nutch trunk uses Hadoop 0.15.0, which results
> >> in a runtime no-method error when run with Hadoop 0.17.  We need to
> >> use 0.17.0 for our HBase applications.
> >
> > We try to upgrade Nutch to the latest official release of Hadoop soon
> > after it becomes available.  Yesterday it was 0.15.3, today it's 0.16.0
> > ;) so in a couple of days we will upgrade to that.  0.17 is still under
> > development, so we don't plan to upgrade Nutch to that version until
> > it's released or very close to being released.



Re: nutch vs hadoop versions

Dennis Kubes
You will need to upgrade the HDFS cluster, but it goes pretty quickly
this time.  The steps are:

1) Stop DFS and MapReduce.
2) Upgrade all slaves with the new Nutch/Hadoop 0.16.
3) Run bin/start-dfs.sh -upgrade.
4) When finished, run some MapReduce jobs to test.
5) When satisfied everything is working, run bin/hadoop dfsadmin -finalizeUpgrade.

Pretty easy, took me about 20 mins to upgrade the Search Wikia cluster.
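
For anyone following along, here is the same sequence as shell commands.
This is a rough sketch: it assumes the standard start/stop scripts that
ship with Hadoop 0.16, and the slave host names and install path are
made up for illustration.

  # 1) Stop MapReduce and DFS on the master
  bin/stop-mapred.sh
  bin/stop-dfs.sh

  # 2) Push the new Nutch/Hadoop 0.16 build to every slave
  #    (slave1, slave2 and /opt/nutch are hypothetical; substitute your own)
  for host in slave1 slave2; do
    rsync -az --delete /opt/nutch/ $host:/opt/nutch/
  done

  # 3) Bring DFS back up in upgrade mode
  bin/start-dfs.sh -upgrade

  # 4) Restart MapReduce and run some jobs to test
  bin/start-mapred.sh

  # 5) When satisfied everything works, make the upgrade permanent
  bin/hadoop dfsadmin -finalizeUpgrade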

Dennis



How to do nutch inject?

Duan, Nick
This is a newbie question.  Please forgive me if this has already been
answered somewhere.

I am trying to follow the Nutch 0.8 tutorial to run the Nutch crawler
over the web.  I tried to bootstrap the crawldb by injecting the URLs
obtained from DMOZ using the command:

bin/nutch inject crawl/crawldb dmoz

The following exception occurred:

Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: ../devel/dmoz
Injector: Converting injected urls to crawl db entries.
Injector: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
        at org.apache.nutch.crawl.Injector.run(Injector.java:192)
        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
        at org.apache.nutch.crawl.Injector.main(Injector.java:182)

The exception didn't offer much detail.  Any help is highly
appreciated.

ND

Re: How to do nutch inject?

Susam Pal
You will also find a logs/hadoop.log file. Do you find any clue there?

Maybe, instead of trying to inject DMOZ, you can try injecting a set of
4 to 10 URLs written in a file, then check the hadoop.log file to find
out what is going wrong.
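
For example, something along these lines (the seed directory name and
the URLs here are just placeholders):

  # Put a handful of URLs, one per line, into a seed directory
  mkdir seed
  echo "http://lucene.apache.org/" >> seed/urls
  echo "http://www.apache.org/"    >> seed/urls

  # Inject only those, then look in the log for the real error
  bin/nutch inject crawl/crawldb seed
  tail -50 logs/hadoop.log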

Regards,
Susam Pal


jobtracker is local

Duan, Nick
Hi!  I am trying to follow the tutorial (version 0.8) and run the Nutch
generate command to create a fetchlist.  The result is as follows:

bin/nutch generate ../devel/crawl/crawldb ../devel/crawl/segments
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: ../devel/crawl/segments/20080220163725
Generator: filtering: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.

What does "jobtracker is local" mean here?  Shouldn't be multiple
partitions generated?

Thanks!

ND

Re: jobtracker is local

Andrzej Białecki
Nick Duan wrote:

> What does "jobtracker is local" mean here?  Shouldn't be multiple
> partitions generated?

This means that you are running Nutch on top of Hadoop in the so-called
"local" mode (i.e., in a single JVM), so there is no point in
distributing the fetchlist into many partitions, because in any case
there will be just a single JVM working with all URLs.
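
For reference, which mode you get is controlled by the mapred.job.tracker
property in the Hadoop configuration; its default value "local" gives the
single-JVM behaviour described above. A minimal hadoop-site.xml excerpt,
with a hypothetical JobTracker host:

  <property>
    <name>mapred.job.tracker</name>
    <!-- "local" = run in one JVM; host:port = submit to a real JobTracker -->
    <value>master.example.com:9001</value>
  </property>

With a real JobTracker configured (and fs.default.name pointing at your
namenode), the generator will split the fetchlist into multiple
partitions, typically one per map task.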


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com