Deploying nutch


Deploying nutch

Kevin MacDonald-3
I am trying to simplify my deployment by figuring out the minimum set of files
that I need to deploy. If I create the nutch-*.job file and run it using the
bin/nutch shell script, I get NoClassDefFoundError exceptions. It seems that
I have to have both the lib and build folders present, but when I look
inside the nutch-*.job file it looks like everything is there. Do I need to
modify the bin/nutch shell script to include additional things in the
classpath that are in the job file?
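(The job file is just a zip archive, so to look inside it I am listing its
contents with something like:

    unzip -l nutch-*.job

and everything appears to be in there.)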

More generally my question is - can I deploy nutch using only the
nutch-*.job file?

Thanks

Kevin

Re: Deploying nutch

Kevin MacDonald-3
With a little experimentation I was able to get a deployment to work using
only the nutch-*.job file. I extracted its contents to a folder and was able
to run a crawl. However, I could only make this work by modifying the "job"
target, which for some reason excludes the hadoop files as shown below. Why
is hadoop excluded from the job file?

<target name="job" depends="compile">
    <jar jarfile="${build.dir}/${final.name}.job">
      <zipfileset dir="${build.classes}"/>
      <zipfileset dir="${conf.dir}" excludes="*.template,hadoop*.*"/>
      <zipfileset dir="${lib.dir}" prefix="lib"
                  includes="**/*.jar" excludes="hadoop-*.jar"/>
      <zipfileset dir="${build.plugins}" prefix="plugins"/>
    </jar>
</target>
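For reference, my modification was simply to drop the hadoop excludes, so
that the hadoop jars and the hadoop*.* config files get bundled as well -
roughly:

<target name="job" depends="compile">
  <jar jarfile="${build.dir}/${final.name}.job">
    <zipfileset dir="${build.classes}"/>
    <!-- hadoop*.* config files no longer excluded -->
    <zipfileset dir="${conf.dir}" excludes="*.template"/>
    <!-- hadoop-*.jar no longer excluded -->
    <zipfileset dir="${lib.dir}" prefix="lib" includes="**/*.jar"/>
    <zipfileset dir="${build.plugins}" prefix="plugins"/>
  </jar>
</target>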



Re: Deploying nutch

Andrzej Białecki-2
Kevin MacDonald wrote:

> [...] Why is hadoop excluded from the job file?

Hi Kevin,

To answer your earlier question: yes, it's possible to deploy Nutch
using only the job file, but it works seamlessly only with a distributed
Hadoop cluster. That is, if you already have a Hadoop cluster (installed
using a compatible Hadoop binary release) running in non-local mode
(i.e. with a real JobTracker), then you can run Nutch on this cluster
using bin/hadoop jar nutch*.job <className> <args>, and you don't need
anything else from the Nutch distribution.
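For example, the whole crawl cycle can be driven like this (the class name
and arguments are only illustrative - check the classes under
org.apache.nutch for your release):

    bin/hadoop jar nutch-*.job org.apache.nutch.crawl.Crawl urls -dir crawl -depth 3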

This doesn't work as seamlessly if you use Hadoop in local mode - in
this case you need at least to unpack the plugins/ folder and put it on
your classpath. Only then can you start using Nutch classes that depend
on plugins.
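A rough sketch of what that can look like (the paths are just an example;
the plugin.folders property in nutch-default.xml / nutch-site.xml controls
where Nutch looks for plugins, by default a directory named "plugins"
resolved against the classpath):

    # the job file is an ordinary zip archive
    unzip nutch-*.job -d /path/to/nutch-runtime
    # make the directory that contains plugins/ visible on the classpath
    export CLASSPATH=$CLASSPATH:/path/to/nutch-runtime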

All this probably sounds messy, and it is ... my apologies. The reason
why build.xml excludes hadoop artifacts from nutch*.job is that if you
were to deploy such nutch*.job to a "clean" Hadoop cluster, suddenly you
would get two copies of the same resources (hadoop libs,
hadoop-site.xml, mapred-default.xml), one coming from the Hadoop cluster
and the other coming from the nutch*.job. Depending on your classpath
settings, it's not always clear which one would take precedence.

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Deploying nutch

Kevin MacDonald-3
What seems to work for my simple usage of Nutch is to include hadoop
artifacts in the job file, and when I deploy it I extract the job file to a
folder and away I go. So, plugins do wind up in my classpath because I
extract everything. This allows me to crawl and then dump the link database,
which is all I need for the moment. Unless that sounds broken, I will proceed
with that. In the future we may need a Hadoop cluster and will rethink it then.
Thanks for the response!

Kevin


nutch speed problem

zhengping deng

hi,

I am running nutch-0.8.1 on 4 machines (HP DL380 G4, 2c/4G) with 1 namenode
and 3 datanodes. I am using them to crawl the internet now, but I find it
too slow. I have spent nearly 8 hours crawling, yet the size of the crawl
directory on HDFS is only 1.1G. If I crawl with Nutch on a single machine,
it can reach more than 10G; if I use larbin, it can fetch more than 100G on
one of my machines.

What is the problem? How can I improve the speed of Nutch in distributed
mode? Can Nutch/Hadoop really be competitive as a search engine?

My main config is below:

    hadoop-site.xml:
        mapred.map.tasks          15
        mapred.reduce.tasks       15
        dfs.replication           2

    nutch-site.xml:
        fetcher.threads.fetch     500
        fetcher.threads.per.host  20
        parser.threads.parse      500

Thank you for all your help.
Mark Deng

how to improve nutch crawl speed?

zhengping deng
In reply to this post by Kevin MacDonald-3

hi,

I am running nutch-0.8.1 on 4 machines (HP DL380 G4, 2c/4G) with 1 namenode
and 3 datanodes. I am using them to crawl the internet now, but I find it
too slow. I have spent nearly 8 hours crawling, yet the size of the crawl
directory on HDFS is only 1.1G. If I crawl with Nutch on a single machine,
it can reach more than 10G; if I use larbin, it can fetch more than 100G on
one of my machines.

What is the problem? How can I improve the speed of Nutch in distributed
mode? Can Nutch/Hadoop really be competitive as a search engine?

My main config is below:

    hadoop-site.xml:
        mapred.map.tasks          15
        mapred.reduce.tasks       15
        dfs.replication           2

    nutch-site.xml:
        fetcher.threads.fetch     500
        fetcher.threads.per.host  20
        parser.threads.parse      500

Thank you for all your help.

Mark Deng


RE: how to improve nutch crawl speed?

Edward Quick


I'm no expert on Nutch, but if this is the same problem I had, try setting fetcher.server.delay to 0.1 in nutch-site.xml.
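In case it's useful, the override would look something like this in
conf/nutch-site.xml (the comment is mine):

    <property>
      <name>fetcher.server.delay</name>
      <value>0.1</value>
      <!-- seconds the fetcher waits between successive requests
           to the same server -->
    </property>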

Ed.
