Nutch and Hadoop

Nutch and Hadoop

payo
hi

I am working with nutch-0.8.1 and I am trying to configure Hadoop. The bin directory contains these files:

 hadoop, hadoop-daemon, hadoop-daemons, nutch, rcc, slaves, start-all, start-dfs, start-mapred, stop-all, stop-dfs, stop-mapred

Are these files all I need to run Nutch with Hadoop, or do I have to download Hadoop separately and install it?

I am following

http://wiki.apache.org/nutch/NutchHadoopTutorial

thanks




Re: Nutch and Hadoop

John Mendenhall
> I am working with nutch-0.8.1 and I am trying to configure Hadoop. The bin
> directory contains these files:
>
>  hadoop, hadoop-daemon, hadoop-daemons, nutch, rcc, slaves, start-all,
>  start-dfs, start-mapred, stop-all, stop-dfs, stop-mapred
>
> Are these files all I need to run Nutch with Hadoop, or do I have to
> download Hadoop separately and install it?
>
> I am following
>
> http://wiki.apache.org/nutch/NutchHadoopTutorial
>
> thanks

Everything you need to run hadoop with nutch is in the
nutch download, at least it is with nutch 0.9.  The
items you list above from the bin directory are the
same ones I used to get hadoop going.

Make sure you follow all the directions in the tutorial.
There are also several other tutorials that say basically
the same thing, so the instructions are sound.

Make sure you understand your configuration files and
what you are setting.
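For a two-node Nutch 0.8/0.9 setup, the files that usually matter are
roughly these (a rough checklist, assuming the stock layout from the
tutorial):

  conf/hadoop-env.sh    - JAVA_HOME, HADOOP_HOME, log directory
  conf/hadoop-site.xml  - fs.default.name, mapred.job.tracker, data directories
  conf/slaves           - one slave hostname or IP per line
  conf/nutch-site.xml   - crawl settings such as http.agent.name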

JohnM

--
john mendenhall
[hidden email]
surf utopia
internet services

Re: Nutch and Hadoop

payo
I am trying to configure Nutch and Hadoop on two PCs, but I have some questions:

1.- Do I have to install Nutch on both PCs, or only on the master node?

2.- Will Hadoop help me reduce the time my crawl and my search take?

3.- Do I only need to create SSH keys so my PCs can communicate?

thanks

Re: Nutch and Hadoop

payo
When I execute ant package, as described in the tutorial ("To build nutch call the package ant task like this: ant package"), it shows me this:


[xslt] Processing /home/admvdta/nutch-0.8/conf/nutch-default.xml to /home/admvdta/nutch-0.8/build/nutch.xml
     [xslt] Loading stylesheet /home/admvdta/nutch-0.8/conf/context.xsl
     [xslt] [Fatal Error] context.xsl:18:23: Invalid byte 2 of 3-byte UTF-8 sequence.
     [xslt] : Error! Invalid byte 2 of 3-byte UTF-8 sequence.
     [xslt] : Fatal Error! Could not compile stylesheet
     [xslt] Failed to process /home/admvdta/nutch-0.8/conf/nutch-default.xml


BUILD FAILED
/home/admvdta/nutch-0.8/build.xml:151: Fatal error during transformation


Line 151 of build.xml is this:

149 <xslt in="${basedir}/conf/nutch-default.xml"
150          out="${build.dir}/nutch.xml"
151          style="${basedir}/conf/context.xsl">
152        <xmlcatalog refid="docDTDs"/>
153     <outputproperty name="indent" value="yes"/>
154    </xslt>

what is the problem?

thanks

Re: Nutch and Hadoop

payo
I resolved the problem!!

In conf/context.xsl I changed

<?xml version="1.0" encoding="UTF-8"?>

to

<?xml version="1.0" encoding="iso-8859-1"?>

Is this correct?

I read this:

http://www.openrdf.org/doc/sesame/users/ch09.html#d0e3707
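
(Changing the declaration only works if the file's bytes really are Latin-1.
An alternative sketch, assuming the file is supposed to stay UTF-8, is to
re-encode it instead of relabelling it:

  # re-encode conf/context.xsl from Latin-1 to valid UTF-8
  iconv -f ISO-8859-1 -t UTF-8 conf/context.xsl > conf/context.xsl.utf8
  mv conf/context.xsl.utf8 conf/context.xsl

Either way, the declared encoding has to match the actual bytes in the file.)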

Re: Nutch and Hadoop

payo
hi to all

How can I configure hadoop-site.xml, in particular these properties:

1.- fs.default.name
2.- mapred.job.tracker
3.- mapred.tasktracker.tasks.maximum

In general, for hadoop-site.xml I am working with two machines, one as the master node and one as a slave.

thanks
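
(For reference, a minimal hadoop-site.xml for a two-node setup might look
roughly like this; "master" and the port numbers are placeholders, not
values taken from this thread:

  <?xml version="1.0"?>
  <configuration>
    <property>
      <name>fs.default.name</name>
      <value>master:9000</value>    <!-- host:port where the namenode runs -->
    </property>
    <property>
      <name>mapred.job.tracker</name>
      <value>master:9001</value>    <!-- host:port where the jobtracker runs -->
    </property>
    <property>
      <name>mapred.tasktracker.tasks.maximum</name>
      <value>2</value>              <!-- max simultaneous tasks per tasktracker -->
    </property>
  </configuration>

The same file is normally copied to both machines.)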

Re: Nutch and Hadoop

payo
I created my SSH keys and I can log in to the slave node over SSH without being prompted for a password,

but when I execute this on the master node:

./bin/start-all.sh

it shows me this:

[user@emcvaalkm01 search]# ./bin/start-all.sh
starting namenode, logging to /nutch-0.8.1/search/logs/hadoop-root-namenode-emcvaalkm01.estafeta.com.out
user@localhost's password:


what is the problem?

Re: Nutch and Hadoop

Barry Haddow
Hi

You need to add your public key to .ssh/authorized_keys on the master as
well as on the slave.  Also, make sure that this file is not writable by
anyone but you.
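
For example, something along these lines on each machine should do it
(assuming the key was generated as ~/.ssh/id_rsa.pub and the same account
exists on both nodes):

  # append the public key to this machine's authorized_keys
  cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  # neither ~/.ssh nor authorized_keys may be writable by anyone but you
  chmod 700 ~/.ssh
  chmod 600 ~/.ssh/authorized_keys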

regards
Barry

On Thursday 07 February 2008, payo wrote:

> I created my SSH keys and I can log in to the slave node over SSH without
> being prompted for a password,
>
> but when I execute this on the master node:
>
> ./bin/start-all.sh
>
> it shows me this:
>
> [user@emcvaalkm01 search]# ./bin/start-all.sh
> starting namenode, logging to
> /nutch-0.8.1/search/logs/hadoop-root-namenode-emcvaalkm01.estafeta.com.out
> user@localhost's password:
>
>
> what is the problem?


Re: Nutch and Hadoop

payo
I generated my keys with ssh-keygen -t rsa.

I can log in from master to slave and from slave to master via SSH without a password, but when I execute start-all.sh

it shows me this:


[nutch@emcvaalkm01 search]$ ./bin/start-all.sh
namenode running as process 23240. Stop it first.
nutch@localhost's password:
localhost: starting datanode, logging to /nutch-0.8.1/search/logs/hadoop-nutch-datanode-emcvaalkm01.estafeta.com.out
jobtracker running as process 23277. Stop it first.
nutch@localhost's password:
localhost: starting tasktracker, logging to /nutch-0.8.1/search/logs/hadoop-nutch-tasktracker-emcvaalkm01.estafeta.com.out


Also, when I execute this:


[nutch@emcvaalkm01 search]$ ./bin/hadoop dfs -put urls urls
put: Connection refused

what is the problem?

thanks
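
(A plausible cleanup here, assuming the leftover daemons need to be stopped
and that conf/slaves should list the slave's address rather than localhost,
would be roughly:

  ./bin/stop-all.sh     # stop the daemons that are already running
  cat conf/slaves       # should contain the slave's hostname or IP, not localhost
  ./bin/start-all.sh

The password prompts for nutch@localhost also go away once the master can
ssh to itself without a password, as described above.)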

Re: Nutch and Hadoop

payo
hi to all

I solved this:

[nutch@emcvaalkm01 search]$ ./bin/start-all.sh
starting namenode, logging to /nutch-0.8.1/search/logs/hadoop-nutch-namenode-emcvaalkm01.estafeta.com.out
192.168.200.73: starting datanode, logging to /nutch-0.8.1/search/logs/hadoop-nutch-datanode-esclavo.estafeta.com.out
starting jobtracker, logging to /nutch-0.8.1/search/logs/hadoop-nutch-jobtracker-emcvaalkm01.estafeta.com.out
192.168.200.73: starting tasktracker, logging to /nutch-0.8.1/search/logs/hadoop-nutch-tasktracker-esclavo.estafeta.com.out
[nutch@emcvaalkm01 search]$ ./bin/stop-all.sh



but any idea about this?

[nutch@emcvaalkm01 search]$ ./bin/hadoop dfs -put urls urls
put: Connection refused

solved !!!!



but now it shows me this:
[nutch@emcvaalkm01 search]$ ./bin/nutch crawl urls -dir crawled -depth 3
crawl started in: crawled
rootUrlDir = urls
threads = 10
depth = 3
Injector: starting
Injector: crawlDb: crawled/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" org.apache.hadoop.ipc.RemoteException: java.io.IOException: failed to create file /user/nutch/$/nutch-0.8.1/filesystem/mapreduce/system/submit_byrgr6/.job.jar.crc on client emcvaalkm01.estafeta.com because target-length is 0, below MIN_REPLICATION (1)
        at org.apache.hadoop.dfs.FSNamesystem.startFile(FSNamesystem.java:388)
        at org.apache.hadoop.dfs.NameNode.create(NameNode.java:159)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:243)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:469)

        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:159)

what is the problem??


thanks
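
(This MIN_REPLICATION error generally means the namenode does not see any
live datanodes, so nothing can be replicated even once. A quick check,
reusing the paths from the logs above, might be:

  ./bin/hadoop dfs -ls /      # does basic DFS access work at all?
  # on the slave: did the datanode die or never connect?
  tail -50 /nutch-0.8.1/search/logs/hadoop-nutch-datanode-esclavo.estafeta.com.out

If the datanode log shows an error, the datanode side is what needs fixing.)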


Re: Nutch and Hadoop

payo
Now I have this:

[nutch@emcvaalkm01 search]$ ./bin/nutch crawl urls -dir crawled -depth 3
crawl started in: crawled
rootUrlDir = urls
threads = 10
depth = 3
Injector: starting
Injector: crawlDb: crawled/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)


any idea?

thanks
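
("Job failed!" on its own only says that some map or reduce task failed; the
underlying exception is in the Hadoop logs rather than on the console. A
rough way to look for it, assuming the log directory shown earlier in the
thread:

  ls /nutch-0.8.1/search/logs/
  grep -i exception /nutch-0.8.1/search/logs/*

The tasktracker log on whichever node ran the failing task is usually the
one with the real stack trace.)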