Crawl failing when using hadoop

Karthik Ramesh
Hi,

I have just started using Hadoop to run Nutch crawls on a cluster of 5 servers. I am using Nutch 0.9.
I have gone through the initial setup as described in http://wiki.apache.org/nutch/NutchHadoopTutorial.

I am also able to start all of the servers with start-all.sh and to upload the list of URLs to the DFS.
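The crawl itself is started roughly like this (the seed directory, crawl directory, depth and thread count are taken from the log that follows; the exact paths may differ):

        # copy the seed URL list into the DFS
        bin/hadoop dfs -put urls urls
        # one-step crawl: seed dir "urls", output dir "crawled", depth 3, 10 threads
        bin/nutch crawl urls -dir crawled -depth 3 -threads 10

But after I initiate the crawl, I get the following exception: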

crawl started in: crawled
rootUrlDir = urls
threads = 10
depth = 3
Injector: starting
Injector: crawlDb: crawled/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
        at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
        at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
        at java.net.Socket.connect(Socket.java:519)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:149)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:531)
        at org.apache.hadoop.ipc.Client.call(Client.java:458)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
        at $Proxy1.getProtocolVersion(Unknown Source)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:247)
        at org.apache.hadoop.mapred.JobClient.init(JobClient.java:208)
        at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:200)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:528)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)


Any idea where I could be going wrong?
Thanks,

- Karthik.





Re: Crawl failing when using hadoop

balachanthar
Hi Ramesh,

Check the urls file and crawl-urlfilter.txt for any stray spaces or blank lines; a quick check is shown below.
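(Assuming the seed list is still sitting in a local urls/ directory and the default conf/ layout; adjust the paths to wherever your copies live.)

        # print any empty lines, or lines with trailing whitespace, with line numbers
        grep -nE '^$|[[:space:]]$' urls/* conf/crawl-urlfilter.txt

Any line this flags should be cleaned up before injecting again.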

Thank you.
