Nutch 2.3.1 with Mongo datastore - No Document is getting indexed.

Puneet Dhanda
Hi,

I am using Nutch 2.3.1 with MongoDB as the datastore. While crawling
sites, I get the following error. Please advise what could be wrong
here.

Hadoop.log exception
2018-08-15 09:56:42,139 INFO  httpclient.HttpMethodDirector - Retrying request
2018-08-15 09:56:42,139 INFO  httpclient.HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Connection refused (Connection refused)
2018-08-15 09:56:42,139 INFO  httpclient.HttpMethodDirector - Retrying request
2018-08-15 09:56:42,242 ERROR httpclient.Http - Failed with the following error:
java.net.ConnectException: Connection refused (Connection refused)
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
2018-08-15 09:56:46,409 INFO  fetcher.FetcherJob - 0/0 spinwaiting/active, 2 pages, 2 errors, 0.4 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues


No Solr Document Indexed
2018-08-15 09:57:01,318 INFO  solr.SolrIndexWriter - Total 0 document is added.


MongoDB
> show dbs
admin    0.000GB
config   0.000GB
local    0.000GB
nutchdb  0.000GB

Re: Nutch 2.3.1 with Mongo datastore - No Document is getting indexed.

lewis john mcgibbney-2
Hi Puneet
Responses inline

On Wed, Aug 15, 2018 at 7:20 AM <[hidden email]> wrote:

>
> From: Puneet Dhanda <[hidden email]>
> To: [hidden email]
> Cc:
> Bcc:
> Date: Wed, 15 Aug 2018 10:02:12 -0400
> Subject: Nutch 2.3.1 with Mongo datastore - No Document is getting indexed.
> Hi,
>
> I am using Nutch 2.3.1 with MongoDB as the datastore.


Are you using it from SCM or from the release? If I were you, I would use it
from SCM; we fixed a few bugs in there.


> While crawling
> sites, I get the following error. Please advise what could be wrong
> here.
>
> Hadoop.log exception
> 2018-08-15 09:56:42,139 INFO  httpclient.HttpMethodDirector - Retrying request
> 2018-08-15 09:56:42,139 INFO  httpclient.HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Connection refused (Connection refused)
> 2018-08-15 09:56:42,139 INFO  httpclient.HttpMethodDirector - Retrying request
> 2018-08-15 09:56:42,242 ERROR httpclient.Http - Failed with the following error:
> java.net.ConnectException: Connection refused (Connection refused)
>     at java.net.PlainSocketImpl.socketConnect(Native Method)
>     at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>     at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>     at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> 2018-08-15 09:56:46,409 INFO  fetcher.FetcherJob - 0/0 spinwaiting/active, 2 pages, 2 errors, 0.4 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
>

You may wish to use the parsechecker tool to confirm that you can reach
the 2 failed URLs without running a full crawl:
https://wiki.apache.org/nutch/bin/nutch%20parsechecker
You can also set DEBUG or TRACE logging for this tool; see
https://github.com/apache/nutch/blob/2.x/conf/log4j.properties#L40
Lewis
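
A "Connection refused" in the fetcher log generally means the TCP connection to the target host and port was rejected, so nothing was fetched and there was nothing to parse or index. As a rough complement to parsechecker, the sketch below checks from outside Nutch whether a URL's host and port accept connections at all. It is only an illustration: the function name can_connect is made up for this example, it uses nothing but the Python standard library, and it does not reproduce any proxy or agent settings from nutch-site.xml.

```python
import socket
from urllib.parse import urlsplit


def can_connect(url, timeout=5.0):
    """Return True if a plain TCP connection to the URL's host and port succeeds.

    Hypothetical helper for illustration; it is not part of Nutch.
    """
    parts = urlsplit(url)
    host = parts.hostname
    if host is None:
        raise ValueError(f"no host in URL: {url!r}")
    # Fall back to the scheme's default port when the URL does not name one.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused connections, timeouts and DNS failures are all OSError subclasses.
        return False
```

Calling can_connect() with each of the two URLs that failed, from the same machine that runs the crawl, should show whether the refusal also happens outside Nutch. If it returns False there as well, the cause is more likely the target server, a firewall, or a wrong host or port in the seed list than the MongoDB datastore.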