Nutch 2.x does not send index to ElasticSearch 2.3.3

Nutch 2.x does not send index to ElasticSearch 2.3.3

devil devil
Hello, 
    I am running Nutch 2.x and Elasticsearch 2.3.3 in two containers. I can log into the Nutch container and curl Elasticsearch, so connectivity is there. Inject, fetch, etc. all work fine. However, when I get to nutch index elasticsearch, all I get is:
 
    root@b211135e1be5:~/nutch/bin# ./nutch index elasticsearch -all
    IndexingJob: starting
    Active IndexWriters :
    ElasticIndexWriter
         elastic.cluster : elastic prefix cluster
        elastic.host : hostname
        elastic.port : port  (default 9300)
        elastic.index : elastic index command 
        elastic.max.bulk.docs : elastic bulk index doc counts. (default 250) 
        elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
 
   I have tried various Elasticsearch versions and various combinations of settings, but I am still getting nowhere.
   My elasticsearch.conf is empty (should I have something in it?).
   Below is my nutch-site.xml (I was using indexer-elastic before but was getting the "No indexwriters found" errors; then I saw there is an indexer-elastic2 plugin).
 

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>parser.character.encoding.default</name>
    <value>utf-8</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-elastic2</value>
    <description>plugins</description>
  </property>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
    <description>Default class for storing data</description>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>Crawler</value>
  </property>  
  <property>
    <name>http.robots.agents</name>
    <value>Crawler,*</value>
  </property>  
  <property>
    <name>http.robots.403.allow</name>
    <value>true</value>
  </property>
  <property>
    <name>http.timeout</name>
    <value>120000</value>
    <description>The default network timeout, in milliseconds.</description>
  </property>
  <property>
    <name>http.useHttp11</name>
    <value>true</value>
  </property>
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
  <property>
    <name>file.content.limit</name>
    <value>-1</value>
  </property>
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
  <property>
    <name>db.ignore.external.links.mode</name>
    <value>byDomain</value>
  </property>
  <property>
    <name>db.ignore.internal.links</name>
    <value>false</value>
  </property>
  <property>
    <name>generate.update.crawldb</name>
    <value>true</value>
  </property>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>10</value>
  </property>
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>10</value>
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>1.0</value>
    <description>The number of seconds the fetcher will delay between 
     successive requests to the same server.</description>
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>10</value>
    <description>This number is the maximum number of threads that
      should be allowed to access a host at one time.</description>
  </property>  
  <property>
    <name>db.fetch.interval.default</name>
    <value>18000</value>
    <description>The number of seconds between re-fetches of a page (5hours).</description>
  </property>  
  <property>
    <name>db.fetch.interval.max</name>
    <value>43200</value>
  </property>
  <property>
    <name>elastic.host</name>
    <value>172.20.128.4</value>
  </property>
  <property>
    <name>elastic.port</name>
    <value>9300</value>
  </property>
  <property>
    <name>elastic.cluster</name>
    <value>elasticsearch</value>
  </property>
  <property>
    <name>elastic.index</name>
    <value>nutchindex</value>
  </property>
  <property>
    <name>parser.character.encoding.default</name>
    <value>utf-8</value>
  </property>
  <property>
    <name>http.content.limit</name>
    <value>6553600</value>
  </property>
  <property>
    <name>elastic.max.bulk.docs</name>
    <value>250</value>
    <description>Maximum size of the bulk in number of documents.</description>
  </property>
  <property>
    <name>elastic.max.bulk.size</name>
    <value>2500500</value>
    <description>Maximum size of the bulk in bytes.</description>
  </property>
</configuration>
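
For what it's worth, the elastic.cluster value above has to match the cluster name the ES nodes report, because the transport client rejects nodes from a different cluster by default. A quick way to double-check (assuming the Elasticsearch container also exposes the HTTP port 9200) is:

    # run from inside the nutch container; the IP is just what my setup uses
    curl http://172.20.128.4:9200/
    # the JSON reply contains "cluster_name", which must equal elastic.cluster
    curl http://172.20.128.4:9200/_nodes/transport?pretty
    # shows the address/port each node's transport layer (9300) is bound to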
 
   
Re: Nutch 2.x does not send index to ElasticSearch 2.3.3

lewis john mcgibbney-2
Hi Devil,
Do your logs indicate any issues?
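If you are running from the local runtime, the indexing job normally writes to logs/hadoop.log under the runtime directory, e.g. (a rough sketch, paths depend on your install):

    # assuming a source build run from runtime/local
    cd $NUTCH_HOME/runtime/local
    tail -n 200 logs/hadoop.log | grep -iE 'elastic|exception|error'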
Lewis

Re: Nutch 2.x does not send index to ElasticSearch 2.3.3

devil devil
Hi Lewis,

   Looking through the logs, I did find the error below. I was reading that if Nutch can't find Elasticsearch it will default to Solr (which would explain the last portion of the error).
   I don't understand why Nutch can't find the ES node.
   I have verified that:

      1) the ES port is 9300 (in nutch-site.xml);
      2) the ES cluster name is the same (in nutch-site.xml and at http://localhost:9200);
      3) I have static IPs in my docker-compose.yml (roughly as in the sketch below) and from the Nutch container I can ping 172.20.128.4 (the ES container's IP).
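
   The static-IP setup I mean is roughly the following (service and image names are placeholders, not my exact file):

      # docker-compose.yml (sketch)
      version: '2'
      services:
        nutch:
          image: my-nutch-image            # placeholder image name
          networks:
            crawlnet:
              ipv4_address: 172.20.128.3   # example address for the Nutch container
        elasticsearch:
          image: elasticsearch:2.3.3
          networks:
            crawlnet:
              ipv4_address: 172.20.128.4   # the address Nutch points at
      networks:
        crawlnet:
          driver: bridge
          ipam:
            config:
              - subnet: 172.20.128.0/24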

Thanks
2017-12-28 14:01:32,986 INFO  elastic2.ElasticIndexWriter - Processing remaining requests [docs = 116, length = 1133796, total docs = 116]
2017-12-28 14:01:32,987 INFO  elastic2.ElasticIndexWriter - Processing remaining requests [docs = 116, length = 1133796, total docs = 116]
2017-12-28 14:01:32,988 WARN  mapred.LocalJobRunner - job_local387272193_0001
java.lang.Exception: NoNodeAvailableException[None of the configured nodes are available: [{#transport#-1}{172.20.128.4}{172.20.128.4:9300}]]
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: NoNodeAvailableException[None of the configured nodes are available: [{#transport#-1}{172.20.128.4}{172.20.128.4:9300}]]
    at org.elasticsearch.client.transport.TransportClientNodesService.ensureNodesAreAvailable(TransportClientNodesService.java:290)
    at org.elasticsearch.client.transport.TransportClientNodesService.execute(TransportClientNodesService.java:207)
    at org.elasticsearch.client.transport.support.TransportProxyClient.execute(TransportProxyClient.java:55)
    at org.elasticsearch.client.transport.TransportClient.doExecute(TransportClient.java:286)
    at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:351)
    at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:85)
    at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:59)
    at org.apache.nutch.indexwriter.elastic2.ElasticIndexWriter.commit(ElasticIndexWriter.java:208)
    at org.apache.nutch.indexwriter.elastic2.ElasticIndexWriter.close(ElasticIndexWriter.java:226)
    at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:116)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:54)
    at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:647)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:770)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2017-12-28 14:01:33,847 ERROR indexer.IndexingJob - SolrIndexerJob: java.lang.RuntimeException: job failed: name=apache-nutch-2.4-SNAPSHOT.jar, jobid=job_local387272193_0001
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:158)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:197)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:206)
 

Re: Nutch 2.x does not send index to ElasticSearch 2.3.3

lewis john mcgibbney-2
In reply to this post by devil devil
Hi Devil,
Does this help you out?
https://stackoverflow.com/questions/33691858/elasticsearch-nonodeavailableexception
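In particular, two things that commonly produce NoNodeAvailableException with a 2.x transport client are a cluster name mismatch and the fact that Elasticsearch 2.x binds to localhost by default, so port 9300 is not reachable from another container unless network.host is changed. Something along these lines in elasticsearch.yml may be what is missing (a sketch, adjust to your setup):

    # elasticsearch.yml on the ES container (sketch)
    cluster.name: elasticsearch    # must match elastic.cluster in nutch-site.xml
    network.host: 0.0.0.0          # 2.x binds to loopback by default; bind to all
                                   # interfaces so 9300 is reachable from the Nutch container

It is also worth checking that the Elasticsearch client jars bundled with the indexer-elastic2 plugin match the 2.3.3 server, since the transport client is version-sensitive.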
Lewis



--
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney