Nutch 1.11 redirects and solr uniqueKey problems

Nutch 1.11 redirects and solr uniqueKey problems

André Schild
Hello,

We have a working installation of Nutch 1.6 and Solr 4.0.0.
Now we have tried to upgrade to Nutch 1.11 and Solr 6.4.0.

So far, crawling with 1.11 works as intended, but adding the documents to Solr fails because of the unique constraint on the id field.

We see this error when Nutch tries to submit to Solr:


java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Document contains multiple values for uniqueKey field: id=[http://www.mysite.ch/de/start.html, http://www.mysite.ch/]
        at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Document contains multiple values for uniqueKey field: id=[http://www.mysite.ch/de/start.html, http://www.mysite.ch/]
        at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
        at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:153)
        at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:115)
        at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
        at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:502)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:456)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
2017-01-30 12:16:41,274 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)

The URL http://www.mysite.ch redirects with a 301 status to http://www.mysite.ch/de/start.html

My solrindex-mapping.xml looks like this:

<mapping>
  <fields>
    <field dest="fullContent" source="content"/>
    <field dest="content" source="strippedContent"/>
    <field dest="title" source="title"/>
    <field dest="host" source="host"/>
    <field dest="segment" source="segment"/>
    <field dest="boost" source="boost"/>
    <field dest="digest" source="digest"/>
    <field dest="tstamp" source="tstamp"/>
    <field dest="id" source="url"/>
    <field dest="lang" source="lang"/>
    <field dest="metatag-description" source="metatag.description"/>
    <field dest="metatag-keywords" source="metatag.keywords"/>
    <copyField source="url" dest="url"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</mapping>

And the (relevant parts of the) solr schema:

  <uniqueKey>id</uniqueKey>

I see why this causes the problem.
How can I tell Nutch to submit only one URL (ideally the original URL) to Solr, and not both?
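
To illustrate what seems to be happening (a simplified sketch, not Nutch's actual code): after the 301 redirect, the indexed document apparently carries both the original and the redirect-target URL, so a mapping that copies the multi-valued url field into the Solr uniqueKey produces exactly the error above, while copying a single canonical id value would not.

```python
# Simplified sketch of the field mapping (NOT actual Nutch code):
# why mapping source="url" into the Solr uniqueKey fails after a redirect.

def map_fields(nutch_doc, mapping):
    """Apply a solrindex-mapping style (dest, source) mapping to a document."""
    solr_doc = {}
    for dest, source in mapping:
        if source in nutch_doc:
            solr_doc[dest] = nutch_doc[source]
    return solr_doc

# A crawled page whose original URL 301-redirected to its final URL.
# Assumption for this sketch: "id" holds one canonical URL, while "url"
# ends up with both the original and the redirect target.
nutch_doc = {
    "id": "http://www.mysite.ch/de/start.html",
    "url": ["http://www.mysite.ch/de/start.html", "http://www.mysite.ch/"],
    "title": "Start",
}

# Mapping id <- url copies BOTH values into the uniqueKey: Solr rejects it.
bad = map_fields(nutch_doc, [("id", "url"), ("title", "title")])
assert isinstance(bad["id"], list) and len(bad["id"]) == 2

# Mapping id <- id keeps a single value, which Solr accepts.
good = map_fields(nutch_doc, [("id", "id"), ("title", "title")])
assert good["id"] == "http://www.mysite.ch/de/start.html"
```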


André Schild

Aarboard AG<http://www.aarboard.ch/>
Egliweg 10
2560 Nidau
Switzerland
+41 32 332 97 14<tel:+41323329714>


Re: Nutch 1.11 redirects and solr uniqueKey problems

Sebastian Nagel
Hi André,

have a look at the changes made to address NUTCH-1708 [1] [2]
and try
      <field dest="id" source="id"/>
instead of
      <field dest="id" source="url"/>

Thanks,
Sebastian

[1] https://issues.apache.org/jira/browse/NUTCH-1708
[2] https://github.com/apache/nutch/commit/bad0a2076a8c724a0542b923ac10bb812c0de644?diff=unified
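
Applied to the mapping posted above, only the id line changes (a sketch; the other fields stay exactly as posted):

```xml
<mapping>
  <fields>
    <!-- ... other fields unchanged ... -->
    <field dest="id" source="id"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</mapping>
```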

On 01/30/2017 12:26 PM, André Schild wrote:

> Hello,
>
> we have a working installation of nutch 1.6 and solr 4.0.0
> Now we did try to upgrade to nutch 1.11 and solr 6.4.0.
>
> So far crawling works with 1.11 as intended, but adding the documents to solr fail because of the unique constraint of the id field.
>
> We see this error when nutch trys to submit to solr:
>
>
> java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Document contains multiple values for uniqueKey field: id=[http://www.mysite.ch/de/start.html, http://www.mysite.ch/]
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
> Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Document contains multiple values for uniqueKey field: id=[http://www.mysite.ch/de/start.html, http://www.mysite.ch/]
>         at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
>         at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
>         at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:153)
>         at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:115)
>         at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
>         at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:502)
>         at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:456)
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> 2017-01-30 12:16:41,274 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
>         at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
>         at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>         at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
>
> The url http://www.mysite.ch redirects with a 301 status to http://www.mysite.ch/de/start.html
>
> My solrindex-mapping.xml looks like this:
>
> <mapping>
>   <fields>
>                 <field dest="fullContent" source="content" />
>                 <field dest="content" source="strippedContent" />
>                 <field dest="title" source="title"/>
>                <field dest="host" source="host"/>
>                 <field dest="segment" source="segment"/>
>                 <field dest="boost" source="boost"/>
>                 <field dest="digest" source="digest"/>
>                 <field dest="tstamp" source="tstamp"/>
>                 <field dest="id" source="url"/>
>                 <field dest="lang" source="lang"/>
>                 <field dest="metatag-description" source="metatag.description" />
>                 <field dest="metatag-keywords" source="metatag.keywords" />
>                 <copyField source="url" dest="url"/>
>   </fields>
>   <uniqueKey>id</uniqueKey>
> </mapping>
>
> And the (relevant parts of the) solr schema:
>
>   <uniqueKey>id</uniqueKey>
>
> I see why this causes problems.
> How can I tell nutch to submit only one URL (Ideally the original url) to solr, and not both?
>
>
> André Schild
>
> Aarboard AG<http://www.aarboard.ch/>
> Egliweg 10
> 2560 Nidau
> Switzerland
> +41 32 332 97 14<tel:+41323329714>
>
>


Re: Nutch 1.11 redirects and solr uniqueKey problems

André Schild
Hello Sebastian,

>Hi André,
>
>have a look at the changes made to address NUTCH-1708 [1] [2] and try
>      <field dest="id" source="id"/>
>instead of
>      <field dest="id" source="url"/>
>

Thanks, this solved the problem.

André