invalid utf8 chars when indexing or cleaning

invalid utf8 chars when indexing or cleaning

Michael Coffey
Lately, I have seen many tasks and jobs fail in Solr when running nutch index and nutch clean.
Messages during indexing look like this:
17/08/24 19:18:59 INFO mapreduce.Job:  map 100% reduce 99%
17/08/24 19:19:36 INFO mapreduce.Job: Task Id : attempt_1502929850483_1329_r_000007_2, Status : FAILED
Error: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://codero4.neocortix.com:8984/solr/popular: [
com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at char #104705, byte #219135)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:575)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:241)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:230)
        at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1220)
        at org.apache.nutch.indexwriter.solr.SolrIndexWriter.push(SolrIndexWriter.java:209)
        at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:173)
        at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:85)
        at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50)
        at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:41)
        at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(ReduceTask.java:493)

Messages during cleaning look like this:
17/08/22 09:24:01 INFO mapreduce.Job:  map 100% reduce 92%
17/08/22 09:25:57 INFO mapreduce.Job: Task Id : attempt_1502929850483_1016_r_000003_1, Status : FAILED
Error: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://codero4.neocortix.com:8984/solr/popular: [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at char #16099, byte #16383)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:575)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:241)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:230)
        at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:150)
        at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:825)
        at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:788)
        at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:803)
        at org.apache.nutch.indexwriter.solr.SolrIndexWriter.push(SolrIndexWriter.java:222)
        at org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:187)
        at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:178)
        at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:115)
        at org.apache.nutch.indexer.CleaningJob$DeleterReducer.close(CleaningJob.java:120)
        at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:245)
Can anyone suggest a way to fix this? I am using Nutch 1.12 and Solr 5.4.1. I recently upgraded to Hadoop 2.7.4 and Java 1.8, and I don't remember noticing this happening with Hadoop 2.7.2 and Java 1.7. It happens very often now.
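In case it matters: U+FFFF is a Unicode "noncharacter". Java strings carry it without complaint, which is presumably how it travels from the crawl all the way to Solr before the XML parser rejects it. A minimal illustration (my own snippet, not from Nutch):

    // U+FFFF survives ordinary Java string handling, so nothing upstream
    // of Solr's XML parser has a reason to object to it.
    String s = "title\uFFFFrest";
    System.out.println(s.length());         // 10 -- no error
    System.out.println((int) s.charAt(5));  // 65535 (0xFFFF)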
Re: invalid utf8 chars when indexing or cleaning

Michael Coffey
Does anybody have any thoughts on this? It seems similar to the NUTCH-1016 bug that was fixed in version 1.4.
Some more bits of information: the indexer job rarely fails (only 1 of the last 99 segments), but the cleaning job now fails every time. Once again, this is Nutch 1.12 and Solr 5.4.1. I recently upgraded to Hadoop 2.7.4 and Java 1.8 from Hadoop 2.7.2 and Java 1.7. Could this be some kind of version mismatch?

Re: invalid utf8 chars when indexing or cleaning

Jorge Betancourt
From the logs it looks like the error is coming from the Solr side. Do you mind checking/sharing the logs on your Solr server? Can you pinpoint which URL is causing the issue?
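If it helps, a small stand-alone scanner can report where the XML-illegal codepoints sit in a dumped document (a throwaway sketch of mine, not anything from Nutch):

    // Throwaway scanner: prints the offsets of codepoints that Solr's
    // XML parser will reject (Unicode noncharacters and C0 controls).
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class FindBadChars {
      public static void main(String[] args) throws Exception {
        String text = new String(Files.readAllBytes(Paths.get(args[0])),
            StandardCharsets.UTF_8);
        for (int i = 0; i < text.length(); i++) {
          char ch = text.charAt(i);
          // U+FDD0..U+FDEF, U+FFFE, U+FFFF and most C0 controls are not
          // legal in XML 1.0 documents.
          if (ch == 0xFFFF || ch == 0xFFFE
              || (ch >= 0xFDD0 && ch <= 0xFDEF)
              || (ch < 0x20 && ch != '\t' && ch != '\n' && ch != '\r')) {
            System.out.printf("char #%d: U+%04X%n", i, (int) ch);
          }
        }
      }
    }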
Best Regards, Jorge

RE: invalid utf8 chars when indexing or cleaning

Markus Jelsma-2
In reply to this post by Michael Coffey
The bug is identical, but I fixed it! You should verify the output Nutch generates and inspect it manually; there should be a 0xffff at that byte. If it really is there, we need to check the fix once more, although I am sure the patch works as intended.

Get the XML, pass it through the method and see what it does to the output.
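For reference, the NUTCH-1016 patch filters Unicode noncharacters out of field values before they are serialized. The sketch below is along the lines of that method (SolrUtils.stripNonCharCodepoints), adapted from memory, so treat the name and exact ranges as approximate rather than the verbatim 1.12 source:

    // Noncharacter filter in the spirit of SolrUtils.stripNonCharCodepoints
    // (NUTCH-1016); adapted from memory, not copied from the 1.12 tree.
    public static String stripNonCharCodepoints(String input) {
      StringBuilder retval = new StringBuilder(input.length());
      for (int i = 0; i < input.length(); i++) {
        char ch = input.charAt(i);
        // Drop U+FDD0..U+FDEF, U+FFFE and U+FFFF, which are not legal in
        // XML 1.0 and make Woodstox fail on the Solr side.
        if (ch != 0xFFFF && ch != 0xFFFE && (ch < 0xFDD0 || ch > 0xFDEF)) {
          retval.append(ch);
        }
      }
      return retval.toString();
    }

Feeding the captured XML through this should show whether the 0xffff at the reported byte survives the filter.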

Re: invalid utf8 chars when indexing or cleaning

Michael Coffey
It sounds like a good suggestion, but I don't know what you mean by "verify the output Nutch generates and inspect it manually." How do I get a look at that XML?


RE: invalid utf8 chars when indexing or cleaning

Markus Jelsma-2
In reply to this post by Michael Coffey
Set logging to debug; HttpClient then logs what's being sent over the wire, so you can capture the data. It is less tedious than Wireshark.
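Assuming the stock log4j setup that ships with Nutch 1.12, something like this in conf/log4j.properties should do it (standard HttpComponents logger names; adjust if your client differs):

    # Wire-level HttpClient logging -- very verbose, enable only briefly.
    log4j.logger.org.apache.http=DEBUG
    log4j.logger.org.apache.http.wire=DEBUG

The org.apache.http.wire logger dumps the request bodies, so you can search the task logs for the document carrying the 0xffff.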
