addBinaryContent and string length must be a multiple of four


addBinaryContent and string length must be a multiple of four

Michael Coffey
I think I have an instance of the known bug https://issues.apache.org/jira/browse/NUTCH-2186
I need to keep the raw HTML in my Solr index (or somewhere) so that an external tool can access and parse it. So, I added -addBinaryContent and -base64 to my indexing command. On the very first segment, I get a bunch of failures with messages that say "String length must be a multiple of four." The same is true if I omit the -base64 argument.
Is there a workaround or fix for this issue? I am using Nutch 1.12 and Solr 5.4.1.


Re: addBinaryContent and string length must be a multiple of four

Michael Coffey
I guess there is no solution or workaround for the addBinaryContent bug, so I have to write code to read directly from the segment data. If I don't write Java, I guess I have to run readseg -dump and then parse the output text file.
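
For the "parse the output text file" route, a rough Python sketch follows. It rests on an assumption that should be checked against a real dump: that each record in the text dump starts with a line beginning `Recno:: `, followed by a line beginning `URL:: `. The helper name `split_records` is hypothetical and not part of Nutch.

```python
from typing import Dict, Iterator, List

# ASSUMPTION: these markers describe the layout of the text dump.
# Verify them against an actual dump file before relying on this.
RECORD_MARKER = "Recno:: "
URL_MARKER = "URL:: "


def split_records(lines: List[str]) -> Iterator[Dict[str, str]]:
    """Yield one {'url': ..., 'body': ...} dict per record in a text dump."""
    started = False
    url = None
    body: List[str] = []
    for line in lines:
        if line.startswith(RECORD_MARKER):
            # A new record begins: emit the previous one, if any.
            if started:
                yield {"url": url or "", "body": "\n".join(body)}
            started, url, body = True, None, []
        elif started and url is None and line.startswith(URL_MARKER):
            url = line[len(URL_MARKER):].strip()
        elif started:
            body.append(line)
    if started:
        yield {"url": url or "", "body": "\n".join(body)}
```

Usage would be along the lines of `split_records(open(path, encoding="utf-8", errors="replace").read().splitlines())`. A page whose own content contains a line starting with the record marker would be split incorrectly, so this is only a starting point.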



Re: addBinaryContent and string length must be a multiple of four

Sebastian Nagel
Hi Michael,

Can you share more information regarding the Nutch and Solr versions, and at least one
document, to make the problem reproducible? It doesn't look like a general problem: at least,
I'm not able to reproduce it. Indexing with -addBinaryContent -base64 succeeds (recent
Nutch snapshot / master, Solr 6.6.0).

Thanks,
Sebastian

On 10/20/2017 06:46 PM, Michael Coffey wrote:

> I guess there is no solution or workaround for the addBinaryContent bug, so I have to write code to read directly from segment data. If not writing Java, I guess I have to do readseg-dump and then parse the output text file.


Re: addBinaryContent and string length must be a multiple of four

Michael Coffey
Thanks for the reply!

I'm not sure of the best way to illustrate the issue, as I struggle with Solr log management within Docker. However, here are a few URLs that have exhibited the problem. In each case, Solr complains "Error adding field 'binaryContent'" ... "msg=String length must be a multiple of four"


http://cnnfn.cnn.com/2017/03/07/investing/carl-icahn-betting-against-trump-rally/index.html


http://buzz.money.cnn.com/author/ctymkiw/

http://abcnews.go.com/GMA/video/rose-mcgowan-dropped-agent-calling-sexist-casting-note-32047448

http://buzz.money.cnn.com/tag/investing/

Meanwhile, the following URL also gets an "error adding field" message but with "msg=Illegal character" instead of "String length must be a multiple of four". Don't know if it's related.

http://buzz.money.cnn.com/author/byheatherlong/
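
Both messages quoted above are the kind of complaint a strict Base64 decoder makes about its input: one about the overall length, the other about a character outside the Base64 alphabet. As a minimal sketch, assuming the field value is expected to be padded standard Base64, the following Python helper classifies a string the same way. The name `check_base64` is hypothetical; it is not part of Nutch or Solr and does not reproduce Solr's own decoder.

```python
import base64
import binascii


def check_base64(value: str) -> str:
    """Classify a string as valid padded Base64 or name the first problem found."""
    # Padded Base64 encodes every 3 bytes as 4 characters,
    # so a valid string always has a length divisible by four.
    if len(value) % 4 != 0:
        return "length is not a multiple of four"
    try:
        # validate=True rejects characters outside the Base64 alphabet
        # instead of silently skipping them.
        base64.b64decode(value, validate=True)
    except binascii.Error:
        return "illegal character or bad padding"
    return "ok"
```

Raw HTML such as `<html>` fails the length or character check, while the output of `base64.b64encode` passes, so a check like this can help tell whether the value reaching the index was actually encoded.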


All tests done with Nutch 1.12, Solr 5.4.1.

BTW, I wouldn't mind updating Nutch and Solr. What is your recommended most-stable combination of versions? I am using Hadoop 2.7.3 (from Hortonworks).


At one point, Lewis John McG reported on such an issue in https://issues.apache.org/jira/browse/NUTCH-2186

Re: addBinaryContent and string length must be a multiple of four

Sebastian Nagel
Hi Michael,

I tried to reproduce the problem with the current Nutch master and Solr 6.6.0
but could not; indexing the binary content succeeded:
- that's the case for two of the URLs you sent (one result is shown below)
- those from buzz.money.cnn.com are blocked somehow (fetching failed)

Building Nutch isn't difficult:
 git clone http://github.com/apache/nutch.git
 cd nutch
 ant
You'll find the Nutch runtime in runtime/local/ or runtime/deploy/ (for usage on Hadoop).

The tutorial
  https://wiki.apache.org/nutch/NutchTutorial
should already be up to date on how to use recent
Solr versions.


Best,
Sebastian



{
  "responseHeader":{
    "status":0,
    "QTime":2,
    "params":{
      "q":"id:http\\://cnnfn.cnn.com/2017/03/07/investing/carl-icahn-betting-against-trump-rally/index.html",
      "indent":"on",
      "wt":"json",
      "_":"1508829081797"}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "date":"2017-10-24T07:01:05.593Z",
        "author":"Matt Egan",
        "title":"Trump adviser Carl Icahn bets against the Trump rally - Mar. 7, 2017",
        "type":["application/xhtml+xml",
          "application",
          "xhtml+xml"],
        "url":"http://cnnfn.cnn.com/2017/03/07/investing/carl-icahn-betting-against-trump-rally/index.html",
        "content":"Trump adviser Carl Icahn bets against the Trump rally - Mar. 7, ...",
        "tstamp":"2017-10-24T07:01:05.593Z",
        "segment":"20171024090054",
        "digest":"cff265f11bd74bd104f3c6e1c7185484",
        "boost":1.0,
        "id":"http://cnnfn.cnn.com/2017/03/07/investing/carl-icahn-betting-against-trump-rally/index.html",
        "_version_":1582121409782480896,
        "binaryContent":"+IDxzY3JpcHQgdHlwZT0idGV4dC9qYXZhc2NyaXB0Ij4gdmFyIHVybFByZT0iaHR0cDovL21hcmtld..."}]
  }}
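
For the external tool that needs the raw HTML, a response shaped like the one above could be decoded roughly as follows. This is a sketch under assumptions: the response has the `response` / `docs` layout shown, each document carries an `id`, and the `binaryContent` value is complete Base64 (the value printed above is truncated, so it would not decode as shown). The function name `raw_content_by_id` is hypothetical.

```python
import base64
import json
from typing import Dict


def raw_content_by_id(response_text: str, field: str = "binaryContent") -> Dict[str, bytes]:
    """Map each document id in a Solr JSON response to its decoded binary field."""
    response = json.loads(response_text)
    decoded: Dict[str, bytes] = {}
    for doc in response["response"]["docs"]:
        encoded = doc.get(field)
        if encoded is None:
            # Document was indexed without the binary field; skip it.
            continue
        decoded[doc["id"]] = base64.b64decode(encoded)
    return decoded
```

The result is raw bytes rather than text, since the character encoding of each page is not known at this point; the caller would decode it using the charset declared by the page or its HTTP headers.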


On 10/24/2017 01:07 AM, Michael Coffey wrote:

> http://cnnfn.cnn.com/2017/03/07/investing/carl-icahn-betting-against-trump-rally/index.html
>
>
> http://buzz.money.cnn.com/author/ctymkiw/
>
> http://abcnews.go.com/GMA/video/rose-mcgowan-dropped-agent-calling-sexist-casting-note-32047448
>
> http://buzz.money.cnn.com/tag/investing/
>
> Meanwhile, the following URL also gets an "error adding field" message but with "msg=Illegal character" instead of "String length must be a multiple of four". Don't know if it's related.
>
> http://buzz.money.cnn.com/author/byheatherlong/