FW: Incorrect encoding detected

FW: Incorrect encoding detected

Markus Jelsma-2
I actually don't know. Can we specify a tika-config file in Nutch?

Thanks,
Markus
 
-----Original message-----

> From:Allison, Timothy B. <[hidden email]>
> Sent: Tuesday 31st October 2017 13:11
> To: [hidden email]
> Subject: RE: Incorrect encoding detected
>
> For 1.17, the simplest solution, I think, is to allow users to configure extending the detection limit via our @Field config methods, that is, via tika-config.xml.
>
> To confirm, Nutch will allow users to specify a tika-config file?  Will this work for you and Nutch?
>
> -----Original Message-----
> From: Markus Jelsma [mailto:[hidden email]]
> Sent: Tuesday, October 31, 2017 5:47 AM
> To: [hidden email]
> Subject: RE: Incorrect encoding detected
>
> Hello Timothy - what would be your preferred solution? Increase detection limit or skip inline styles and possibly other useless head information?
>
> Thanks,
> Markus
>
>  
>  
> -----Original message-----
> > From:Markus Jelsma <[hidden email]>
> > Sent: Friday 27th October 2017 15:37
> > To: [hidden email]
> > Subject: RE: Incorrect encoding detected
> >
> > Hi Tim,
> >
> > I have opened TIKA-2485 to track the problem.
> >
> > Thank you very very much!
> > Markus
> >
> >  
> >  
> > -----Original message-----
> > > From:Allison, Timothy B. <[hidden email]>
> > > Sent: Friday 27th October 2017 15:33
> > > To: [hidden email]
> > > Subject: RE: Incorrect encoding detected
> > >
> > > Unfortunately there is no way to do this now.  _I think_ we could make this configurable, though, fairly easily.  Please open a ticket.
> > >
> > > The next RC for PDFBox might be out next week, and we'll try to release Tika 1.17 shortly after that...so there should be time to get this in.
> > >
> > > -----Original Message-----
> > > From: Markus Jelsma [mailto:[hidden email]]
> > > Sent: Friday, October 27, 2017 9:12 AM
> > > To: [hidden email]
> > > Subject: RE: Incorrect encoding detected
> > >
> > > Hello Tim,
> > >
> > > > Getting rid of script and style contents sounds plausible indeed. But to work around the problem for now, can I instruct HTMLEncodingDetector from within Nutch to look beyond the limit?
> > >
> > > Thanks!
> > > Markus
> > >
> > >  
> > >  
> > > -----Original message-----
> > > > From:Allison, Timothy B. <[hidden email]>
> > > > Sent: Friday 27th October 2017 14:53
> > > > To: [hidden email]
> > > > Subject: RE: Incorrect encoding detected
> > > >
> > > > Hi Markus,
> > > >  
> > > > My guess is that the ~32,000 characters of mostly ascii-ish <script/> are what is actually being used for encoding detection.  The HTMLEncodingDetector only looks in the first 8,192 characters, and the other encoding detectors have similar (but longer?) restrictions.
> > > >  
> > > > At some point, I had a dev version of a stripper that removed contents of <script/> and <style/> before trying to detect the encoding[0]...perhaps it is time to resurrect that code and integrate it?
> > > >
> > > > Or, given that HTML has been, um, blossoming, perhaps, more simply, we should expand how far we look into a stream for detection?
> > > >
> > > > Cheers,
> > > >
> > > >                Tim
> > > >
> > > > [0] https://issues.apache.org/jira/browse/TIKA-2038
> > > >    
> > > >
> > > > -----Original Message-----
> > > > From: Markus Jelsma [mailto:[hidden email]]
> > > > Sent: Friday, October 27, 2017 8:39 AM
> > > > To: [hidden email]
> > > > Subject: Incorrect encoding detected
> > > >
> > > > Hello,
> > > >
> > > > We have a problem with Tika, encoding and pages on this website: https://www.aarstiderne.com/frugt-groent-og-mere/mixkasser
> > > >
> > > > > Using Nutch and Tika 1.12, but also using Tika 1.16, we found out that the regular HTML parser does a fine job, but our TikaParser has a tough job dealing with this HTML. For some reason Tika reports Content-Encoding=windows-1252 for this webpage, even though the page properly identifies itself as UTF-8.
> > > >
> > > > Of all websites we index, this is so far the only one giving trouble indexing accents, getting fÃ¥ instead of a regular få.
> > > >
> > > > Any tips to spare?
> > > >
> > > > Many many thanks!
> > > > Markus
> > > >
> > >
> >
>
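
The "fÃ¥ instead of få" symptom in the original report is what UTF-8 bytes look like when they are decoded as windows-1252. A minimal sketch of that effect, using only Python's standard codecs (it does not involve Tika or Nutch, and the variable names are illustrative):

```python
# "få" encoded as UTF-8 is three bytes: 0x66 0xC3 0xA5.
text = "få"
utf8_bytes = text.encode("utf-8")

# Decoding those bytes with the wrong charset turns the two-byte
# sequence for "å" into two separate windows-1252 characters.
garbled = utf8_bytes.decode("windows-1252")
print(garbled)  # fÃ¥

# The damage is reversible as long as the garbled text has not been altered:
# re-encode with the wrong charset, then decode with the right one.
repaired = garbled.encode("windows-1252").decode("utf-8")
print(repaired)  # få
```

This only reproduces the symptom; it says nothing about why the detector chose windows-1252 for this page.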

RE: Incorrect encoding detected

Markus Jelsma-2
Any ideas?

Thanks!

 
 
-----Original message-----

> From:Markus Jelsma <[hidden email]>
> Sent: Tuesday 31st October 2017 13:14
> To: User <[hidden email]>
> Subject: FW: Incorrect encoding detected
>
> I actually don't know. Can we specify a tika-config file in Nutch?
>
> Thanks,
> Markus

Re: Incorrect encoding detected

Sebastian Nagel
I haven't had the time to dig into the problem:
neither how to pass a tika-config file, nor why
parse-html actually detects the encoding correctly
although it also only looks at the first 8192
characters (see CHUNK_SIZE).

Just one point: for the MIME detection we also
pass the Content-Type sent by the web server to Tika.
Would it also help to pass it as an additional clue
for encoding detection?
In the concrete example the server sends
  Content-Type: text/html; charset=utf-8

Sebastian
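
As a rough illustration of the hint Sebastian describes, the charset can be read from the HTTP Content-Type value before any content sniffing takes place. This is a sketch using Python's standard library only; the helper name charset_from_content_type is hypothetical and is not part of Nutch or Tika:

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the lower-cased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


# The header sent by the server in the concrete example from this thread.
print(charset_from_content_type("text/html; charset=utf-8"))  # utf-8

# Without a charset parameter there is no hint to pass along.
print(charset_from_content_type("text/html"))  # None
```

Whether such a hint should override or merely inform the detector's own guess is a design decision that this sketch does not address.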

On 11/01/2017 07:06 PM, Markus Jelsma wrote:

> Any ideas?
>
> Thanks!


RE: Incorrect encoding detected

Markus Jelsma-2
In reply to this post by Markus Jelsma-2
Hello Sebastian,

I just spotted tika.config.file in the TikaParser, so that's how we can specify a config file.

Meanwhile Timothy Allison committed a fix. I will try the nightly build tomorrow.

Thanks,
Markus
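
The committed fix itself is not shown in this thread. The sketch below only illustrates the underlying issue Tim described: a charset declaration that sits behind roughly 32,000 bytes of inline script is invisible to a detector that inspects only the first 8,192 bytes. It is a simplified stand-in, not Tika's HTMLEncodingDetector or the TIKA-2038 stripper; the function names, the regular expressions, and the sample page are all assumptions made for illustration.

```python
import re

# Simplified pattern for a charset declared inside a <meta> tag.
META_CHARSET = re.compile(rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9_-]+)", re.IGNORECASE)

# Matches a whole <script>...</script> or <style>...</style> element.
SCRIPT_OR_STYLE = re.compile(rb"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL)


def sniff_meta_charset(html, limit=8192):
    """Return the <meta> charset found within the first `limit` bytes, or None."""
    match = META_CHARSET.search(html[:limit])
    return match.group(1).decode("ascii").lower() if match else None


def strip_script_and_style(html):
    """Remove script and style elements before sniffing."""
    return SCRIPT_OR_STYLE.sub(b"", html)


# Hypothetical page: 32,000 bytes of inline script ahead of the charset declaration.
script = b"<script>" + b"var x = 1;" * 3200 + b"</script>"
page = b"<html><head>" + script + b'<meta charset="utf-8"></head><body>f\xc3\xa5</body></html>'

print(sniff_meta_charset(page))                          # None
print(sniff_meta_charset(page, limit=65536))             # utf-8
print(sniff_meta_charset(strip_script_and_style(page)))  # utf-8
```

Both options raised in the thread, a larger detection limit and skipping script/style content, bring the declaration back into view for this sample page. The first is simpler, while the second keeps the inspected window small.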
 
-----Original message-----

> From:Sebastian Nagel <[hidden email]>
> Sent: Thursday 2nd November 2017 13:32
> To: [hidden email]
> Subject: Re: Incorrect encoding detected
>
> I haven't had the time to dig into the problem:
> neither how to pass a tika-config file, nor why
> parse-html actually detects the encoding correctly
> although it also only looks at the first 8192
> characters (see CHUNK_SIZE).
>
> Just one point: for the MIME detection we also
> pass the Content-Type sent by the web server to Tika.
> Would it also help to pass it as an additional clue
> for encoding detection?
> In the concrete example the server sends
>   Content-Type: text/html; charset=utf-8
>
> Sebastian