Charset detection

Charset detection

Antoni Mylka-2
Aperturians, Tika

I was wondering if anyone has any experience with the jchardet library
for charset detection. Does it work? What kinds of documents does it
actually support?

Christiaan has posted an idea on the Aperture tracker for how we could use
jchardet to improve the plain text extractor, but it doesn't seem to
work.  Or maybe the Tika guys have figured it out already and I can just
use Tika for this? :)

Antoni Mylka
[hidden email]

Re: Charset detection

Jérôme Charron
Hi Antoni,

I tried many charset detection libraries while working on Nutch, but none of
them really worked.
I also took a look at the Mozilla charset detector, but it was
really too complicated to integrate into Nutch (or Tika).

Best regards

Jérôme

--
Jérôme Charron
Directeur Technique @ WebPulse
Tel: +33675742890 <= ** NEW **
eMail : [hidden email]
http://www.webpulse.fr/
http://www.shopreflex.com/
http://www.staragora.com/

Re: Charset detection

Alex Ott
Hello

In my experience, using n-grams works pretty well for language/charset
detection with single-byte encodings.
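
To illustrate the n-gram idea, here is a minimal sketch in Python (chosen for brevity; it is not taken from any of the libraries discussed in this thread). It decodes the bytes under each candidate single-byte encoding and picks the decoding whose character bigrams best match a known-language profile. The sample sentence, the candidate list and the names `bigrams`, `score` and `guess_charset` are assumptions made for this example; a real profile would be built from a large corpus.

```python
# Tiny Russian training sample (an assumption for this sketch);
# a usable profile needs far more text than this.
SAMPLE = "съешь же ещё этих мягких французских булок да выпей чаю"

# Candidate single-byte Cyrillic encodings to choose between.
CANDIDATES = ["cp1251", "koi8_r", "cp866", "iso8859_5"]


def bigrams(text):
    """Return all overlapping two-character substrings of text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


# The language profile: the set of bigrams seen in the training sample.
PROFILE = set(bigrams(SAMPLE))


def score(text):
    """Fraction of the text's bigrams that also occur in the profile."""
    grams = bigrams(text)
    if not grams:
        return 0.0
    return sum(g in PROFILE for g in grams) / len(grams)


def guess_charset(data, candidates=CANDIDATES):
    """Pick the candidate encoding whose decoding looks most like the profile."""
    return max(
        candidates,
        key=lambda enc: score(data.decode(enc, errors="replace")),
    )
```

A wrong decoding of Cyrillic bytes tends to produce uppercase letters, box-drawing characters or an implausible letter order, so its bigrams rarely match the profile. With one profile per language, the same scoring could also be used to guess the language.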

--
With best wishes,                    Alex Ott, MBA
http://alexott.blogspot.com/
http://alexott-ru.blogspot.com/
http://xtalk.msk.su/~ott/

Re: [Aperture-devel] Charset detection

project2501
In reply to this post by Jérôme Charron
Yeah, there are many uncertainties with regard to charset detection, and
there is no 100% accurate method of determining the charset. It's more art
than science. That said, I will hunt around for a decent library too.

Re: [Aperture-devel] Charset detection

Thilo Goetz
I've had reasonable success with the ICU charset
detection, but that's the only one I've tried and
so can't compare it to any other.

--Thilo

Re: [Aperture-devel] Charset detection

Christiaan Fluit-2
In reply to this post by Antoni Mylka-2
Antoni Mylka wrote:
> I was wondering if anyone has any experience with the jchardet library
> for charset detection. Does it work? What kinds of documents does it
> actually support.
>
> Christiaan has posted an idea to the Aperture tracker how we could use
> jchardet to improve the plain text extractor, but it doesn't seem to
> work.  Or maybe the Tika guys have figured it out already and I can just
> use Tika for this? :)

We started using jchardet in conjunction with cpdetector to better
support Chinese, Japanese and Korean documents in our app on all Windows
language variants. Otherwise it would have to fall back to the default
platform encoding or a user setting whenever a UTF Byte Order Mark was
missing. It seemed to do a pretty good job on the test files that I used
(primarily CJK and English docs). Only recently did we find out that
jchardet doesn't detect Cyrillic documents.

It seems that the set of charsets supported by jchardet is a subset of
those supported by Mozilla/Firefox (jchardet is supposed to be a Java
port of the charset detection algorithm in those apps). As adding
charsets is a matter of porting some static data structures encoded in
C or C++ to Java, perhaps it's feasible to do that ourselves? Provided
that the algorithm hasn't changed, of course. I have not had any contact
with any of the jchardet developers yet.

When testing the Aperture test docs, only plain-text-utf16le.txt does
not get processed correctly anymore, correct? This is a cpdetector
problem, not a jchardet problem. We already have solid code (IMHO :) )
for BOM detection in our existing PlainTextExtractor, so there is no
need to use cpdetector's ByteOrderMarkDetector.
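
For readers unfamiliar with BOM detection, a minimal sketch in Python follows (for brevity; it is not Aperture's PlainTextExtractor code and not cpdetector's ByteOrderMarkDetector). It checks the leading bytes against the known UTF byte order marks and falls back to a caller-supplied encoding when none is present. The names `detect_bom` and `decode_text` and the UTF-8 default are assumptions made for this example.

```python
import codecs

# Longer BOMs must be tested first: the UTF-32LE BOM (FF FE 00 00)
# begins with the UTF-16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_text(data, fallback="utf-8"):
    """Decode using the BOM if there is one, else the fallback encoding."""
    encoding, skip = detect_bom(data)
    if encoding is None:
        # No BOM: this is the point where a statistical detector, the
        # platform default, or a user setting would be consulted instead.
        encoding = fallback
    return data[skip:].decode(encoding, errors="replace")
```

A file such as plain-text-utf16le.txt would take the BOM branch only if it actually starts with FF FE; without a BOM it would go to the fallback, which is where a detector would have to take over.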


Regards,

Chris