[jira] Commented: (NUTCH-25) needs 'character encoding' detector


JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/NUTCH-25?page=comments#action_12376611 ]

Chris Fellows commented on NUTCH-25:

This was last updated May '05. Has this charset and language detection been integrated into Nutch yet?

If not, at what point should the detection happen: during fetching, parsing, or elsewhere? If this hasn't been fixed, any leads on where to insert the detection would be helpful.

> needs 'character encoding' detector
> -----------------------------------
>          Key: NUTCH-25
>          URL: http://issues.apache.org/jira/browse/NUTCH-25
>      Project: Nutch
>         Type: Wish

>     Reporter: Stefan Groschupf
>     Priority: Trivial

> transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=995730&group_id=59548&atid=491356
> submitted by:
> Jungshik Shin
> this is a follow-up to bug 993380 (figure out 'charset'
> from the meta tag).
> Although we can cover a lot of ground using the 'C-T'
> header field in the HTTP header and the
> corresponding meta tag in html documents (and in case
> of XML, we have to use a similar but a different
> 'parsing'), in the wild, there are a lot of documents
> without any information about the character encoding
> used. Browsers like Mozilla and search engines like
> Google use character encoding detectors to deal with
> these 'unlabelled' documents.
> Mozilla's character encoding detector is GPL/MPL'd and
> we might be able to port it to Java. Unfortunately,
> it's not foolproof. However, combined with other
> heuristics used by Mozilla and elsewhere, it should be
> possible to achieve a high detection rate.
> The following page has links to some other related pages.
> http://trainedmonkey.com/week/2004/26
> In addition to the character encoding detection, we
> also need to detect the language of a document, which
> is even harder and should be a separate bug (although
> it's related).
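
As a rough illustration of the layered approach the quoted report describes (HTTP Content-Type header first, then the HTML meta tag, then a detector for unlabelled documents), here is a minimal Java sketch. The class CharsetSniffer and its method names are hypothetical and are not part of Nutch. The last step is only a strict UTF-8 validity check followed by a caller-supplied fallback; it stands in for a real statistical detector such as a port of Mozilla's, and is not one.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not a Nutch class): picks a charset label for fetched content. */
class CharsetSniffer {
    // charset=... inside an HTTP Content-Type header value.
    private static final Pattern HEADER_CHARSET = Pattern.compile(
        "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    // charset=... inside an HTML <meta ...> tag.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the first captured charset label, or null if the pattern does not match. */
    static String find(Pattern pattern, String text) {
        if (text == null) return null;
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    /** True if the bytes decode as strict UTF-8 with no malformed sequences. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /**
     * Resolution order: HTTP header, then meta tag, then a UTF-8 validity
     * check, then the caller's fallback (an assumption of this sketch).
     */
    static String resolve(String contentType, byte[] body, String fallback) {
        String label = find(HEADER_CHARSET, contentType);
        if (label != null) return label;

        // Read the first bytes as ISO-8859-1, which maps every byte to one
        // char, so an ASCII meta tag can be found before the charset is known.
        int n = Math.min(body.length, 1024);
        String head = new String(body, 0, n, StandardCharsets.ISO_8859_1);
        label = find(META_CHARSET, head);
        if (label != null) return label;

        // Unlabelled document: a real detector would go here.
        return isValidUtf8(body) ? "UTF-8" : fallback;
    }
}
```

In this sketch the check runs on the raw bytes before any decoding, which suggests the parsing stage (where the fetched bytes and the response headers are both available) as one plausible place to hook detection in. That placement is a guess for illustration, not a statement about how Nutch is structured.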

This message is automatically generated by JIRA.