[jira] [Commented] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents


JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676109#comment-16676109 ]

Hans Brende commented on TIKA-2771:

[~[hidden email]] I did a little experimentation with each of the input texts that were causing trouble, for TIKA-771, TIKA-868, and this issue.

Here are the results:
||Issue||Input Text||Decoded as IBM500||byteMap'ed||Best match||Confidence (p)||Letters tested (n)||Wilson lower bound (p')||
|TIKA-771|"Hello, World!"|"çÁ%%?  ï?Ê%À "|"çá     ï ê à "|IBM500(fr)|30%|5|*p' = 7%*|
|TIKA-868|"Indanyl"|"ñ>À/>`%"|"ñ à    "|IBM500(fr)|60%|2|*p' = 13%*|
|TIKA-2771|"Name: Amanda\nJazz Band"|"+/\_Á   \_/>À/ [/:: â/>À"|"   á      à       â  à"|IBM500(fr)|66%|4|*p' = 24%*|

One thing is evident to me from this test: it's not the mapping of control & punctuation chars to 0x20 that's the problem (the byteMap for ISO-8859-1 *also* strips control & punct chars by mapping them to whitespace)! Rather, the problem lies in the fact that under IBM500, much of the text is *likely* to be mapped to punctuation & control chars, *but the confidence is not reduced when the amount of actual alphabetic text being tested shrinks to near-zero.*
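
To make the shrinking sample concrete, here is a rough sketch (not Tika code) that decodes the three ASCII inputs as IBM500 and counts how many of the resulting characters are letters. The class and method names are made up for illustration, `Character.isLetter` is only a stand-in for the detector's byteMap, and the sketch assumes the JDK ships the IBM500 charset (normally part of the `jdk.charsets` module).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika.
final class Ibm500LetterCount {

    // Reinterpret the ASCII bytes of the input as IBM500 (EBCDIC) and count
    // how many decoded characters are letters. Character.isLetter is an
    // assumed stand-in for what survives the detector's byteMap.
    static int lettersUnderIbm500(String asciiText) {
        byte[] raw = asciiText.getBytes(StandardCharsets.US_ASCII);
        String decoded = new String(raw, Charset.forName("IBM500"));
        int letters = 0;
        for (int i = 0; i < decoded.length(); i++) {
            if (Character.isLetter(decoded.charAt(i))) {
                letters++;
            }
        }
        return letters;
    }

    public static void main(String[] args) {
        String[] inputs = {"Hello, World!", "Indanyl", "Name: Amanda\nJazz Band"};
        for (String input : inputs) {
            System.out.println(lettersUnderIbm500(input) + " of "
                    + input.length() + " characters decode to letters");
        }
    }
}
```

On a JDK where IBM500 is available, this should report 5, 2, and 4 letters for the three inputs, which lines up with the n column in the table.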

The lower bound of the Wilson score confidence interval, however, seems to give a much better estimate of our actual confidence, because it accounts for the number of characters we actually end up testing. (And while the initial "confidence" value is to some extent arbitrary, what matters about the Wilson lower bound is not the final number we get out, but that it reduces confidences *relative* to the confidences of other charsets that succeeded in getting more alphabetic text out of the input, and *relative* to the confidences of any declared charsets.)
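
For reference, a minimal sketch of the Wilson lower bound computation, taking the reported confidence as p and the number of letters actually tested as n. The class and method names are hypothetical, and z = 1.96 (a two-sided 95% interval) is assumed here rather than stated above, although that value does reproduce the last column of the table.

```java
// Hypothetical helper, not part of Tika.
final class WilsonLowerBound {

    // Assumed z-score for a two-sided 95% confidence interval.
    static final double Z = 1.96;

    // Lower bound of the Wilson score interval for an observed proportion p
    // (0..1) based on n observations. Returns 0 when nothing was tested.
    static double lowerBound(double p, int n) {
        if (n <= 0) {
            return 0.0;
        }
        double z2 = Z * Z;
        double centre = p + z2 / (2.0 * n);
        double margin = Z * Math.sqrt(p * (1.0 - p) / n + z2 / (4.0 * n * n));
        return (centre - margin) / (1.0 + z2 / n);
    }

    public static void main(String[] args) {
        double[] p = {0.30, 0.60, 0.66};
        int[] n = {5, 2, 4};
        for (int i = 0; i < p.length; i++) {
            long percent = Math.round(100.0 * lowerBound(p[i], n[i]));
            System.out.printf("p = %.2f, n = %d -> p' = %d%%%n", p[i], n[i], percent);
        }
    }
}
```

With these inputs the rounded results are 7%, 13%, and 24%. A charset that kept, say, 40 letters at the same p = 0.66 would be reduced far less, which is the relative effect described above.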

> enableInputFilter() wrecks charset detection for some short html documents
> --------------------------------------------------------------------------
>                 Key: TIKA-2771
>                 URL: https://issues.apache.org/jira/browse/TIKA-2771
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 1.19.1
>            Reporter: Hans Brende
>            Priority: Critical
> When I try to run the CharsetDetector on http://w3c.github.io/microdata-rdf/tests/0065.html I get the very strange most confident result of "IBM500" with a confidence of 60 when I enable the input filter, *even if I set the declared encoding to UTF-8*.
> This can be replicated with the following code:
> {code:java}
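> import java.nio.charset.StandardCharsets;
> import java.util.Arrays;
> import org.apache.tika.parser.txt.CharsetDetector;
>
> CharsetDetector detect = new CharsetDetector();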
> detect.enableInputFilter(true);
> detect.setDeclaredEncoding("UTF-8");
> detect.setText(("<!DOCTYPE html>\n" +
>         "<div>\n" +
>         "  <div itemscope itemtype=\"http://schema.org/Person\" id=\"amanda\" itemref=\"a b\"></div>\n" +
>         "  <p id=\"a\">Name: <span itemprop=\"name\">Amanda</span></p>\n" +
>         "  <p id=\"b\" itemprop=\"band\">Jazz Band</p>\n" +
>         "</div>").getBytes(StandardCharsets.UTF_8));
> Arrays.stream(detect.detectAll()).forEach(System.out::println);
> {code}
> which prints:
> {noformat}
> Match of IBM500 in fr with confidence 60
> Match of UTF-8 with confidence 57
> Match of ISO-8859-9 in tr with confidence 50
> Match of ISO-8859-1 in en with confidence 50
> Match of ISO-8859-2 in cs with confidence 12
> Match of Big5 in zh with confidence 10
> Match of EUC-KR in ko with confidence 10
> Match of EUC-JP in ja with confidence 10
> Match of GB18030 in zh with confidence 10
> Match of Shift_JIS in ja with confidence 10
> Match of UTF-16LE with confidence 10
> Match of UTF-16BE with confidence 10
> {noformat}
> Note that if I do not set the declared encoding to UTF-8, the result is even worse, with UTF-8 falling from a confidence of 57 to 15.
> This is screwing up 1 out of 84 of my online microdata extraction tests over in Any23 (as that particular page is being rendered into complete gibberish), so I had to implement some hacky workarounds which I'd like to remove if possible.
> EDIT: This issue may be related to TIKA-2737 and [this comment|https://issues.apache.org/jira/browse/TIKA-539?focusedCommentId=13213524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13213524].
