[jira] [Comment Edited] (TIKA-2771) enableInputFilter() wrecks charset detection for some short html documents

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676887#comment-16676887 ]

Hans Brende edited comment on TIKA-2771 at 11/6/18 3:33 PM:
------------------------------------------------------------

Compare to the following analogous test for ISO-8859-1 variants:

||Issue||ISO-8859-X||byteMap'ed||2nd best||p||n||Wilson L.B.||IBM500 L.B.||
|TIKA-771|"Hello, World!"|"hello  world "|ISO-8859-1(it)|23%|10|p' = 7%|7%|
|TIKA-868|"Indanyl"|"indanyl"|ISO-8859-9(tr)|37%|7|p' = 12%|*13%*|
|TIKA-2771|"Name: Amanda\nJazz Band"|"name  amanda jazz band"|ISO-8859-1(en)|54%|18|*p' = 32%*|24%|

To calculate the Wilson lower bound, I used a confidence of 95% (i.e., z = 1.96).
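
For reference, here is that arithmetic as a small Java sketch (the method name {{wilsonLowerBound}} is mine, not anything in Tika); it reproduces the p' values in the table above:

{code:java}
// Wilson score interval lower bound for an observed proportion p over n trials,
// at 95% confidence (z = 1.96).
static double wilsonLowerBound(double p, int n) {
    double z = 1.96;
    double z2 = z * z;
    double center = p + z2 / (2 * n);
    double margin = z * Math.sqrt((p * (1 - p) + z2 / (4 * n)) / n);
    return (center - margin) / (1 + z2 / n);
}

// wilsonLowerBound(0.54, 18) ≈ 0.32  -> the 32% in the TIKA-2771 row
// wilsonLowerBound(0.37, 7)  ≈ 0.12  -> the 12% in the TIKA-868 row
// wilsonLowerBound(0.23, 10) ≈ 0.07  -> the 7% in the TIKA-771 row
{code}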

I'm not saying that the Wilson lower bound is *the* way to go (as you can see, it wasn't quite enough to fix TIKA-868, although it did reduce the discrepancy from 60% - 37% = *23%* to 13% - 12% = *1%*). So this method might need some adjustments. However, it does seem to represent a significant improvement over the way things are *now*.

*A simpler alternative would be to discard any charset which, after the input is byteMap'ed, produces more 0x20 bytes that did not map from an original 0x20 (or from 0x40, in the case of IBM500) than some other candidate charset does. This method would succeed for all 3 of the test cases I've presented here.* (However, I'm not sure what the full ramifications of this strategy would be.)
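
To make that concrete, here is a rough sketch (not Tika's actual internals; {{byteMap}} stands in for the per-charset byte map that the single-byte recognizers apply, and {{spaceSource}} would be 0x20 for the ISO-8859-X charsets or 0x40 for IBM500):

{code:java}
// Count 0x20 bytes in the byteMap'ed output that did NOT come from the
// charset's own space byte in the raw input.
static int spuriousSpaces(byte[] input, byte[] byteMap, int spaceSource) {
    int count = 0;
    for (byte b : input) {
        int raw = b & 0xFF;
        int mapped = byteMap[raw] & 0xFF;
        if (mapped == 0x20 && raw != spaceSource) {
            count++;
        }
    }
    return count;
}

// A candidate charset would then be discarded (or penalized) if its
// spuriousSpaces count exceeds that of some other candidate.
{code}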


> enableInputFilter() wrecks charset detection for some short html documents
> --------------------------------------------------------------------------
>
>                 Key: TIKA-2771
>                 URL: https://issues.apache.org/jira/browse/TIKA-2771
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 1.19.1
>            Reporter: Hans Brende
>            Priority: Critical
>
> When I try to run the CharsetDetector on http://w3c.github.io/microdata-rdf/tests/0065.html, the most confident result I get is, very strangely, "IBM500" with a confidence of 60 when I enable the input filter, *even if I set the declared encoding to UTF-8*.
> This can be replicated with the following code:
> {code:java}
> import java.nio.charset.StandardCharsets;
> import java.util.Arrays;
>
> import org.apache.tika.parser.txt.CharsetDetector;
>
> CharsetDetector detect = new CharsetDetector();
> detect.enableInputFilter(true);
> detect.setDeclaredEncoding("UTF-8");
> detect.setText(("<!DOCTYPE html>\n" +
>         "<div>\n" +
>         "  <div itemscope itemtype=\"http://schema.org/Person\" id=\"amanda\" itemref=\"a b\"></div>\n" +
>         "  <p id=\"a\">Name: <span itemprop=\"name\">Amanda</span></p>\n" +
>         "  <p id=\"b\" itemprop=\"band\">Jazz Band</p>\n" +
>         "</div>").getBytes(StandardCharsets.UTF_8));
> Arrays.stream(detect.detectAll()).forEach(System.out::println);
> {code}
> which prints:
> {noformat}
> Match of IBM500 in fr with confidence 60
> Match of UTF-8 with confidence 57
> Match of ISO-8859-9 in tr with confidence 50
> Match of ISO-8859-1 in en with confidence 50
> Match of ISO-8859-2 in cs with confidence 12
> Match of Big5 in zh with confidence 10
> Match of EUC-KR in ko with confidence 10
> Match of EUC-JP in ja with confidence 10
> Match of GB18030 in zh with confidence 10
> Match of Shift_JIS in ja with confidence 10
> Match of UTF-16LE with confidence 10
> Match of UTF-16BE with confidence 10
> {noformat}
> Note that if I do not set the declared encoding to UTF-8, the result is even worse, with UTF-8 falling from a confidence of 57 to 15.
> This is screwing up 1 out of 84 of my online microdata extraction tests over in Any23 (as that particular page is being rendered into complete gibberish), so I had to implement some hacky workarounds which I'd like to remove if possible.
> EDIT: This issue may be related to TIKA-2737 and [this comment|https://issues.apache.org/jira/browse/TIKA-539?focusedCommentId=13213524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13213524].


