[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16519679#comment-16519679 ]

Tim Allison commented on TIKA-2673:

[~gbouchar], thank you for these unit tests! I've added them and made the easy fixes where I could. As you know, doing a full parse is non-trivial, and I'd like evidence from some corpus that the effort is worth it.


If you'd like to contribute a StrictHTMLEncodingDetector, we could compare its performance against what we have now on our 1 TB regression corpus.


If you'd like access to our VM, either to run your own comparisons or to help us curate the corpus and make it more representative of modern websites with diverse languages and encodings, let me know.

> HtmlEncodingDetector doesn't follow the specification
> -----------------------------------------------------
>                 Key: TIKA-2673
>                 URL: https://issues.apache.org/jira/browse/TIKA-2673
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Gerard Bouchar
>            Priority: Major
>         Attachments: HtmlEncodingDetectorTest.java
> This bug is linked to TIKA-2671, but it concerns the bytes-based detection itself rather than metadata.
> While reading the specification, I collected a list of sample cases where HtmlEncodingDetector differs from the specification, and thus fails to detect the right charset.
> I am attaching the test cases to this issue:
