[jira] [Commented] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification



JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541307#comment-16541307 ]

Gerard Bouchar commented on TIKA-2673:
--------------------------------------

[~[hidden email]]: great, thank you very much! Of course I agree to it being merged. I'm sorry for forgetting the license header in the first place.

I have done more work on this over the last few days. I am going to open a pull request to include my latest changes.

We ran an internal test of this and saw great results. We selected a random subset of ~100,000 URLs from a Nutch segment, fetched them once with Nutch, and parsed the fetched pages using different strategies. We also fetched the same URLs using Puppeteer (headless Chrome) and compared the detected charsets. Here are the results:

!https://confluence.qwant.ninja/confluence/download/attachments/25790597/image2018-7-11_16-50-32.png?version=1&modificationDate=1531320645751&api=v2!

standard_noparse is a composite detector with a version of my detector that just takes into account the BOM and HTTP headers, chained with the existing HtmlEncodingDetector, chained with Icu4JEncodingDetector.

standard is a composite detector with the latest version of my detector, chained with Icu4JEncodingDetector.
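
To illustrate what "chained" means here, the following is a minimal, self-contained sketch and not the code from the attachment or from Tika itself. The `Detector` interface, the class name, and the helper methods are hypothetical; the sketch assumes each detector returns null when it cannot decide, that the first non-null answer wins, and that the BOM is checked before the HTTP `Content-Type` header.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Locale;

public class ChainedCharsetDetection {

    /** Hypothetical detector contract (not Tika's API): return null when undecided. */
    interface Detector {
        Charset detect(byte[] content, String contentTypeHeader);
    }

    /** Looks only at a leading byte order mark. */
    static Charset fromBom(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return null;
    }

    /** Looks only at the charset parameter of an HTTP Content-Type header value. */
    static Charset fromHeader(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return null; // unknown or malformed label: leave it to the next detector
                }
            }
        }
        return null;
    }

    /** Runs the detectors in order; the first non-null answer wins. */
    static Charset detect(List<Detector> chain, byte[] content, String contentTypeHeader) {
        for (Detector d : chain) {
            Charset c = d.detect(content, contentTypeHeader);
            if (c != null) {
                return c;
            }
        }
        return null;
    }

    /** BOM, then HTTP header, then whatever fallback detector the caller supplies. */
    static List<Detector> bomAndHeaderThen(Detector fallback) {
        Detector bom = (content, header) -> fromBom(content);
        Detector http = (content, header) -> fromHeader(header);
        return List.of(bom, http, fallback);
    }
}
```

In this sketch, the fallback passed to `bomAndHeaderThen` plays the role that the later detectors in the chain (for instance a statistical one) play in the setups described above.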

Pages labeled "correct" are those for which Chrome and Tika detected the same charset. "similar" means that, although incorrect, the detected charset is close to the one detected by Chrome (ISO-8859-1 instead of WINDOWS-1254, for instance). "wrong" means that the detected charset was not close to the one detected by Chrome.
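
The three labels can be sketched as a small comparison function. This is only an illustration: the class name, the `Verdict` enum, and in particular the `LATIN_FAMILY` set used to decide what counts as "close" are assumptions made for the example, not the criteria used in the test above. Labels are compared case-insensitively as plain strings, so aliases such as `latin1` are not resolved.

```java
import java.util.EnumMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class CharsetAgreement {

    enum Verdict { CORRECT, SIMILAR, WRONG }

    // Assumption for this sketch: these single-byte Latin charsets count as "close" to each other.
    static final Set<String> LATIN_FAMILY =
            Set.of("ISO-8859-1", "WINDOWS-1252", "ISO-8859-9", "WINDOWS-1254", "ISO-8859-15");

    /** Compares the reference charset (e.g. from the browser) with the detected one. */
    static Verdict compare(String reference, String detected) {
        String ref = reference.trim().toUpperCase(Locale.ROOT);
        String det = detected.trim().toUpperCase(Locale.ROOT);
        if (ref.equals(det)) {
            return Verdict.CORRECT;
        }
        if (LATIN_FAMILY.contains(ref) && LATIN_FAMILY.contains(det)) {
            return Verdict.SIMILAR;
        }
        return Verdict.WRONG;
    }

    /** Counts verdicts over {reference, detected} pairs. */
    static Map<Verdict, Integer> tally(String[][] pairs) {
        Map<Verdict, Integer> counts = new EnumMap<>(Verdict.class);
        for (String[] pair : pairs) {
            counts.merge(compare(pair[0], pair[1]), 1, Integer::sum);
        }
        return counts;
    }
}
```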

> HtmlEncodingDetector doesn't follow the specification
> -----------------------------------------------------
>
>                 Key: TIKA-2673
>                 URL: https://issues.apache.org/jira/browse/TIKA-2673
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Gerard Bouchar
>            Priority: Major
>         Attachments: HtmlEncodingDetectorTest.java, StrictHtmlEncodingDetector.tar.gz
>
>
> This bug is linked to TIKA-2671, but it concerns the bytes-based detection itself rather than metadata.
> While reading the specification, I collected a list of sample cases where HtmlEncodingDetector differs from the specification, and thus fails at detecting the right charset.
> I am attaching the test cases to this issue:



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)