[jira] Created: (TIKA-333) Improve accuracy of charset detection for HTML pages


[jira] Created: (TIKA-333) Improve accuracy of charset detection for HTML pages

JIRA jira@apache.org
Improve accuracy of charset detection for HTML pages
----------------------------------------------------

                 Key: TIKA-333
                 URL: https://issues.apache.org/jira/browse/TIKA-333
             Project: Tika
          Issue Type: Improvement
    Affects Versions: 0.5
            Reporter: Ken Krugler
            Priority: Minor


Charset detection for HTML pages is currently unreliable, because so much of the text at the beginning of the document is HTML markup rather than content.

A simple solution would be to skip over the first 2K (assuming the document is long enough) before passing bytes to ICU4J.
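A minimal sketch of the skip-ahead idea, assuming the raw page is already in memory as a byte array. The class and method names here are hypothetical and not part of Tika; only the 2K figure comes from the description above. Documents that are not longer than the skip length are returned whole.

```java
import java.util.Arrays;

public class HtmlDetectionSample {
    /** Number of leading bytes to skip; the 2K figure is taken from the issue text. */
    static final int SKIP_BYTES = 2048;

    /**
     * Returns the bytes to hand to a charset detector (hypothetical helper).
     * If the document is longer than {@code skip}, the first {@code skip}
     * bytes are dropped; otherwise the document is returned unchanged.
     */
    static byte[] sampleForDetection(byte[] document, int skip) {
        if (document.length <= skip) {
            return document; // too short to skip anything
        }
        return Arrays.copyOfRange(document, skip, document.length);
    }
}
```

The result could then be passed to ICU4J's CharsetDetector via setText(byte[]) followed by detect(). Note that a fixed offset can split a multi-byte character, so the first byte or two of the sample may be invalid in the real encoding.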

A more complex solution would be to scan for the title and body tags, and pass only the bytes found inside each to the detector.
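A rough sketch of the tag-scanning idea. It is an assumption-laden illustration, not Tika code: the class name is hypothetical, the tags are located by viewing the bytes as ISO-8859-1 (one char per byte, so match offsets are byte offsets), and that only works for ASCII-compatible encodings, not UTF-16 or UTF-32. Markup nested inside the body is not stripped.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleBodyBytes {
    private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;
    private static final Pattern TITLE =
            Pattern.compile("<title[^>]*>(.*?)</title>", FLAGS);
    // Body runs to </body>, or to end of input if the close tag is missing.
    private static final Pattern BODY =
            Pattern.compile("<body[^>]*>(.*?)(?:</body>|\\z)", FLAGS);

    /** Returns the raw bytes inside the first title and body elements, concatenated. */
    static byte[] extract(byte[] html) {
        // ISO-8859-1 maps every byte to exactly one char, so regex
        // offsets in this view are valid offsets into the byte array.
        String view = new String(html, StandardCharsets.ISO_8859_1);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Pattern p : new Pattern[] {TITLE, BODY}) {
            Matcher m = p.matcher(view);
            if (m.find()) {
                out.write(html, m.start(1), m.end(1) - m.start(1));
            }
        }
        return out.toByteArray();
    }
}
```

If neither tag is found the result is empty, in which case a caller would presumably fall back to the whole document.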


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Closed: (TIKA-333) Improve accuracy of charset detection for HTML pages

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler closed TIKA-333.
----------------------------

    Resolution: Not A Problem

After actually walking through the parse code, I see that the real problem is that the HtmlParser code doesn't use the CharsetDetector. If no charset is passed in, it just calls TagSoup, which by default uses the platform encoding. See [http://home.ccil.org/~cowan/XML/tagsoup/].
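To illustrate why falling back to the platform encoding is a problem, here is a small self-contained sketch. It only models the behavior described above; the class and method are hypothetical and are neither Tika's HtmlParser nor TagSoup code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultEncodingPitfall {
    /**
     * Decodes bytes with the given charset, or with the platform default
     * when no charset is supplied (the fallback described in the comment above).
     */
    static String decode(byte[] bytes, Charset charset) {
        Charset effective = (charset != null) ? charset : Charset.defaultCharset();
        return new String(bytes, effective);
    }

    public static void main(String[] args) {
        byte[] utf8 = "Caf\u00e9".getBytes(StandardCharsets.UTF_8);
        // Correct charset supplied: the text round-trips.
        System.out.println(decode(utf8, StandardCharsets.UTF_8));
        // Wrong charset: the two UTF-8 bytes of the accented letter become two chars.
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1));
        // No charset: the result depends on the machine running the code.
        System.out.println(decode(utf8, null));
    }
}
```

With no charset passed in, the same page can decode differently from one machine to the next, which is why running a detector before parsing matters.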

So I'll open another issue for the HtmlParser.
