[jira] Created: (TIKA-334) HtmlParser should use CharsetDetector whenever no charset is specified via meta http-equiv tag

[jira] Created: (TIKA-334) HtmlParser should use CharsetDetector whenever no charset is specified via meta http-equiv tag

JIRA jira@apache.org
HtmlParser should use CharsetDetector whenever no charset is specified via meta http-equiv tag
----------------------------------------------------------------------------------------------

                 Key: TIKA-334
                 URL: https://issues.apache.org/jira/browse/TIKA-334
             Project: Tika
          Issue Type: Improvement
    Affects Versions: 0.5
            Reporter: Ken Krugler


Currently the HtmlParser will just call TagSoup to parse, without specifying a charset, if no charset is passed in via metadata.

TagSoup uses the platform encoding in this case, which is often going to be wrong.

The right thing to do is to first check for a charset specified by a meta tag. If none is found, create a CharsetDetector; if the incoming metadata specifies a charset, pass it to setDeclaredEncoding() as a hint for the detector.
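
The order of checks described above can be sketched roughly as follows. This is only an illustration, not the code in TIKA-334.patch: the class name, the regular expression, and the BiFunction standing in for the CharsetDetector are all assumptions made for the example. In a real implementation the fallback would presumably call setDeclaredEncoding() with the metadata charset, hand the bytes to the detector, and use its best match.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.function.BiFunction;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HtmlCharsetSketch {
    // Hypothetical pattern: finds charset=... inside a meta http-equiv Content-Type tag.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta\\s+[^>]*http-equiv\\s*=\\s*[\"']?content-type[\"']?[^>]*"
            + "charset\\s*=\\s*([a-z0-9][a-z0-9_:.+\\-]*)",
        Pattern.CASE_INSENSITIVE);

    /**
     * prefix:   the first bytes of the HTML document
     * declared: charset from the incoming metadata, or null if none was given
     * detector: stand-in for a CharsetDetector; receives the bytes and the
     *           declared charset (as a hint) and returns a charset name
     */
    static String resolveCharset(byte[] prefix, String declared,
                                 BiFunction<byte[], String, String> detector) {
        // ISO-8859-1 maps every byte to one char, so decoding cannot fail,
        // and the meta tag itself is plain ASCII.
        String head = new String(prefix, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return m.group(1); // step 1: the meta tag wins
        }
        // step 2: no usable meta tag, so fall back to detection, using the
        // metadata charset only as a hint
        return detector.apply(prefix, declared);
    }

    public static void main(String[] args) {
        byte[] html = "<html><head><title>no meta tag</title></head></html>"
            .getBytes(StandardCharsets.US_ASCII);
        // Stub detector for the demo: trusts the hint, else guesses windows-1252.
        BiFunction<byte[], String, String> stub =
            (bytes, hint) -> hint != null ? hint : "windows-1252";
        System.out.println(resolveCharset(html, "ISO-8859-2", stub));
    }
}
```

Note that the platform default encoding never enters into this; it is only what TagSoup falls back to when no charset is supplied.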


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (TIKA-334) HtmlParser should use CharsetDetector whenever no charset is specified via meta http-equiv tag

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler updated TIKA-334:
-----------------------------

    Attachment: TIKA-334.patch


[jira] Resolved: (TIKA-334) HtmlParser should use CharsetDetector whenever no charset is specified via meta http-equiv tag

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-334.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.6
         Assignee: Jukka Zitting

Patch applied in revision 885308. Thanks!

PS. I updated the test case to use a Unicode escape for the non-ASCII character to avoid problems with source encoding.
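
For readers unfamiliar with the trick mentioned in the PS, here is a minimal sketch of what a Unicode escape in Java source looks like. The actual character used in the test case is not given in this thread, so U+00E4 and the class name below are assumptions chosen for illustration.

```java
import java.nio.charset.Charset;

class UnicodeEscapeSketch {
    // U+00E4 (a with diaeresis), written as an ASCII-only escape, so the
    // compiled string does not depend on the encoding javac assumes for
    // the source file.
    static final String NON_ASCII = "\u00e4";

    // Encodes the one-character string in the given charset.
    static byte[] encode(Charset cs) {
        return NON_ASCII.getBytes(cs);
    }
}
```

Had the character been typed literally, a source file saved as UTF-8 but compiled under a different default encoding could yield a different string, and the test would fail for reasons unrelated to the parser.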
