[jira] Created: (TIKA-332) Use http-equiv meta tag charset info when processing HTML documents

[jira] Created: (TIKA-332) Use http-equiv meta tag charset info when processing HTML documents

JIRA jira@apache.org
Use http-equiv meta tag charset info when processing HTML documents
-------------------------------------------------------------------

                 Key: TIKA-332
                 URL: https://issues.apache.org/jira/browse/TIKA-332
             Project: Tika
          Issue Type: Improvement
    Affects Versions: 0.5
            Reporter: Ken Krugler
            Priority: Critical


Currently Tika doesn't use the charset info that's optionally present in HTML documents, via the <meta http-equiv="Content-type" content="text/html; charset=xxx"> tag.

If the mime-type is detected as being one that's handled by the HtmlParser, then the first 4-8K of text should be converted from bytes to us-ascii, and then scanned using a regex something like:

    private static final Pattern HTTP_EQUIV_CHARSET_PATTERN = Pattern.compile("<meta\\s+http-equiv\\s*=\\s*['\"]\\s*Content-Type['\"]\\s+content\\s*=\\s*['\"][^;]+;\\s*charset\\s*=\\s*([^'\"]+)\"");

If a charset is detected, it should take precedence over a charset in the HTTP response headers, and (obviously) be used to convert the bytes to text before the actual parsing of the document begins.
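As a rough illustration of the proposed scan, here is a standalone sketch (not actual Tika code; the class and method names are made up, and the pattern is slightly relaxed from the one above to accept either quote style at the end and to match case-insensitively):

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetSniffer {

    // Relaxed variant of the pattern proposed in the issue: still assumes
    // http-equiv comes before content, but closes on either ' or " and
    // ignores case, since real pages use "content-type" etc.
    private static final Pattern HTTP_EQUIV_CHARSET_PATTERN = Pattern.compile(
        "<meta\\s+http-equiv\\s*=\\s*['\"]\\s*Content-Type['\"]\\s+content"
        + "\\s*=\\s*['\"][^;]+;\\s*charset\\s*=\\s*([^'\"]+)['\"]",
        Pattern.CASE_INSENSITIVE);

    // Decode at most the first 8K of bytes as us-ascii (safe for markup,
    // since the tag itself is pure ASCII) and scan for a declared charset.
    // Returns the charset name, or null if the prefix declares none.
    static String sniffCharset(byte[] prefix) {
        int n = Math.min(prefix.length, 8192);
        String ascii = new String(prefix, 0, n, StandardCharsets.US_ASCII);
        Matcher m = HTTP_EQUIV_CHARSET_PATTERN.matcher(ascii);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        byte[] html = ("<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=ISO-8859-1\"></head>")
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniffCharset(html)); // prints ISO-8859-1
    }
}
```

If this returns non-null, the caller would use that name (via Charset.forName) in preference to any charset from the HTTP headers when decoding the full document.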

In a test I did of 100 random HTML pages, roughly 15% contained charset info in the meta tag that wound up being different from the detected or HTTP response header charset, so this is a pretty important improvement to make. Without it, Tika isn't that useful for processing HTML pages.

I believe one of the reasons why ICU4J doesn't do a good job in detecting the charset for HTML pages is that the first 2K+ of HTML text is often all us-ascii markup, versus real content. I'll file a separate issue about how to improve charset detection for HTML pages.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (TIKA-332) Use http-equiv meta tag charset info when processing HTML documents

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782550#action_12782550 ]

Ken Krugler commented on TIKA-332:
----------------------------------

It turns out the HtmlParser code doesn't even use the CharsetDetector support - this is only being used by the TXTParser, as far as I can tell (and incorrectly at that).





[jira] Updated: (TIKA-332) Use http-equiv meta tag charset info when processing HTML documents

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler updated TIKA-332:
-----------------------------

    Description:
Currently Tika doesn't use the charset info that's optionally present in HTML documents, via the <meta http-equiv="Content-type" content="text/html; charset=xxx"> tag.

If the mime-type is detected as being one that's handled by the HtmlParser, then the first 4-8K of text should be converted from bytes to us-ascii, and then scanned using a regex something like:

    private static final Pattern HTTP_EQUIV_CHARSET_PATTERN = Pattern.compile("<meta\\s+http-equiv\\s*=\\s*['\"]\\s*Content-Type['\"]\\s+content\\s*=\\s*['\"][^;]+;\\s*charset\\s*=\\s*([^'\"]+)\"");

If a charset is detected, it should take precedence over a charset in the HTTP response headers, and (obviously) be used to convert the bytes to text before the actual parsing of the document begins.

In a test I did of 100 random HTML pages, roughly 15% contained charset info in the meta tag that wound up being different from the detected or HTTP response header charset, so this is a pretty important improvement to make. Without it, Tika isn't that useful for processing HTML pages.

The other problem is that the HtmlParser code doesn't use the CharsetDetector, which is another reason for lots of incorrect text. I'll file a separate issue about that.






[jira] Updated: (TIKA-332) Use http-equiv meta tag charset info when processing HTML documents

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler updated TIKA-332:
-----------------------------

    Attachment: TIKA-332.patch




[jira] Updated: (TIKA-332) Use http-equiv meta tag charset info when processing HTML documents

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler updated TIKA-332:
-----------------------------

    Attachment: TIKA-332-2.patch

Additional cleanup to the new test, plus others: wrap the <title>, <meta>, and <base> tags in <head> tags for more conformant HTML. Needs to be applied after TIKA-332.patch.





[jira] Resolved: (TIKA-332) Use http-equiv meta tag charset info when processing HTML documents

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-332.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.6
         Assignee: Jukka Zitting

Patches applied in revision 890009.

