[jira] Created: (TIKA-273) Content encoding in HtmlParser

[jira] Created: (TIKA-273) Content encoding in HtmlParser

Hudson (Jira)
Content encoding in HtmlParser
------------------------------

                 Key: TIKA-273
                 URL: https://issues.apache.org/jira/browse/TIKA-273
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.4, 0.5
            Reporter: Piotr B.


Sometimes the content encoding is specified outside the HTML document itself, for instance in a MIME mail with an HTML attachment.
This is a problem for text/html documents that have no http-equiv meta tag: currently there is no way to pass the encoding to the parser.

My fix for the parse method in HtmlParser.java:

-        parser.parse(new InputSource(stream));
+        InputSource source = new InputSource(stream);
+        String encoding = metadata.get(Metadata.CONTENT_ENCODING);
+        if (encoding != null) {
+            source.setEncoding(encoding);
+        }
+        parser.parse(source);
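
For readers without a Tika checkout, the idea behind the patch can be sketched with JDK classes only. This is a minimal illustration, not the committed code: a plain Map stands in for Tika's Metadata object, and the class name EncodingHintSketch, the method toInputSource, and the "Content-Encoding" key are assumptions made for the example.

```java
import java.io.InputStream;
import java.util.Map;
import org.xml.sax.InputSource;

// Sketch only: a plain Map stands in for Tika's Metadata object.
final class EncodingHintSketch {

    // Assumed key, named after Metadata.CONTENT_ENCODING in the patch.
    static final String CONTENT_ENCODING = "Content-Encoding";

    private EncodingHintSketch() {}

    // Wraps the stream in an InputSource and forwards the externally
    // supplied encoding, if any, as a hint for the SAX parser.
    static InputSource toInputSource(InputStream stream, Map<String, String> metadata) {
        InputSource source = new InputSource(stream);
        String encoding = metadata.get(CONTENT_ENCODING);
        if (encoding != null) {
            // Only set the hint when one was supplied; otherwise the
            // parser keeps its own detection behaviour.
            source.setEncoding(encoding);
        }
        return source;
    }
}
```

A caller such as a mail extractor would presumably copy the charset from the MIME part headers into the metadata before handing the attachment stream to the parser, so that documents without their own encoding declaration are still decoded as intended.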

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Resolved: (TIKA-273) Content encoding in HtmlParser

Hudson (Jira)

     [ https://issues.apache.org/jira/browse/TIKA-273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-273.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.5
         Assignee: Jukka Zitting

Thanks! Fixed as suggested in revision 813626.
