[jira] Created: (TIKA-341) Use charset in CONTENT_TYPE metadata when detecting the character encoding


[jira] Created: (TIKA-341) Use charset in CONTENT_TYPE metadata when detecting the character encoding

JIRA jira@apache.org
Use charset in CONTENT_TYPE metadata when detecting the character encoding
--------------------------------------------------------------------------

                 Key: TIKA-341
                 URL: https://issues.apache.org/jira/browse/TIKA-341
             Project: Tika
          Issue Type: Improvement
    Affects Versions: 0.6
            Reporter: Ken Krugler


If no content encoding is specified and, for HTML pages, there is no explicit charset in the meta http-equiv tag, then the charset from the content-type metadata should be used as the "declared encoding" for the CharsetDetector.

Relatedly, the CharsetDetector should have input filtering turned on for HTML pages, so that tags are stripped out before detection.
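
The precedence described above could be sketched roughly as follows. This is only an illustration using the Java standard library, not the attached patch: the class name DeclaredEncoding, its method names, and the regular expression are all hypothetical, and the three string arguments stand in for whatever values the parser has already collected.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Tika): picks the "declared encoding" hint.
final class DeclaredEncoding {
    // Matches a charset parameter such as: text/html; charset="ISO-8859-1"
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("(?i);\\s*charset\\s*=\\s*\"?([^\\s;\"]+)\"?");

    // Returns the charset parameter of a content-type value, or null if absent.
    static String charsetFromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    // Precedence from the issue text: explicit content encoding first,
    // then the HTML meta charset, then the content-type charset as fallback.
    static String choose(String contentEncoding, String metaCharset, String contentType) {
        if (isSet(contentEncoding)) {
            return contentEncoding.trim();
        }
        if (isSet(metaCharset)) {
            return metaCharset.trim();
        }
        return charsetFromContentType(contentType);
    }

    private static boolean isSet(String s) {
        return s != null && !s.trim().isEmpty();
    }
}
```

With an ICU4J-style CharsetDetector, the chosen value would presumably be handed over via setDeclaredEncoding(...), and enableInputFilter(true) would be called for HTML input before detect(); how the actual patch wires this up is not shown in this thread.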


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (TIKA-341) Use charset in CONTENT_TYPE metadata when detecting the character encoding

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler updated TIKA-341:
-----------------------------

    Priority: Minor  (was: Major)




[jira] Updated: (TIKA-341) Use charset in CONTENT_TYPE metadata when detecting the character encoding

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler updated TIKA-341:
-----------------------------

    Attachment: TIKA-341.patch




[jira] Resolved: (TIKA-341) Use charset in CONTENT_TYPE metadata when detecting the character encoding

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-341.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.6
         Assignee: Jukka Zitting

Patch applied in revision 890014.

> Use charset in CONTENT_TYPE metadata when detecting the character encoding
> --------------------------------------------------------------------------
>
>                 Key: TIKA-341
>                 URL: https://issues.apache.org/jira/browse/TIKA-341
>             Project: Tika
>          Issue Type: Improvement
>    Affects Versions: 0.6
>            Reporter: Ken Krugler
>            Assignee: Jukka Zitting
>            Priority: Minor
>             Fix For: 0.6
>
>         Attachments: TIKA-341.patch
>
>
> If no content encoding is specified, and (for HTML pages) there's no explicit charset in the meta http-equiv tag, then the charset in the content-type metadata should be used as the "declared encoding" for the CharsetDetector.
> Related to this is that the CharsetDetector should have filtering turned on for HTML pages, so that tags get stripped out.
