[jira] Created: (TIKA-274) CharsetDetector.setDeclaredEncoding has no effect


Tim Allison (Jira)
CharsetDetector.setDeclaredEncoding has no effect
-------------------------------------------------

                 Key: TIKA-274
                 URL: https://issues.apache.org/jira/browse/TIKA-274
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.4, 0.5
            Reporter: Piotr B.


TXTParser.java contains the following code:

        // Use the declared character encoding, if available
        String encoding = metadata.get(Metadata.CONTENT_ENCODING);
        if (encoding != null) {
            detector.setDeclaredEncoding(encoding);
        }

However, this feature does not seem to be implemented: calling setDeclaredEncoding has no effect on the detection result.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (TIKA-274) CharsetDetector.setDeclaredEncoding has no effect

Tim Allison (Jira)

     [ https://issues.apache.org/jira/browse/TIKA-274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-274.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.5
         Assignee: Jukka Zitting

Hmm, good point. It looks like the feature was never implemented in the ICU4J code that we're using.

I modified the TXTParser code in revision 813624 so that we now always use the given encoding as the default in case the automatic encoding detection fails.
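
For illustration, the fallback described above could be sketched roughly as follows. This is not the code from revision 813624; the class name EncodingFallback, the method chooseCharset, and the final UTF-8 default are assumptions made for this example, and only the JDK charset API is used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika or ICU4J.
class EncodingFallback {

    // Returns the named charset, or null if the name is missing, illegal, or unsupported.
    public static Charset lookup(String name) {
        if (name == null) {
            return null;
        }
        try {
            return Charset.forName(name.trim());
        } catch (IllegalArgumentException e) {
            // Covers IllegalCharsetNameException and UnsupportedCharsetException.
            return null;
        }
    }

    // detected: name reported by automatic detection, or null if detection failed.
    // declared: name taken from the metadata, or null if none was given.
    public static Charset chooseCharset(String detected, String declared) {
        Charset charset = lookup(detected);
        if (charset == null) {
            // Detection failed: fall back to the declared encoding.
            charset = lookup(declared);
        }
        if (charset == null) {
            // Assumption for this sketch: UTF-8 as the last resort.
            charset = StandardCharsets.UTF_8;
        }
        return charset;
    }

    public static void main(String[] args) {
        // Detection failed, so the declared encoding is used.
        System.out.println(chooseCharset(null, "ISO-8859-1"));
        // Detection succeeded, so the declared encoding is ignored.
        System.out.println(chooseCharset("UTF-8", "ISO-8859-1"));
    }
}
```

In this sketch the declared encoding only matters when detection yields nothing usable; a successful detection always wins.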

This behavior could be further improved by making the encoding hint affect the detection code, for example when choosing between the highly similar ISO-8859-X character sets. Please file a new improvement issue if you have a concrete use case where this would be beneficial.
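
One way such a hint could work, purely as a sketch: if the declared encoding is among the detector's candidates and scores within a small margin of the best one, prefer it. Nothing below exists in Tika or ICU4J; the Candidate class, the pick method, and the margin parameter are all hypothetical names invented for this example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of a hint-aware choice between detection candidates.
class HintedChoice {

    // A detection candidate: charset name plus a confidence score (higher is better).
    static final class Candidate {
        final String name;
        final int confidence;

        Candidate(String name, int confidence) {
            this.name = name;
            this.confidence = confidence;
        }
    }

    // Returns the declared encoding if it is a candidate within `margin` of the
    // best score, otherwise the best candidate. With no candidates at all, the
    // declared encoding (possibly null) is returned unchanged.
    public static String pick(List<Candidate> candidates, String declared, int margin) {
        Candidate best = null;
        for (Candidate c : candidates) {
            if (best == null || c.confidence > best.confidence) {
                best = c;
            }
        }
        if (best == null) {
            return declared;
        }
        if (declared != null) {
            for (Candidate c : candidates) {
                if (c.name.equalsIgnoreCase(declared)
                        && best.confidence - c.confidence <= margin) {
                    return c.name;
                }
            }
        }
        return best.name;
    }

    public static void main(String[] args) {
        List<Candidate> candidates = Arrays.asList(
                new Candidate("ISO-8859-1", 40),
                new Candidate("ISO-8859-2", 38));
        // The hint tips the balance between two near-equal candidates.
        System.out.println(pick(candidates, "iso-8859-2", 5));
        // Without a hint, the highest-scoring candidate is chosen.
        System.out.println(pick(candidates, null, 5));
    }
}
```

The margin keeps the hint from overriding a clear detection result; a suitable value would have to be tuned against real data.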

