[jira] Created: (TIKA-40) Tika needs to support diverse character encodings.


David Eric Pugh (Jira)
Tika needs to support diverse character encodings.
--------------------------------------------------

                 Key: TIKA-40
                 URL: https://issues.apache.org/jira/browse/TIKA-40
             Project: Tika
          Issue Type: New Feature
          Components: general
    Affects Versions: 0.1-incubator
            Reporter: Keith R. Bennett
             Fix For: 0.1-incubator


Currently, the text parser implementation uses the default encoding of the Java runtime when instantiating a Reader for the passed input stream.  We need to support other encodings as well.  

It would be helpful if the parse method allowed the caller to specify an encoding.
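To illustrate the difference an explicit encoding makes, here is a minimal sketch using only the JDK. The class and method names (ExplicitEncodingExample, readAll) are hypothetical and are not part of Tika; the point is only that the charset is passed to the InputStreamReader rather than taken from the platform default.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitEncodingExample {

    // Hypothetical helper (not Tika API): decodes the whole stream with the
    // charset chosen by the caller instead of the JVM default.
    public static String readAll(InputStream in, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // "café" encoded as ISO-8859-1: the last character is the single byte 0xE9.
        byte[] bytes = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        // Decoded with the matching charset, the text round-trips.
        System.out.println(readAll(new ByteArrayInputStream(bytes), StandardCharsets.ISO_8859_1));

        // Decoded with a mismatched charset, the last character is corrupted.
        System.out.println(readAll(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8));
    }
}
```

With new InputStreamReader(in) and no charset argument, which of these two outcomes you get depends on how the Java runtime happens to be configured.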

Ideally, Tika would also provide the ability to determine the encoding automatically based on the data stream.  (Unicode files may have byte order marks (http://unicode.org/faq/utf_bom.html#BOM), but I don't know if other encodings can be inferred from content.)
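As a rough sketch of the byte order mark idea, the following JDK-only code checks the first bytes of a buffer against the UTF-8 and UTF-16 BOMs described in the Unicode FAQ linked above. The class name BomSniffer and its methods are hypothetical. UTF-32 is deliberately left out to keep the example short, so a UTF-32LE stream (which starts with FF FE 00 00) would be reported as UTF-16LE here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomSniffer {

    /**
     * Returns the charset implied by a leading byte order mark,
     * or null if the data does not start with a recognized BOM.
     */
    public static Charset detect(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        if (startsWith(data, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        if (startsWith(data, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE;
        }
        return null; // no BOM: the encoding must be determined some other way
    }

    // True if data begins with the given unsigned byte values.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf8WithBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(detect(utf8WithBom));
        System.out.println(detect("plain ascii".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

A missing BOM says nothing about the encoding, which is why content-based detection is still needed for everything else.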


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (TIKA-40) Tika needs to support diverse character encodings.

David Eric Pugh (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-40?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531691 ]

Thilo Goetz commented on TIKA-40:
---------------------------------

ICU (http://www.icu-project.org/) has relatively good automatic code page detection.  Might be worth considering.

[jira] Commented: (TIKA-40) Tika needs to support diverse character encodings.

David Eric Pugh (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-40?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12532604 ]

Jukka Zitting commented on TIKA-40:
-----------------------------------

See http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html for a description of how Mozilla detects character encodings.

Also, the java.nio interfaces are designed to support automatic detection of encodings, but I don't know of any such implementations.
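The java.nio hook presumably meant here is CharsetDecoder.isAutoDetecting(), together with isCharsetDetected() and detectedCharset(); that reading is an assumption, since the comment does not name the methods. A minimal sketch that lists which charsets installed in the running JVM advertise an auto-detecting decoder (the class name AutoDetectingCharsets is hypothetical):

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.util.ArrayList;
import java.util.List;

public class AutoDetectingCharsets {

    // Returns the names of all installed charsets whose decoder reports
    // that it implements an auto-detecting charset.
    public static List<String> find() {
        List<String> names = new ArrayList<>();
        for (Charset charset : Charset.availableCharsets().values()) {
            try {
                CharsetDecoder decoder = charset.newDecoder();
                if (decoder.isAutoDetecting()) {
                    names.add(charset.name());
                }
            } catch (UnsupportedOperationException e) {
                // Defensive: skip any charset that refuses to create a decoder.
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // The result depends on the JVM and its installed charsets.
        System.out.println(find());
    }
}
```

The list may well be empty, and any entries it does contain are ordinary charsets that the JVM ships, so this does not by itself give general-purpose detection.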

[jira] Updated: (TIKA-40) Tika needs to support diverse character encodings.

David Eric Pugh (Jira)

     [ https://issues.apache.org/jira/browse/TIKA-40?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated TIKA-40:
------------------------------

    Attachment: TIKA-40.patch

Attached is a patch (TIKA-40.patch) that uses ICU4J to automatically detect the character encoding of the parsed stream.

Notably, the modified TXTParser accepts a Metadata.CONTENT_ENCODING hint to be passed in as a part of the metadata object.

Also, the parser will set the Metadata.CONTENT_TYPE, Metadata.CONTENT_ENCODING, and even (if available) Metadata.CONTENT_LANGUAGE (and Metadata.LANGUAGE) metadata fields.
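The following is not the code from TIKA-40.patch. It is a simplified, JDK-only sketch of how a caller-supplied encoding hint and a detected charset might be combined, with a plain Map standing in for the metadata object. The class EncodingHintSketch, the method resolve, and the key string "Content-Encoding" are all assumptions made for illustration. The sketch also simply prefers a valid hint over the detected value, whereas a real detector such as ICU4J may only weigh the hint as one input.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class EncodingHintSketch {

    // Hypothetical key standing in for Metadata.CONTENT_ENCODING.
    public static final String CONTENT_ENCODING = "Content-Encoding";

    /**
     * Picks the charset to decode with: a valid hint from the metadata wins,
     * otherwise the detected charset is used. The effective charset name is
     * written back to the metadata so the caller can see what was chosen.
     */
    public static Charset resolve(Map<String, String> metadata, Charset detected) {
        Charset effective = detected;
        String hint = metadata.get(CONTENT_ENCODING);
        if (hint != null) {
            try {
                effective = Charset.forName(hint.trim());
            } catch (IllegalArgumentException e) {
                // Malformed or unsupported charset name: fall back to detection.
            }
        }
        metadata.put(CONTENT_ENCODING, effective.name());
        return effective;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put(CONTENT_ENCODING, "ISO-8859-1");
        // Pretend that content-based detection suggested UTF-8.
        Charset chosen = resolve(metadata, StandardCharsets.UTF_8);
        System.out.println(chosen + " / " + metadata.get(CONTENT_ENCODING));
    }
}
```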

[jira] Resolved: (TIKA-40) Tika needs to support diverse character encodings.

David Eric Pugh (Jira)

     [ https://issues.apache.org/jira/browse/TIKA-40?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-40.
-------------------------------

    Resolution: Fixed
      Assignee: Jukka Zitting

Patch committed in revision 583443.
