[jira] Created: (TIKA-324) Tika CLI mangles utf-8 content in text (-t) mode

[jira] Created: (TIKA-324) Tika CLI mangles utf-8 content in text (-t) mode

JIRA jira@apache.org
Tika CLI mangles utf-8 content in text (-t) mode
------------------------------------------------

                 Key: TIKA-324
                 URL: https://issues.apache.org/jira/browse/TIKA-324
             Project: Tika
          Issue Type: Bug
          Components: cli
    Affects Versions: 0.4, 0.3
         Environment: Mac OS 10.5, java version "1.6.0_15"
            Reporter: Peter Wolanin
            Priority: Critical
             Fix For: 0.5
         Attachments: test.txt


When using the -t flag to tika, multi-byte content is destroyed in the output.

Example:

{code}
$ java -jar tika-app-0.4.jar -t ./test.txt
I?t?rn?ti?n?liz?ti?n

$ java -jar tika-app-0.4.jar -x ./test.txt
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title/>
</head>
<body>
<p>Iñtërnâtiônàlizætiøn
</p>
</body>
</html>
{code}


see also:  http://drupal.org/node/622508#comment-2267918

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (TIKA-324) Tika CLI mangles utf-8 content in text (-t) mode

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Wolanin updated TIKA-324:
-------------------------------

    Attachment: test.txt

Attaching a little test text file.

> Tika CLI mangles utf-8 content in text (-t) mode
> ------------------------------------------------
>
>                 Key: TIKA-324
>                 URL: https://issues.apache.org/jira/browse/TIKA-324
>             Project: Tika
>          Issue Type: Bug
>          Components: cli
>    Affects Versions: 0.3, 0.4
>         Environment: Mac OS 10.5, java version "1.6.0_15"
>            Reporter: Peter Wolanin
>            Priority: Critical
>             Fix For: 0.5
>
>         Attachments: test.txt
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> When using the -t flag to tika, multi-byte content is destroyed in the output.
> Example:
> {code}
> $ java -jar tika-app-0.4.jar -t ./test.txt
> I?t?rn?ti?n?liz?ti?n
> $ java -jar tika-app-0.4.jar -x ./test.txt
> <?xml version="1.0" encoding="UTF-8"?>
> <html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <title/>
> </head>
> <body>
> <p>Iñtërnâtiônàlizætiøn
> </p>
> </body>
> </html>
> {code}
> see also:  http://drupal.org/node/622508#comment-2267918


[jira] Commented: (TIKA-324) Tika CLI mangles utf-8 content in text (-t) mode

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778133#action_12778133 ]

Peter Wolanin commented on TIKA-324:
------------------------------------

Examining the TikaCLI.java code, the XHTML and text outputs are handled very differently.  I'm not sure why the text one fails, but it seems to be easily rectified by applying the transformer using "text" as the method.
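
For illustration, here is a rough sketch of what "applying the transformer using text as the method" can look like with the standard JAXP classes. This is not the attached patch: the class name TextMethodSketch and the helper writeAsText are made up for this example, and the single <p> element only stands in for whatever SAX events a parser would produce.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

final class TextMethodSketch {

    // Hypothetical helper (not from TikaCLI.java): pushes one paragraph of
    // text through a TransformerHandler set to the "text" output method.
    static void writeAsText(String text, OutputStream out) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        // "text" drops the markup and emits only character data.
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "text");
        // Encode explicitly instead of relying on the platform default.
        handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        handler.setResult(new StreamResult(out));

        // Minimal stand-in for the SAX events a parser would send.
        handler.startDocument();
        handler.startElement("", "p", "p", new AttributesImpl());
        handler.characters(text.toCharArray(), 0, text.length());
        handler.endElement("", "p", "p");
        handler.endDocument();
    }

    public static void main(String[] args) throws Exception {
        String test = "I\u00f1t\u00ebrn\u00e2ti\u00f4n\u00e0liz\u00e6ti\u00f8n";
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        writeAsText(test, buffer);
        // Decode with the same charset that was used for encoding.
        System.out.println(test.equals(buffer.toString("UTF-8")));
    }
}
```

Because the bytes are written as UTF-8 regardless of the platform default, this only displays correctly on a console that expects UTF-8.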


[jira] Updated: (TIKA-324) Tika CLI mangles utf-8 content in text (-t) mode

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Wolanin updated TIKA-324:
-------------------------------

    Attachment: TIKA-324.patch


Attached is a patch against Tika 0.4.  It resolves the bug for me, at least for the simple test case.

{code}
$ java -jar tika-app-0.4.jar -t ./test.txt
Iñtërnâtiônàlizætiøn

$ java -jar tika-app-0.4.jar -x ./test.txt
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title/>
</head>
<body>
<p>Iñtërnâtiônàlizætiøn
</p>
</body>
</html>
{code}




[jira] Issue Comment Edited: (TIKA-324) Tika CLI mangles utf-8 content in text (-t) mode

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778134#action_12778134 ]

Peter Wolanin edited comment on TIKA-324 at 11/15/09 6:01 PM:
--------------------------------------------------------------


Attached is a patch against Tika 0.4.  It resolves the bug for me, at least for the simple test case.


$ java -jar tika-app-0.4.jar -t ./test.txt
Iñtërnâtiônàlizætiøn

$ java -jar tika-app-0.4.jar -x ./test.txt
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title/>
</head>
<body>
<p>Iñtërnâtiônàlizætiøn
</p>
</body>
</html>





[jira] Commented: (TIKA-324) Tika CLI mangles utf-8 content in text (-t) mode

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778135#action_12778135 ]

Peter Wolanin commented on TIKA-324:
------------------------------------

note:  test string origin is:  http://intertwingly.net/stories/2004/04/14/i18n.html


[jira] Updated: (TIKA-324) Tika CLI mangles utf-8 content in text (-t) mode

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Wolanin updated TIKA-324:
-------------------------------

    Description:

When using the -t flag to tika, multi-byte content is destroyed in the output.

Example:


$ java -jar tika-app-0.4.jar -t ./test.txt
I?t?rn?ti?n?liz?ti?n

$ java -jar tika-app-0.4.jar -x ./test.txt
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title/>
</head>
<body>
<p>Iñtërnâtiônàlizætiøn
</p>
</body>
</html>



see also:  http://drupal.org/node/622508#comment-2267918


The bug is confirmed as present in 0.3 also.


[jira] Issue Comment Edited: (TIKA-324) Tika CLI mangles utf-8 content in text (-t) mode

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778140#action_12778140 ]

Peter Wolanin edited comment on TIKA-324 at 11/15/09 6:20 PM:
--------------------------------------------------------------

The bug is confirmed as present in 0.3 also.

$ java -jar tika-0.3.jar -t ./test.txt
I?t?rn?ti?n?liz?ti?n



[jira] Commented: (TIKA-324) Tika CLI mangles utf-8 content in text (-t) mode

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778144#action_12778144 ]

Peter Wolanin commented on TIKA-324:
------------------------------------

The code in the TikaCLI.java seems to have changed in trunk - not clear if the bug is still present.

TikaGUI.java has something very similar to the code as altered in this patch, yet it correctly renders the test string in 0.4.  The output seems to go via a StringWriter rather than directly to System.out, which may make the difference?


[jira] Commented: (TIKA-324) Tika CLI mangles utf-8 content in text (-t) mode

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778148#action_12778148 ]

Peter Wolanin commented on TIKA-324:
------------------------------------

Bug is still present in trunk (and code tagged for 0.5)


$ java -jar tika-app/target/tika-app-0.6-SNAPSHOT.jar -t ./test.txt
I?t?rn?ti?n?liz?ti?n




[jira] Updated: (TIKA-324) Tika CLI mangles utf-8 content in text (-t) mode

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Wolanin updated TIKA-324:
-------------------------------

    Attachment: TIKA-324.patch
                TIKA-324-0.5.patch


Here is a patch for tika 0.5/trunk that resolves the bug (1 line change) and a revised patch for 0.4 that sets indent to "true" for consistency.

For a quick test PDF - look at:  http://nlp.stanford.edu/IR-book/pdf/00front.pdf

Without the patch, the math symbols like ω,ωk are obliterated.


[jira] Updated: (TIKA-324) Tika CLI mangles utf-8 content in text (-t) mode (on Mac OS X)

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated TIKA-324:
-------------------------------

    Affects Version/s: 0.5
        Fix Version/s:     (was: 0.5)
              Summary: Tika CLI mangles utf-8 content in text (-t) mode (on Mac OS X)  (was: Tika CLI mangles utf-8 content in text (-t) mode)

The problem here is that the default --text output uses the default encoding, which AFAIK for Java on Mac OS X is MacRoman even though OS X otherwise uses UTF-8. This is why all non-MacRoman characters get turned to question marks when printed to the console.
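
The question marks can be reproduced without Tika. The sketch below is only an illustration: ReplacementDemo and roundTrip are made-up names, and US-ASCII is used as a stand-in for "a default encoding that cannot represent the characters", since MacRoman is not guaranteed to be available in every Java runtime.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ReplacementDemo {

    // The test string from this issue, written with escapes so the example
    // does not depend on the encoding of the source file.
    static final String TEST =
            "I\u00f1t\u00ebrn\u00e2ti\u00f4n\u00e0liz\u00e6ti\u00f8n";

    // Encodes and decodes with the same charset. String.getBytes(Charset)
    // substitutes the charset's replacement byte ('?' for US-ASCII) for
    // every character it cannot map.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        Charset[] candidates = {
            StandardCharsets.US_ASCII,
            StandardCharsets.ISO_8859_1,
            StandardCharsets.UTF_8
        };
        for (Charset charset : candidates) {
            boolean lossless = TEST.equals(roundTrip(TEST, charset));
            System.out.println(charset.name()
                    + " canEncode=" + charset.newEncoder().canEncode(TEST)
                    + " lossless=" + lossless);
        }
        // Safe to print on any console: only ASCII characters remain.
        System.out.println(roundTrip(TEST, StandardCharsets.US_ASCII));
    }
}
```

With US-ASCII the round trip yields exactly the mangled string from the report, while UTF-8 and ISO-8859-1 keep the text intact.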

The proposed patch changes the output encoding to UTF-8 (the default for TransformerHandler) for all platforms, which can cause problems on platforms with different default encodings.

The reason why --text behaves differently from --xml and --html is that the latter are considered to output binary data that either contains explicit encoding information (<?xml version="1.0" encoding="UTF-8"?> for --xml) or works around the encoding issue in other ways (I&ntilde;t&euml;rn&acirc;ti&ocirc;n&agrave;liz&aelig;ti&oslash;n for --html). The --text output is expected to be consumable by default text processing tools of the platform (grep  "Iñtërnâtiônàlizætiøn"), so it needs to use the correct character encoding.

To avoid breaking things on other platforms, I suggest that we only override the default encoding on Mac OS X, like this:

Index: tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java
===================================================================
--- tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java (revision 880772)
+++ tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java (working copy)
@@ -228,6 +228,10 @@
             throws UnsupportedEncodingException {
         if (encoding != null) {
             return new OutputStreamWriter(System.out, encoding);
+        } else if (System.getProperty("os.name")
+                .toLowerCase().startsWith("mac os x")) {
+            // TIKA-324: Override the default encoding on Mac OS X
+            return new OutputStreamWriter(System.out, "UTF-8");
         } else {
             return new OutputStreamWriter(System.out);
         }
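
To try the selection logic from the diff above outside of Tika, a standalone sketch might look like this. The class OutputWriterSketch, the method chooseWriter and its out and osName parameters are assumptions made for this example; the real method in TikaCLI.java writes to System.out and reads the os.name property itself.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.Locale;

final class OutputWriterSketch {

    // Hypothetical standalone version of the writer selection in the diff.
    static Writer chooseWriter(OutputStream out, String encoding, String osName)
            throws IOException {
        if (encoding != null) {
            // An explicitly requested encoding always wins.
            return new OutputStreamWriter(out, encoding);
        } else if (osName.toLowerCase(Locale.ROOT).startsWith("mac os x")) {
            // TIKA-324: override the default encoding on Mac OS X.
            return new OutputStreamWriter(out, "UTF-8");
        } else {
            // Everywhere else, keep the platform default.
            return new OutputStreamWriter(out);
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        Writer writer = chooseWriter(buffer, null, "Mac OS X");
        writer.write("\u00f1");
        writer.flush();
        // U+00F1 takes two bytes in UTF-8 (0xC3 0xB1).
        System.out.println(buffer.size());
    }
}
```

Passing the stream and the OS name as parameters is only there to make the branches testable; it does not change the decision made in the proposed patch.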



[jira] Updated: (TIKA-324) Tika CLI mangles utf-8 content in text (-t) mode (on Mac OS X)

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated TIKA-324:
-------------------------------

    Attachment: TIKA-324-macosx.patch

Hmm, the above patch doesn't look too good in HTML. I've attached it as the TIKA-324-macosx.patch file.


[jira] Commented: (TIKA-324) Tika CLI mangles utf-8 content in text (-t) mode (on Mac OS X)

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778923#action_12778923 ]

Jukka Zitting commented on TIKA-324:
------------------------------------

This fix won't make it in Tika 0.5, but see TIKA-277 for the new --encoding option that allows you to work around this issue:

$ java -jar tika-app-0.5.jar --text --encoding=UTF-8 ./test.txt
Iñtërnâtiônàlizætiøn


[jira] Commented: (TIKA-324) Tika CLI mangles utf-8 content in text (-t) mode (on Mac OS X)

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778926#action_12778926 ]

Peter Wolanin commented on TIKA-324:
------------------------------------


In fact, for Tika 0.4 it already seems to work if this option is passed to java:


-Dfile.encoding=UTF8

$ java -Dfile.encoding=UTF8 -jar orig-tika-app-0.4.jar -t ./test.txt
Iñtërnâtiônàlizætiøn



[jira] Commented: (TIKA-324) Tika CLI mangles utf-8 content in text (-t) mode (on Mac OS X)

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778928#action_12778928 ]

Peter Wolanin commented on TIKA-324:
------------------------------------

Also, this is not a Mac-only problem: I have the same issue, for example, on CentOS using java version "1.6.0_04".

[root@i:~] java -jar tika-app-0.4.jar -t test.txt
I?t?rn?ti?n?liz?ti?n



[jira] Commented: (TIKA-324) Tika CLI mangles utf-8 content in text (-t) mode (on Mac OS X)

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778945#action_12778945 ]

Jukka Zitting commented on TIKA-324:
------------------------------------

Yes, the -Dfile.encoding option forces Java to use that encoding as the default for all IO.

Can you check what your default encoding is on CentOS? That's normally set in the LANG environment variable, so running "echo $LANG" should give the information.
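To illustrate why a missing or non-UTF-8 default encoding turns accented characters into question marks, here is a minimal Java sketch. It assumes the text output is encoded with the platform default charset, and that this default is probably US-ASCII when LANG is empty; the class and method names are hypothetical and are not taken from the Tika code.

```java
import java.nio.charset.Charset;

public class DefaultEncodingDemo {

    // Encode the text with the given charset and decode it again with the
    // same charset. Characters the charset cannot represent are replaced
    // by the charset's replacement byte ('?' for US-ASCII).
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // "Internationalization" with accents, written with Unicode escapes
        // so the example does not depend on the source file encoding.
        String text = "I\u00f1t\u00ebrn\u00e2ti\u00f4n\u00e0liz\u00e6ti\u00f8n";

        // The default is derived from the environment (LANG) or from
        // -Dfile.encoding when that option is given.
        System.out.println("default charset: " + Charset.defaultCharset());

        // ASCII cannot represent the accented letters.
        System.out.println(roundTrip(text, Charset.forName("US-ASCII")));

        // UTF-8 can represent every character, so the text survives.
        System.out.println(roundTrip(text, Charset.forName("UTF-8")).equals(text));
    }
}
```

Running this with different LANG values, or with -Dfile.encoding=UTF-8, should change only the first printed line; the ASCII round trip always yields I?t?rn?ti?n?liz?ti?n, which matches the mangled output reported in this issue.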



[jira] Commented: (TIKA-324) Tika CLI mangles utf-8 content in text (-t) mode (on Mac OS X)

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778951#action_12778951 ]

Peter Wolanin commented on TIKA-324:
------------------------------------

On Mac OS 10.5 it looks correct:
$ echo $LANG
en_US.UTF-8


On CentOS 5, no value is set:
echo $LANG


If I set that value on CentOS (to the same value as on my Mac), then the output is correct:
[root@i:~] export LANG=en_US.UTF-8
[root@i:~] java -jar tika-app-0.4.jar -t test.txt
Iñtërnâtiônàlizætiøn






[jira] Resolved: (TIKA-324) Tika CLI mangles utf-8 content in text (-t) mode (on Mac OS X)

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-324.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.6
         Assignee: Jukka Zitting

OK. I've committed the latest patch to trunk. The code now never uses the default platform encoding on Mac OS X, opting instead for UTF-8 as the default. People can still override the setting with an explicit --encoding argument.

For the CentOS case I recommend just setting the LANG environment variable correctly, as that's also used by other programs and there is no other easy way for Tika or Java to figure out which encoding should be used on that platform.
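The selection logic described above can be sketched roughly as follows. This is not the committed patch: the class name, the method name, and the way the operating system is detected (via the "os.name" system property) are assumptions made for illustration only.

```java
import java.io.PrintStream;
import java.nio.charset.Charset;

public class OutputEncodingSketch {

    // Pick the output encoding: an explicit setting wins, Mac OS X falls
    // back to UTF-8, and every other platform keeps its default encoding.
    static String chooseEncoding(String explicit, String osName) {
        if (explicit != null) {
            return explicit;
        }
        if (osName != null && osName.startsWith("Mac OS X")) {
            return "UTF-8";
        }
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical usage: the first argument, if any, is the override.
        String explicit = args.length > 0 ? args[0] : null;
        String encoding = chooseEncoding(explicit, System.getProperty("os.name"));

        // Wrap standard output so text is written in the chosen encoding.
        PrintStream out = new PrintStream(System.out, true, encoding);
        out.println("I\u00f1t\u00ebrn\u00e2ti\u00f4n\u00e0liz\u00e6ti\u00f8n");
    }
}
```

On platforms other than Mac OS X the sketch still depends on the platform default, which is why setting LANG correctly remains the recommendation for CentOS.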


[jira] Commented: (TIKA-324) Tika CLI mangles utf-8 content in text (-t) mode (on Mac OS X)

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783134#action_12783134 ]

Peter Wolanin commented on TIKA-324:
------------------------------------

There is a logic bug in the committed code: -encoding= does not work and fails with exceptions like:

Exception in thread "main" java.io.UnsupportedEncodingException: ncoding=UTF-8


Note the "ncoding". I am opening a follow-up issue.
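One plausible way to end up with "ncoding=UTF-8" is an option parser that matches a short "-e" prefix and strips only those two characters. The sketch below reproduces that symptom and shows one possible repair that tests the longer prefixes first. The committed code is not quoted in this thread, so both methods, their names, and the set of accepted spellings are hypothetical.

```java
public class EncodingOptionSketch {

    // Suspected buggy variant: any argument starting with "-e" is treated
    // as the short option, so "-encoding=UTF-8" loses only its first two
    // characters.
    static String parseBuggy(String arg) {
        if (arg.startsWith("-e")) {
            return arg.substring("-e".length());
        }
        return null;
    }

    // Possible fix: check the longer spellings before the short one, so
    // the whole option name is removed from the value.
    static String parseFixed(String arg) {
        String[] prefixes = {"--encoding=", "-encoding=", "-e"};
        for (String prefix : prefixes) {
            if (arg.startsWith(prefix)) {
                return arg.substring(prefix.length());
            }
        }
        return null; // not an encoding option
    }

    public static void main(String[] args) {
        System.out.println(parseBuggy("-encoding=UTF-8"));  // ncoding=UTF-8
        System.out.println(parseFixed("-encoding=UTF-8"));  // UTF-8
        System.out.println(parseFixed("--encoding=UTF-8")); // UTF-8
        System.out.println(parseFixed("-eUTF-8"));          // UTF-8
    }
}
```

Passing the buggy result to anything that looks up a charset by name would fail with an UnsupportedEncodingException like the one shown above.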
