[jira] Created: (TIKA-335) TXTParser use of CharsetDetector has several bugs

[jira] Created: (TIKA-335) TXTParser use of CharsetDetector has several bugs

JIRA jira@apache.org
TXTParser use of CharsetDetector has several bugs
-------------------------------------------------

                 Key: TIKA-335
                 URL: https://issues.apache.org/jira/browse/TIKA-335
             Project: Tika
          Issue Type: Bug
    Affects Versions: 0.5
            Reporter: Ken Krugler


In looking at how TXTParser uses CharsetDetector, I see the following issues:

1. The incoming charset (if any) from metadata should be passed to CharsetDetector.setDeclaredEncoding().
2. The first supported charset should be used, not the last. These are returned in confidence order, from best to worst.
3. The current code might also wind up setting a language from one result, and the charset from another.

So the biggest change is to bail out of the loop once a supported charset has been found.
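
The selection logic described in points 2 and 3 can be sketched as follows. This is a minimal illustration, not the TXTParser code or the ICU API: the Candidate class and the firstSupported helper are hypothetical names, and the candidate list stands in for detector results that are assumed to be sorted best first. Only java.nio.charset.Charset.isSupported is a real library call.

```java
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.List;

public class FirstSupportedCharset {

    /** Hypothetical stand-in for one detector result (charset plus language). */
    static final class Candidate {
        final String charsetName;
        final String language; // may be null when no language was detected

        Candidate(String charsetName, String language) {
            this.charsetName = charsetName;
            this.language = language;
        }
    }

    /** Returns the first candidate whose charset the JVM supports, or null. */
    static Candidate firstSupported(List<Candidate> bestFirst) {
        for (Candidate c : bestFirst) {
            if (isSupported(c.charsetName)) {
                // Stop at the first hit: later entries have lower confidence.
                return c;
            }
        }
        return null;
    }

    static boolean isSupported(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            return false; // null or syntactically illegal charset name
        }
    }

    public static void main(String[] args) {
        List<Candidate> candidates = Arrays.asList(
                new Candidate("x-not-a-real-charset", "en"),
                new Candidate("UTF-8", "pt"),
                new Candidate("ISO-8859-1", "es"));
        Candidate chosen = firstSupported(candidates);
        // Charset and language are read from the same result.
        System.out.println(chosen.charsetName + " " + chosen.language);
    }
}
```

Returning the whole candidate rather than just the charset name keeps the language and the charset tied to the same result, which is the concern raised in point 3.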

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (TIKA-335) TXTParser should use incoming charset

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler updated TIKA-335:
-----------------------------

       Priority: Minor  (was: Major)
    Description:
The incoming charset (if any) from metadata should be passed to CharsetDetector.setDeclaredEncoding().


  was:
In looking at how TXTParser uses CharsetDetector, I see the following issues:

1. The incoming charset (if any) from metadata should be passed to CharsetDetector.setDeclaredEncoding().
2. The first supported charset should be used, not the last. These are returned in confidence order, from best to worst.
3. The current code might also wind up setting a language from one result, and the charset from another.

So the biggest change is to bail out of the loop once a supported charset has been found.

     Issue Type: Improvement  (was: Bug)
        Summary: TXTParser should use incoming charset  (was: TXTParser use of CharsetDetector has several bugs)

> TXTParser should use incoming charset
> -------------------------------------
>
>                 Key: TIKA-335
>                 URL: https://issues.apache.org/jira/browse/TIKA-335
>             Project: Tika
>          Issue Type: Improvement
>    Affects Versions: 0.5
>            Reporter: Ken Krugler
>            Priority: Minor
>
> The incoming charset (if any) from metadata should be passed to CharsetDetector.setDeclaredEncoding().

[jira] Updated: (TIKA-335) TXTParser should use incoming charset

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler updated TIKA-335:
-----------------------------

    Attachment: TIKA-335.patch

This patch also cleans up some generics warnings (sorry about mixing the two; I was going to open a second issue, but the two changes were commingled).

In order to make this work, I had to modify the charset detection code to actually use the hint - weird that ICU never actually implemented this.

Includes a test case for an ambiguous run of text that could be UTF-8 or ISO-8859-1.
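
For readers without the patch at hand, here is one way a declared-encoding hint could influence a ranked list of matches. It is a sketch under stated assumptions and not what the patch or ICU actually does: the Match class, the applyHint method and the HINT_BONUS value are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class DeclaredEncodingHint {

    /** Hypothetical detector result: a charset name and a 0-100 confidence. */
    static final class Match {
        final String name;
        final int confidence;

        Match(String name, int confidence) {
            this.name = name;
            this.confidence = confidence;
        }
    }

    // Hypothetical bonus; the value is arbitrary and chosen for illustration.
    static final int HINT_BONUS = 20;

    /** Returns a new list, best first, with the declared encoding boosted. */
    static List<Match> applyHint(List<Match> matches, String declared) {
        List<Match> out = new ArrayList<>();
        for (Match m : matches) {
            boolean hinted = declared != null && declared.equalsIgnoreCase(m.name);
            int confidence = hinted
                    ? Math.min(100, m.confidence + HINT_BONUS)
                    : m.confidence;
            out.add(new Match(m.name, confidence));
        }
        // List.sort is stable, so equally confident matches keep their order.
        out.sort(Comparator.comparingInt((Match m) -> m.confidence).reversed());
        return out;
    }

    public static void main(String[] args) {
        List<Match> detected = Arrays.asList(
                new Match("UTF-8", 50),
                new Match("ISO-8859-1", 40));
        System.out.println(applyHint(detected, null).get(0).name);
        System.out.println(applyHint(detected, "ISO-8859-1").get(0).name);
    }
}
```

With these made-up confidences, the hint is enough to move ISO-8859-1 ahead of UTF-8, which is the kind of outcome the ambiguous-text test case checks for.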

> TXTParser should use incoming charset
> -------------------------------------
>
>                 Key: TIKA-335
>                 URL: https://issues.apache.org/jira/browse/TIKA-335
>             Project: Tika
>          Issue Type: Improvement
>    Affects Versions: 0.5
>            Reporter: Ken Krugler
>            Priority: Minor
>         Attachments: TIKA-335.patch
>
>
> The incoming charset (if any) from metadata should be passed to CharsetDetector.setDeclaredEncoding().

[jira] Commented: (TIKA-335) TXTParser should use incoming charset

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783511#action_12783511 ]

Jukka Zitting commented on TIKA-335:
------------------------------------

Looks good, thanks!

Is the new UTF-8/ISO-8859-1 test case supposed to pass? I'm getting the following test failure after I apply the patch:

testUseIncomingCharsetAsHint(org.apache.tika.parser.txt.TXTParserTest)  Time elapsed: 0.007 sec  <<< FAILURE!
junit.framework.ComparisonFailure: expected:<ISO-8859-1> but was:<UTF-8>
        at junit.framework.Assert.assertEquals(Assert.java:81)
        at junit.framework.Assert.assertEquals(Assert.java:87)
        at org.apache.tika.parser.txt.TXTParserTest.testUseIncomingCharsetAsHint(TXTParserTest.java:121)


[jira] Commented: (TIKA-335) TXTParser should use incoming charset

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784026#action_12784026 ]

Ken Krugler commented on TIKA-335:
----------------------------------

It should, yes - it passes in both Eclipse and in the Maven build.

Could be another case of a raw UTF-8 character in a string literal, similar to TIKA-334. Try using this in testUseIncomingCharsetAsHint:

        final String test2 = "the name is \u00e1ndre";


[jira] Updated: (TIKA-335) TXTParser should use incoming charset

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler updated TIKA-335:
-----------------------------

    Attachment: TIKA-335-2.patch

Minor improvement to test case - avoid raw non-ASCII (UTF-8 encoded) characters in string literals (use \uxxxx escape sequences instead).
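
A small sketch of why the escape matters. It assumes, as a plausible but unconfirmed explanation of the earlier failure, that the source file was saved as UTF-8 and compiled with a single-byte default encoding. The misreadAsLatin1 helper is hypothetical and only simulates that mismatch.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EscapeDemo {

    // Written with a Unicode escape, so the compiler's source encoding is irrelevant.
    static final String INTENDED = "the name is \u00e1ndre";

    /** Simulates a compiler reading UTF-8 source bytes as ISO-8859-1. */
    static String misreadAsLatin1(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] intendedLatin1 = INTENDED.getBytes(StandardCharsets.ISO_8859_1);
        byte[] intendedUtf8 = INTENDED.getBytes(StandardCharsets.UTF_8);

        String misread = misreadAsLatin1(INTENDED);
        byte[] misreadLatin1 = misread.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(intendedLatin1.length); // 17
        System.out.println(misreadLatin1.length);  // 18
        // The misread string, encoded as ISO-8859-1, is byte-for-byte the
        // UTF-8 encoding of the intended text.
        System.out.println(Arrays.equals(misreadLatin1, intendedUtf8)); // true
    }
}
```

Under that assumption, the bytes handed to the detector would be a valid UTF-8 sequence rather than the intended ISO-8859-1 text, which would be consistent with the "expected:<ISO-8859-1> but was:<UTF-8>" failure reported earlier in the thread.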

> TXTParser should use incoming charset
> -------------------------------------
>
>                 Key: TIKA-335
>                 URL: https://issues.apache.org/jira/browse/TIKA-335
>             Project: Tika
>          Issue Type: Improvement
>    Affects Versions: 0.5
>            Reporter: Ken Krugler
>            Priority: Minor
>         Attachments: TIKA-335-2.patch, TIKA-335.patch
>
>
> The incoming charset (if any) from metadata should be passed to CharsetDetector.setDeclaredEncoding().

[jira] Resolved: (TIKA-335) TXTParser should use incoming charset

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-335.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.6
         Assignee: Jukka Zitting

Thanks, now the test passes for me too. I committed the patches in revision 890011.

> TXTParser should use incoming charset
> -------------------------------------
>
>                 Key: TIKA-335
>                 URL: https://issues.apache.org/jira/browse/TIKA-335
>             Project: Tika
>          Issue Type: Improvement
>    Affects Versions: 0.5
>            Reporter: Ken Krugler
>            Assignee: Jukka Zitting
>            Priority: Minor
>             Fix For: 0.6
>
>         Attachments: TIKA-335-2.patch, TIKA-335.patch
>
>
> The incoming charset (if any) from metadata should be passed to CharsetDetector.setDeclaredEncoding().
