[jira] Commented: (TIKA-469) The Parser is not correctly outputting Arabic text documents


Hudson (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995382#comment-12995382 ]

Ken Krugler commented on TIKA-469:
----------------------------------

Hi Robert - do you have an example of an HTML file?

I'm asking because if an HTML document is encoded as UTF-8, the only reasons I can think of for the character encoding to be messed up are (a) the HTML meta tag uses an encoding name that isn't supported by Java, or (b) there is no charset specified in the response header or the HTML meta tags, and the algorithmic detection of the character encoding is also failing.
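To make the two cases concrete, here is a minimal sketch using only the JDK's java.nio.charset API. It is not Tika's actual detection code: the class name CharsetCheck, the methods isUsable and resolve, and the header-then-meta-then-detector ordering are assumptions made for illustration, and the "detected" argument stands in for whatever an algorithmic detector would return.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika.
class CharsetCheck {

    // True only if a charset name was declared and this JVM can decode it.
    static boolean isUsable(String name) {
        if (name == null || name.trim().isEmpty()) {
            return false; // case (b): nothing was declared
        }
        try {
            // Returns false for a well-formed name the JVM does not know: case (a).
            return Charset.isSupported(name.trim());
        } catch (IllegalCharsetNameException e) {
            return false; // the declared name is not even syntactically legal
        }
    }

    // Assumed resolution order: response header, then meta tag, then the
    // result of algorithmic detection supplied by the caller.
    static Charset resolve(String headerCharset, String metaCharset, Charset detected) {
        if (isUsable(headerCharset)) {
            return Charset.forName(headerCharset.trim());
        }
        if (isUsable(metaCharset)) {
            return Charset.forName(metaCharset.trim());
        }
        return detected;
    }

    public static void main(String[] args) {
        // Meta tag declares a supported name, so detection is never consulted.
        System.out.println(resolve(null, "utf-8", StandardCharsets.ISO_8859_1));
        // Unsupported meta name and no header: the outcome depends entirely
        // on the detector, which is where a wrong guess garbles the text.
        System.out.println(resolve(null, "no-such-charset", StandardCharsets.ISO_8859_1));
    }
}
```

In this sketch a UTF-8 document only decodes incorrectly when both declared names are unusable and the detector's guess is wrong, which is why seeing the actual HTML file and its response headers would narrow things down.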

Thanks,

-- Ken

> The Parser is not correctly outputting Arabic text documents
> ------------------------------------------------------------
>
>                 Key: TIKA-469
>                 URL: https://issues.apache.org/jira/browse/TIKA-469
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7
>         Environment: Windows XP
>            Reporter: Robert Cullen
>         Attachments: TEST_WORD.doc, fever_factsheet_arabic.pdf
>
>
> The parser is not preserving the character encoding when parsing documents in Arabic UTF-8, specifically with .pdf and .doc. The resulting character output is undecipherable or just question-mark symbols.

--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira