[jira] [Commented] (TIKA-3005) Unintelligible text content from PDF file

Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16990125#comment-16990125 ]

Tilman Hausherr commented on TIKA-3005:
---------------------------------------

Yeah, "identity" is incorrect here; it is just a bad map. But there are other files where the assumption is correct, so we get a good extraction. With this file, PDFBox doesn't see this as a "bad extraction".

What you could do is parse the ToUnicode stream yourself and then make an assumption that is different from the voodoo we do in PDFont.loadUnicodeCmap(). Of course, Tika would then be slower because the cmap would be parsed twice.
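
To illustrate the "parse it yourself" idea, here is a rough sketch using only the JDK. It assumes you already have the decoded ToUnicode stream as text; how you obtain it from the font dictionary is left out. The class ToUnicodeSketch and its methods parse() and looksLikeIdentity() are hypothetical names, not Tika or PDFBox API. It only handles well-formed bfchar/bfrange blocks, and the identity check is just one example of an assumption a caller might test. It is not what PDFBox does internally.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika or PDFBox.
final class ToUnicodeSketch {

    // Hex strings like <0041>, array brackets, and the bfchar/bfrange block keywords.
    private static final Pattern TOKEN = Pattern.compile(
            "<\\s*([0-9A-Fa-f][0-9A-Fa-f\\s]*)>|\\[|\\]|(?:begin|end)bf(?:char|range)");

    /** Maps each character code to its Unicode string, given the decoded CMap text. */
    static Map<Integer, String> parse(String cmapText) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(cmapText);
        while (m.find()) {
            // Hex tokens are stored as "<" + digits so they are easy to tell apart.
            tokens.add(m.group(1) != null ? "<" + m.group(1).replaceAll("\\s", "") : m.group());
        }
        Map<Integer, String> map = new TreeMap<>();
        String mode = "";
        int i = 0;
        while (i < tokens.size()) {
            String t = tokens.get(i);
            if (t.startsWith("begin")) {
                mode = t;
                i++;
            } else if (t.startsWith("end")) {
                mode = "";
                i++;
            } else if (mode.equals("beginbfchar") && isHex(tokens, i) && isHex(tokens, i + 1)) {
                map.put(code(t), utf16(tokens.get(i + 1)));
                i += 2;
            } else if (mode.equals("beginbfrange") && isHex(tokens, i) && isHex(tokens, i + 1)
                    && i + 2 < tokens.size()) {
                int lo = code(t);
                int hi = code(tokens.get(i + 1));
                i += 2;
                if (tokens.get(i).equals("[")) {
                    // Array form: one destination string per code in the range.
                    i++;
                    for (int c = lo; isHex(tokens, i); c++, i++) {
                        map.put(c, utf16(tokens.get(i)));
                    }
                } else if (isHex(tokens, i)) {
                    // Simplification: bump the last UTF-16 code unit for each code in the range.
                    String base = utf16(tokens.get(i));
                    for (int c = lo; c <= hi && !base.isEmpty(); c++) {
                        char last = (char) (base.charAt(base.length() - 1) + (c - lo));
                        map.put(c, base.substring(0, base.length() - 1) + last);
                    }
                    i++;
                }
            } else {
                i++; // anything outside a bf block (e.g. codespace ranges) is ignored
            }
        }
        return map;
    }

    /** Example check only: true if every code maps to the single code point equal to itself. */
    static boolean looksLikeIdentity(Map<Integer, String> map) {
        if (map.isEmpty()) {
            return false;
        }
        for (Map.Entry<Integer, String> e : map.entrySet()) {
            String s = e.getValue();
            if (s.codePointCount(0, s.length()) != 1 || s.codePointAt(0) != e.getKey()) {
                return false;
            }
        }
        return true;
    }

    private static boolean isHex(List<String> tokens, int i) {
        return i < tokens.size() && tokens.get(i).startsWith("<");
    }

    // Assumes the character code fits in an int (true for 1- and 2-byte codes).
    private static int code(String tok) {
        return Integer.parseInt(tok.substring(1), 16);
    }

    private static String utf16(String tok) {
        String hex = tok.substring(1);
        if (hex.length() % 2 != 0) {
            hex += "0"; // a missing final hex digit is treated as 0
        }
        byte[] bytes = new byte[hex.length() / 2];
        for (int k = 0; k < bytes.length; k++) {
            bytes[k] = (byte) Integer.parseInt(hex.substring(2 * k, 2 * k + 2), 16);
        }
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        String sample = "1 beginbfchar <0024> <0041> endbfchar\n"
                + "1 beginbfrange <0010> <0012> <0061> endbfrange";
        Map<Integer, String> map = parse(sample);
        System.out.println(map + " identity-like: " + looksLikeIdentity(map));
    }
}
```

With a map like this in hand, the caller could decide for itself whether to trust the ToUnicode data for a given font, at the cost of the second parse mentioned above.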

> Unintelligible text content from PDF file
> -----------------------------------------
>
>                 Key: TIKA-3005
>                 URL: https://issues.apache.org/jira/browse/TIKA-3005
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.22
>            Reporter: Jorge Spinsanti
>            Priority: Major
>         Attachments: file1.pdf, file2.pdf, file3.pdf, resume_4.pdf
>
>
> If I get text content from attachment, Tika doesn't fail but the content is unintelligible



--
This message was sent by Atlassian Jira
(v8.3.4#803005)