[jira] [Comment Edited] (TIKA-3005) Unintelligible text content from PDF file


Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16990125#comment-16990125 ]

Tilman Hausherr edited comment on TIKA-3005 at 12/6/19 7:53 PM:
----------------------------------------------------------------

Yeah, "identity" is incorrect here; it is just a bad map. But there are other files where the assumption is correct, so we get a good extraction. With this file, PDFBox doesn't see this as a "bad extraction".

What you could do is parse the ToUnicode stream yourself and then make an assumption that differs from the voodoo we do in PDFont.loadUnicodeCmap(). Of course, Tika would then be slower because the ToUnicode stream would be parsed twice.
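
As a rough illustration of parsing the ToUnicode stream yourself, here is a minimal sketch in plain Java. It assumes the stream has already been decoded to text, handles only bfchar sections (bfrange is ignored), and expects destination values as 4-digit UTF-16BE hex groups. The class name ToUnicodeSketch and the identityFraction heuristic are hypothetical and are not what PDFBox does internally; they only show one way a caller could form its own assumption about a map.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox or Tika.
class ToUnicodeSketch {

    // One "N beginbfchar ... endbfchar" section of a decoded ToUnicode CMap.
    private static final Pattern SECTION =
            Pattern.compile("beginbfchar(.*?)endbfchar", Pattern.DOTALL);
    // One "<srcCode> <dstUnicode>" pair inside such a section.
    private static final Pattern PAIR =
            Pattern.compile("<([0-9A-Fa-f]+)>\\s*<([0-9A-Fa-f]+)>");

    /** Collects bfchar mappings; bfrange sections are skipped in this sketch. */
    static Map<Integer, String> parseBfChars(String cmapText) {
        Map<Integer, String> map = new LinkedHashMap<>();
        Matcher section = SECTION.matcher(cmapText);
        while (section.find()) {
            Matcher pair = PAIR.matcher(section.group(1));
            while (pair.find()) {
                // Assumption: source codes are short (1-2 bytes), so int is enough.
                int code = Integer.parseInt(pair.group(1), 16);
                map.put(code, decodeUtf16Be(pair.group(2)));
            }
        }
        return map;
    }

    // Assumption: the destination is a sequence of 4-digit UTF-16BE code units.
    private static String decodeUtf16Be(String hex) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 4 <= hex.length(); i += 4) {
            sb.append((char) Integer.parseInt(hex.substring(i, i + 4), 16));
        }
        return sb.toString();
    }

    /** Fraction of entries whose Unicode value equals the character code. */
    static double identityFraction(Map<Integer, String> map) {
        if (map.isEmpty()) {
            return 0.0;
        }
        int same = 0;
        for (Map.Entry<Integer, String> e : map.entrySet()) {
            String v = e.getValue();
            if (v.length() == 1 && v.charAt(0) == e.getKey()) {
                same++;
            }
        }
        return (double) same / map.size();
    }
}
```

A caller could feed the decoded stream text to parseBfChars and use identityFraction, or any other check, to decide whether to trust the map. Any threshold for that decision would be a guess, and this second pass over the stream is the extra cost mentioned above.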


was (Author: tilman):
Yeah, "identity" is incorrect here; it is just a bad map. But there are other files where the assumption is correct, so we get a good extraction. With this file, PDFBox doesn't see this as a "bad extraction".

What you could do is parse the ToUnicode stream yourself and then make an assumption that differs from the voodoo we do in PDFont.loadUnicodeCmap(). Of course, Tika would then be slower because the cmap would be parsed twice.

> Unintelligible text content from PDF file
> -----------------------------------------
>
>                 Key: TIKA-3005
>                 URL: https://issues.apache.org/jira/browse/TIKA-3005
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.22
>            Reporter: Jorge Spinsanti
>            Priority: Major
>         Attachments: file1.pdf, file2.pdf, file3.pdf, resume_4.pdf
>
>
> When I extract the text content from the attachments, Tika doesn't fail, but the content is unintelligible.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)