[jira] [Comment Edited] (TIKA-3005) Unintelligible text content from PDF file


Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16992703#comment-16992703 ]

Tim Allison edited comment on TIKA-3005 at 12/10/19 4:18 PM:
-------------------------------------------------------------

[~Giorgy], you can _open_ those PDFs correctly in a viewer, but if you copy text from them or "save as" text, you get junk.  I just checked all 4 in Chrome, at least.  This is the fault of the software that put together those PDFs.

 

To your point "it is impossible to manage the scenario in automatic tools that manages high volumes of files" and [~tilman]'s point "What we'd need is a gibberish detector", I opened an issue for this a long time ago (TIKA-1443), and we now have an imperfect solution in the tika-eval module...which we call the "out of vocabulary" percentage.  

 

In short, we took the top 30k most common words for each of ~130 languages from the Leipzig corpus; we run language detection on the extracted text and then report the percentage of out-of-vocabulary tokens in the text.  If that % is high, the original document could be legitimately full of non-dictionary items _or_ it could be gibberish electronic text.
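The idea above can be sketched in a few lines of Python. This is an illustrative stand-in, not tika-eval's actual implementation: the tokenizer, the toy vocabulary, and the `oov_percentage` function name are all assumptions for the sake of the example.

```python
# Minimal sketch of an "out of vocabulary" percentage check:
# tokenize the extracted text, then count tokens that are not in
# the language's common-words list.
import re

def oov_percentage(text, vocabulary):
    """Return the percentage of tokens not found in the vocabulary."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    oov = sum(1 for t in tokens if t not in vocabulary)
    return 100.0 * oov / len(tokens)

# Toy vocabulary; tika-eval uses the top 30k words per language.
vocab = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}

print(oov_percentage("The quick brown fox", vocab))    # 0.0 -- clean text
print(oov_percentage("Xqzv blorp fnarglewib", vocab))  # 100.0 -- likely gibberish
```

In practice you would first run language detection to pick which vocabulary to score against, and then flag documents whose OOV percentage exceeds some threshold for OCR or manual review.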

 

Google was using this technique _as the strawman_ 10 years ago, and they're either using more modern techniques now or they're running OCR on every PDF.

See slides from my Activate talk for the details: [https://github.com/tballison/share/blob/master/slides/activate19/Activate2019_tika_tallison_20190911.pptx]



> Unintelligible text content from PDF file
> -----------------------------------------
>
>                 Key: TIKA-3005
>                 URL: https://issues.apache.org/jira/browse/TIKA-3005
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.22
>            Reporter: Jorge Spinsanti
>            Priority: Major
>         Attachments: file1.pdf, file2.pdf, file3.pdf, resume_4.pdf
>
>
> If I extract text content from the attachments, Tika doesn't fail, but the content is unintelligible



--
This message was sent by Atlassian Jira
(v8.3.4#803005)