[jira] [Commented] (TIKA-3258) Run OCR on PDFs with 'auto' mode as default in Tika 2.0.0

Hudson (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-3258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17316888#comment-17316888 ]

Shmuel Krakower commented on TIKA-3258:
---------------------------------------

Yes, indeed! (sorry for the delay)

OK, so as long as it is set to true, each embedded image will also be OCRed, regardless of whether the entire page was selected for full-page rendering (either by OCR_AND_TEXT_EXTRACTION or by AUTO). Right?

> Run OCR on PDFs with 'auto' mode as default in Tika 2.0.0
> ---------------------------------------------------------
>
>                 Key: TIKA-3258
>                 URL: https://issues.apache.org/jira/browse/TIKA-3258
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Major
>             Fix For: 2.0.0
>
>
> In Tika 1.x, we currently have a fiddly mess: users have to configure OCR of PDFs themselves; it doesn't just work out of the box.  We did this initially because of concerns (well, the reality) of crazy resource consumption for some PDFs, which can have thousands of images per page that are stitched together to make a reasonable composite.
> Since then, we've made two additions.  First, we've added option 2, which renders each page and then runs OCR on that composite image rather than on each inline image, so we only call tesseract once per page.  Second, we've added an 'auto' mode that runs OCR only on pages that didn't have much text extracted.  While there is plenty of room for improvement in the 'auto' heuristic, I think we should move to running OCR automatically on PDFs as the default in 2.0.0.
> Under this proposal, users will now have to disable OCR if they have tesseract installed but don't want to run it on PDFs.
> This will be a breaking change, and we'll make sure to document it early and often in the "Breaking Changes" sections of the readme.txt.
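
For illustration only, the 'auto' idea described above (run OCR only on pages that didn't have much text extracted, rendering the page once rather than OCRing each inline image) might be sketched roughly as follows. This is not Tika's code: the class name, the method names, and the character threshold are all hypothetical, and the real heuristic may look quite different.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a per-page "auto" OCR decision; not Tika's implementation.
final class AutoOcrSketch {

    // Assumed threshold: pages with fewer extracted characters than this get OCRed.
    // The value is made up for this example.
    static final int MIN_CHARS_PER_PAGE = 10;

    // True when text extraction produced so little text that OCR seems worthwhile.
    static boolean shouldOcrPage(String extractedText) {
        String trimmed = extractedText == null ? "" : extractedText.trim();
        return trimmed.length() < MIN_CHARS_PER_PAGE;
    }

    // Zero-based indexes of the pages that would be rendered and OCRed,
    // one OCR call per selected page.
    static List<Integer> pagesToOcr(List<String> pageTexts) {
        List<Integer> selected = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            if (shouldOcrPage(pageTexts.get(i))) {
                selected.add(i);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList(
                "A full page of ordinary extracted text.", // enough text, no OCR
                "",                                        // nothing extracted
                "  7 ");                                   // only a stray page number
        System.out.println(pagesToOcr(pages)); // prints [1, 2]
    }
}
```

A simple character count keeps the decision cheap, but it also shows why the heuristic leaves room for improvement: a scanned page with a short text header could slip past the threshold.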



--
This message was sent by Atlassian Jira
(v8.3.4#803005)