[jira] Commented: (TIKA-93) OCR support


Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746586#action_12746586 ]

Jukka Zitting commented on TIKA-93:

> are there any updates regarding this issue?

Not really. I've done some simple tests with ExternalParser invoking Tesseract and OCRopus, but neither is really suited for simple out-of-the-box integration.

I also tried the commercial Asprise OCR SDK (http://asprise.com/product/ocr/index.php?lang=java), which was much easier to set up and get reasonable results from, but obviously it's something that we can't use in an Apache project.

If someone wants to help with this, the first step would be to come up with reasonably simple steps to get a liberally licensed OCR engine like OCRopus installed and configured so that you can invoke it using a simple command line like "ocr image.gif" and get the extracted text on the standard output. It should work for at least a few simple test cases. Note that this work should be contributed back to the upstream project.

Once we have something like that, we can move forward with integrating it into Tika.
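
To make the target concrete, here is a rough sketch of how a caller might consume an "ocr image.gif" style command from Java. It is only an illustration under stated assumptions: the "ocr" executable is the hypothetical wrapper described above (it does not exist yet), and the OcrCommand class and its method names are made up for this example rather than taken from Tika.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper, not part of Tika: runs an external command and
// returns whatever the command wrote to its standard output.
public class OcrCommand {

    public static String run(List<String> command)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Pass the tool's diagnostics through to our own stderr so they
        // are not mixed into the extracted text.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();
        // The tool is not expected to read anything from stdin.
        process.getOutputStream().close();
        // Read all of stdout before waiting, so a large output cannot
        // block the child process on a full pipe. UTF-8 is an assumption.
        String text = new String(
                process.getInputStream().readAllBytes(),
                StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException(
                    "Command " + command + " failed with exit code " + exitCode);
        }
        return text;
    }

    // Assumes a wrapper named "ocr" is on the PATH and prints the
    // recognized text of the given image to standard output.
    public static String extractText(String imagePath)
            throws IOException, InterruptedException {
        return run(List.of("ocr", imagePath));
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java OcrCommand <image-file>");
            return;
        }
        System.out.print(extractText(args[0]));
    }
}
```

The point of keeping the contract this small (one argument in, plain text on standard output, non-zero exit code on failure) is that the calling side stays trivial regardless of which OCR engine ends up behind the wrapper.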

> OCR support
> -----------
>                 Key: TIKA-93
>                 URL: https://issues.apache.org/jira/browse/TIKA-93
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Priority: Minor
> I don't know of any decent open source pure Java OCR libraries, but there are command line OCR tools like Tesseract (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to extract text content (where available) from image files.
