If someone wants to help with this, the first step would be to come up with reasonably simple steps to get a liberally licensed OCR engine like OCRopus installed and configured so that you can invoke it using a simple command line like "ocr image.gif" and get the extracted text on the standard output. It should work for at least a few simple test cases. Note that this work should be contributed back to the upstream project.
Once we have something like that, we can move forward with integrating it into Tika.
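To make the idea concrete, here is a rough sketch of how the Tika side could call such a command once it exists. It assumes a wrapper named "ocr" is on the PATH and prints the recognized text to standard output as UTF-8. Both the "ocr" command and the ExternalOcr class are hypothetical names used for illustration only; nothing here is an existing Tika or OCRopus API.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch only: runs an external OCR command and returns whatever it
 * writes to standard output. The command name is an assumption.
 */
public class ExternalOcr {

    // Command to run, without the image argument, e.g. ["ocr"] (hypothetical).
    private final List<String> baseCommand;

    public ExternalOcr(List<String> baseCommand) {
        this.baseCommand = new ArrayList<>(baseCommand);
    }

    /** Invokes "<command> <image path>" and returns the captured stdout. */
    public String extractText(File image) throws IOException, InterruptedException {
        List<String> command = new ArrayList<>(baseCommand);
        command.add(image.getPath());

        ProcessBuilder builder = new ProcessBuilder(command);
        // Pass the tool's diagnostics through to our stderr so an unread
        // error pipe cannot fill up and block the child process.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);

        Process process = builder.start();
        // The tool is not expected to read anything from stdin.
        process.getOutputStream().close();

        String text;
        try (InputStream out = process.getInputStream()) {
            // Assumes the tool emits UTF-8 text.
            text = new String(out.readAllBytes(), StandardCharsets.UTF_8);
        }

        int status = process.waitFor();
        if (status != 0) {
            throw new IOException("OCR command exited with status " + status);
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: ExternalOcr <image file>");
            return;
        }
        ExternalOcr ocr = new ExternalOcr(List.of("ocr"));
        System.out.print(ocr.extractText(new File(args[0])));
    }
}
```

Taking the command as a list keeps the engine swappable, so the same code could drive a different command line tool if its arguments fit the same "command image" shape. A real parser would also need a timeout and some handling for the case where the tool is not installed.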
> OCR support
> Key: TIKA-93
> URL: https://issues.apache.org/jira/browse/TIKA-93
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Reporter: Jukka Zitting
> Priority: Minor
> I don't know of any decent open source pure Java OCR libraries, but there are command line OCR tools like Tesseract (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to extract text content (where available) from image files.