[jira] [Commented] (TIKA-2293) Tess4jOCRParser - A simpler Java version of TesseractOCRParser

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15933957#comment-15933957 ]

Tim Allison commented on TIKA-2293:

[~ThejanWijesinghe], thank you for sharing this and running some comparisons with our current Tesseract parser.

I really like:
 1. The notion that users don't have to figure out how to install Tesseract on their system.  "Simple" plug and play.
 2. The theoretical simplicity of not having to create the temp files and make a system call to python and tesseract etc.
 3. The notion of being able to use some of the lower-level features of Tesseract that aren't available from the commandline...but I only have a vague notion of these...what features from the underlying Tesseract do we need that aren't available from the commandline?
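
For reference on point 2, the out-of-process pattern looks roughly like the sketch below. This is not the actual TesseractOCRParser code; the class and method names are hypothetical, it skips the python and imagemagick preprocessing entirely, and it assumes a {{tesseract}} binary on the PATH that is invoked as {{tesseract <input> <outputbase> -l <lang>}} and writes {{<outputbase>.txt}}.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the "temp file + system call" approach.
class OutOfProcessOcrSketch {

    // Builds: tesseract <input> <outputBase> -l <lang>
    static List<String> buildCommand(Path input, Path outputBase, String lang) {
        return Arrays.asList("tesseract", input.toString(), outputBase.toString(), "-l", lang);
    }

    // Runs tesseract on an image file and returns the recognized text.
    static String ocr(Path image, String lang) throws IOException, InterruptedException {
        // tesseract appends ".txt" to the output base name it is given.
        Path outputBase = Files.createTempFile("ocr-out-", "");
        Path textFile = Paths.get(outputBase.toString() + ".txt");
        try {
            Process process = new ProcessBuilder(buildCommand(image, outputBase, lang))
                    .redirectErrorStream(true)
                    // Discard console output so the child cannot block on a full pipe.
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            if (process.waitFor() != 0) {
                throw new IOException("tesseract exited with code " + process.exitValue());
            }
            return new String(Files.readAllBytes(textFile), StandardCharsets.UTF_8);
        } finally {
            // The temp files are the bookkeeping an in-process API would avoid.
            Files.deleteIfExists(textFile);
            Files.deleteIfExists(outputBase);
        }
    }
}
```

Even in this stripped-down form there are two temp files, an exit-code check, and stream handling to get right, which is the overhead an in-process binding would remove.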

I'm concerned about:
 1a. The LGPL license on ghost4j means that we can't bundle it with our jars. Am I reading the ghost4j license correctly?  If so, and if we don't include ghost4j, what will happen?  Is it only used for PDFs...so we'd be on our own for those, right?
 1b. There's another LGPL license on leptonica4j's rococoa dependency.  What happens if we can't bundle that?
 2.  The general notion of packaging native libs.  I undid that choice with our sqlite parser and required that users add that jar to their classpath.
 3.  We'd be adding 38 MB to the tika-app and tika-server jars.  That's just for the Windows dlls, right? Do I understand correctly that Linux users would be on their own to install {{libtesseract.so}}?
 4. tess4j comes with the English language pack.  Users who want other languages would still have to download the other language packs and install them in the tessdata directory, which cuts into the appeal of "runs tesseract out of the box".
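
If we went this route, point 4 suggests we'd at least want a clear error when a requested language isn't installed. A minimal sketch, assuming language packs are files named {{<lang>.traineddata}} in the data directory and that multiple languages are requested with the {{eng+fra}} syntax; the class and method names are hypothetical:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reports which requested languages have no pack installed.
class LanguagePackCheck {

    // languages uses the "eng+fra" form; returns the codes with no
    // matching <code>.traineddata file in tessdataDir.
    static List<String> missingLanguages(Path tessdataDir, String languages) {
        List<String> missing = new ArrayList<>();
        for (String lang : languages.split("\\+")) {
            Path pack = tessdataDir.resolve(lang + ".traineddata");
            if (!Files.isRegularFile(pack)) {
                missing.add(lang);
            }
        }
        return missing;
    }
}
```

With only the bundled English pack present, a request for {{eng+fra}} would report {{fra}} as missing, and the parser could tell the user what to install instead of failing inside the native library.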

>  Tess4jOCRParser - A simpler Java version of TesseractOCRParser
> ---------------------------------------------------------------
>                 Key: TIKA-2293
>                 URL: https://issues.apache.org/jira/browse/TIKA-2293
>             Project: Tika
>          Issue Type: Improvement
>          Components: ocr
>            Reporter: Thejan Wijesinghe
>             Fix For: 1.15
> Right now, TesseractOCRParser calls tesseract and imagemagick from the command line. The intention of the new parser, "Tess4jOCRParser", is to use the Tess4J API instead of executing tesseract out of process via Runtime.exec.

This message was sent by Atlassian JIRA