[jira] [Commented] (TIKA-93) OCR support

Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012810#comment-14012810 ]

Luis Filipe Nassif commented on TIKA-93:
----------------------------------------

Thank you very much [~tpalsulich] for including unit tests! We could also include tests for normal images (not embedded).

There is a simple timeout control that throws a TikaException with a specific message if the timeout is reached. OCR runs only when a TesseractOCRConfig object has been set in the ParseContext; the point of requiring this is to avoid affecting users who do not want OCR, precisely because it can take seconds or even minutes. That way TesseractOCRParser can be included in Tika's parser list by default with no problem. We could also include a warning about OCR slowness in the class description.
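
A minimal sketch of the two ideas in the paragraph above, written in plain Java rather than against Tika's real classes: OCR only runs when the caller has placed a config object in the context, and the external process is killed and an exception raised if it exceeds a time limit. `OcrConfig`, `OcrRunner`, `OcrTimeoutException`, and the class-keyed map are hypothetical stand-ins for TesseractOCRConfig, TesseractOCRParser, TikaException, and ParseContext; the actual patch may be structured differently.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for TesseractOCRConfig: its presence in the context is the opt-in. */
class OcrConfig {
    final long timeoutSeconds;

    OcrConfig(long timeoutSeconds) {
        this.timeoutSeconds = timeoutSeconds;
    }
}

/** Hypothetical unchecked stand-in for the TikaException thrown on timeout. */
class OcrTimeoutException extends RuntimeException {
    OcrTimeoutException(String message) {
        super(message);
    }
}

/** Hypothetical stand-in for the parser's opt-in and timeout logic. */
class OcrRunner {
    /** Returns false (OCR skipped) when no OcrConfig was placed in the context. */
    static boolean runIfConfigured(Map<Class<?>, Object> context, List<String> command) {
        OcrConfig config = (OcrConfig) context.get(OcrConfig.class);
        if (config == null) {
            return false; // caller did not opt in, so no process is started
        }
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            if (!process.waitFor(config.timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly(); // do not leave a runaway OCR process behind
                throw new OcrTimeoutException(
                        "OCR timed out after " + config.timeoutSeconds + " seconds");
            }
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e); // e.g. the executable could not be started
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for OCR", e);
        }
    }
}
```

With an empty context, `runIfConfigured` returns `false` without starting any process, which is the "do not affect users who do not want OCR" behaviour.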

I have no idea how to include Tesseract in the sources. Maybe the Tika committers can help with this?

> OCR support
> -----------
>
>                 Key: TIKA-93
>                 URL: https://issues.apache.org/jira/browse/TIKA-93
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.6
>
>         Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, TesseractOCRParser.patch, TesseractOCRParser.patch, TesseractOCR_Tyler.patch, testOCR.docx, testOCR.pdf, testOCR.pptx
>
>
> I don't know of any decent open source pure Java OCR libraries, but there are command line OCR tools like Tesseract (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to extract text content (where available) from image files.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Re: [jira] [Commented] (TIKA-93) OCR support

Oleg Tikhonov-2
Guys,
Tesseract is itself a separate project, written in C/C++, and has to be
compiled separately for each platform.
Personally, I would make installing it a requirement for those who want to
work with Tesseract. I'm not sure that putting Tesseract in the sources is the
right way to go.

As for how good Tesseract is: that depends at least on the trained data and
the quality of the input images. There is no simple answer.

BR,
Oleg


On Thu, May 29, 2014 at 11:07 PM, Luis Filipe Nassif (JIRA) <[hidden email]> wrote:

>
>     [
> https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012810#comment-14012810]
>
> Luis Filipe Nassif commented on TIKA-93:
> ----------------------------------------
>
> Thank you very much [~tpalsulich] for including unit tests! We could also
> include tests for normal images (not embedded).
>
> There is a simple timeout control that throws a TikaException with a
> specific message if the timeout is reached. OCR runs only when a
> TesseractOCRConfig object has been set in the ParseContext; the point of
> requiring this is to avoid affecting users who do not want OCR, precisely
> because it can take seconds or even minutes. That way TesseractOCRParser
> can be included in Tika's parser list by default with no problem. We could
> also include a warning about OCR slowness in the class description.
>
> I have no idea how to include Tesseract in the sources. Maybe the Tika
> committers can help with this?

Re: [jira] [Commented] (TIKA-93) OCR support

Tyler Palsulich
Hi,

> Tesseract is itself a separate project, written in C/C++, and has to be
> compiled separately for each platform.
Good point! We should figure out a way to fail gracefully when Tesseract
isn't installed, right? Unless there is, in fact, some pure Java OCR
implementation.

Another thought: we should add OCR as a command line option. There would be
one option for extracting images and one for running OCR (which always
enables image extraction).
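
A small sketch of that dependency between the two options, where asking for OCR implies image extraction. The flag names `--extract-images` and `--ocr` and the class `CliOptions` are made up for illustration; they are not existing Tika command line options.

```java
/** Hypothetical holder for the two proposed command line switches. */
class CliOptions {
    final boolean extractImages;
    final boolean runOcr;

    private CliOptions(boolean extractImages, boolean runOcr) {
        this.extractImages = extractImages;
        this.runOcr = runOcr;
    }

    /** Parses the (made-up) flags; --ocr always switches image extraction on. */
    static CliOptions parse(String... args) {
        boolean extract = false;
        boolean ocr = false;
        for (String arg : args) {
            if (arg.equals("--extract-images")) {
                extract = true;
            } else if (arg.equals("--ocr")) {
                ocr = true;
            }
        }
        // OCR needs the images, so it forces extraction even if not requested.
        return new CliOptions(extract || ocr, ocr);
    }
}
```

Here `CliOptions.parse("--ocr")` yields both `extractImages` and `runOcr` set, while `CliOptions.parse("--extract-images")` leaves OCR off.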

Tyler


On Thu, May 29, 2014 at 1:26 PM, Oleg Tikhonov <[hidden email]> wrote:

> Guys,
> Tesseract is itself a separate project, written in C/C++, and has to be
> compiled separately for each platform.
> Personally, I would make installing it a requirement for those who want to
> work with Tesseract. I'm not sure that putting Tesseract in the sources is
> the right way to go.
>
> As for how good Tesseract is: that depends at least on the trained data and
> the quality of the input images. There is no simple answer.
>
> BR,
> Oleg

Re: [jira] [Commented] (TIKA-93) OCR support

Nick Burch-2
On Mon, 2 Jun 2014, Tyler Palsulich wrote:
> Good point! We should figure out a way to fail gracefully when Tesseract
> isn't installed, right? Unless there is, in fact, some pure Java OCR
> implementation.

I believe the standard policy is that a parser which can't work should
either throw an exception during construction or return an empty set of
types from getSupportedTypes. Either one lets it be gracefully
skipped over.
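
Both options can be sketched as follows. This uses a simplified, hypothetical `SimpleParser` interface with string media types; Tika's real Parser interface declares getSupportedTypes with a ParseContext argument and MediaType values, and the class names and the `toolAvailable` flag here are illustrative only.

```java
import java.util.Collections;
import java.util.Set;

/** Simplified, hypothetical stand-in for Tika's Parser interface. */
interface SimpleParser {
    Set<String> getSupportedTypes();
}

/** Option 1: refuse to be constructed when the external tool is missing. */
class StrictOcrParser implements SimpleParser {
    StrictOcrParser(boolean toolAvailable) {
        if (!toolAvailable) {
            throw new IllegalStateException("OCR tool is not available");
        }
    }

    @Override
    public Set<String> getSupportedTypes() {
        return Set.of("image/png", "image/jpeg", "image/tiff");
    }
}

/** Option 2: construct normally, but advertise no types so the parser is never chosen. */
class LenientOcrParser implements SimpleParser {
    private final boolean toolAvailable;

    LenientOcrParser(boolean toolAvailable) {
        this.toolAvailable = toolAvailable;
    }

    @Override
    public Set<String> getSupportedTypes() {
        if (!toolAvailable) {
            return Collections.emptySet(); // nothing supported, so nothing is routed here
        }
        return Set.of("image/png", "image/jpeg", "image/tiff");
    }
}
```

In either case the code that assembles the parser list can skip the OCR parser without the caller seeing a failure at parse time.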

Nick

Re: [jira] [Commented] (TIKA-93) OCR support

Tyler Palsulich
>
> I believe the standard policy is that a parser which can't work should
> either throw an exception during construction or return an empty set of
> types from getSupportedTypes. Either one lets it be gracefully
> skipped over.


How do we know when Tesseract is installed? There isn't an easy,
cross-platform Java method to check whether a given program is installed.
Maybe we make the user specify the install location in some config file? Then
we don't have to worry about whether Tesseract is on the path.
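
As a rough sketch of the config-file idea: read an optional install directory from a properties object and fall back to the bare command name (leaving lookup to the PATH) when it is not set. The key `tesseract.path` and the class `OcrExecutableLocator` are hypothetical, not an existing Tika setting.

```java
import java.nio.file.Paths;
import java.util.Properties;

/** Hypothetical helper that decides which executable string to launch. */
class OcrExecutableLocator {
    /** Made-up property key naming the directory that contains the tesseract binary. */
    static final String PATH_KEY = "tesseract.path";

    static String resolve(Properties config) {
        String dir = config.getProperty(PATH_KEY, "").trim();
        if (dir.isEmpty()) {
            return "tesseract"; // not configured: rely on the PATH
        }
        return Paths.get(dir, "tesseract").toString();
    }
}
```

The `Properties` object could be loaded from a file with `Properties.load`; the sketch leaves that out to stay short.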

Re: [jira] [Commented] (TIKA-93) OCR support

Nick Burch-2
On Mon, 2 Jun 2014, Tyler Palsulich wrote:
> How do we know when Tesseract is installed? There isn't an easy,
> cross-platform Java method to check whether a given program is installed.
> Maybe we make the user specify the install location in some config
> file? Then we don't have to worry about whether Tesseract is on the
> path.

The same way that Ray's stuff checks to see if exiftool is installed or
not. I'd suggest you crib off all the work he has already done on calling
out to native programs from Tika!
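
One common way to do this kind of check is simply to try starting the program and see whether that succeeds. The sketch below is plain Java and is not copied from the exiftool code in Tika; the class name `ExternalToolCheck` and the 10-second limit are arbitrary choices for illustration.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: probes for an external program by trying to run it. */
class ExternalToolCheck {
    /**
     * Returns true if the command could be started and exited within the time limit.
     * The exit code is deliberately ignored; a stricter check could inspect it.
     */
    static boolean isAvailable(String... command) {
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            if (!process.waitFor(10, TimeUnit.SECONDS)) {
                process.destroyForcibly(); // hung probe counts as "not usable"
                return false;
            }
            return true;
        } catch (IOException e) {
            return false; // typically: executable not found or not runnable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A call such as `ExternalToolCheck.isAvailable("tesseract", "--version")` would then act as the probe, assuming the installed Tesseract accepts that argument.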

Nick

Re: [jira] [Commented] (TIKA-93) OCR support

Mattmann, Chris A (3010)
+1, I talked to Tyler a little while ago and told him to grep for
exiftool in Tika :) He will scope it out.




-----Original Message-----
From: Nick Burch <[hidden email]>
Reply-To: "[hidden email]" <[hidden email]>
Date: Monday, June 2, 2014 8:36 AM
To: "[hidden email]" <[hidden email]>
Subject: Re: [jira] [Commented] (TIKA-93) OCR support

>On Mon, 2 Jun 2014, Tyler Palsulich wrote:
>> How do we know when Tesseract is installed? There isn't an easy,
>> cross-platform Java method to check whether a given program is installed.
>> Maybe we make the user specify the install location in some config
>> file? Then we don't have to worry about whether Tesseract is on the
>> path.
>
>The same way that Ray's stuff checks to see if exiftool is installed or
>not. I'd suggest you crib off all the work he has already done on calling
>out to native programs from Tika!
>
>Nick