Solr OCR Support

Solr OCR Support

kamaci
Hi All,

I want to index images, and PDF documents that contain images, into Solr. I am
testing with Solr 6.3.0.

I've installed Tesseract on my computer (a Mac) and verified that it
extracts text from an image correctly.

When I index an image into Solr, however, the document has no content. As far
as I know, I shouldn't need to do anything else to integrate Tesseract with Solr.
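
One thing worth ruling out here: as far as I know, Tika runs Tesseract by launching the `tesseract` executable as an external process, so it has to be on the PATH of the JVM that runs Solr, not just on the PATH of an interactive shell. The sketch below is only an illustration of that check. The class and method names are made up for this example, and it assumes the installed Tesseract accepts `--version`.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

class TesseractCheck {
    /** Returns true if `command --version` can be started and exits with status 0. */
    static boolean isRunnable(String command) {
        try {
            Process p = new ProcessBuilder(command, "--version")
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            // Give up if the process hangs for more than ten seconds.
            if (!p.waitFor(10, TimeUnit.SECONDS)) {
                p.destroyForcibly();
                return false;
            }
            return p.exitValue() == 0;
        } catch (IOException e) {
            // Thrown when the executable cannot be started, e.g. not on this process's PATH.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("PATH seen by the JVM: " + System.getenv("PATH"));
        System.out.println("tesseract runnable: " + isRunnable("tesseract"));
    }
}
```

Running this under the same user and environment that start Solr tells you more than running it from a terminal, since the two can see different PATH values.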

I've checked these threads, but they did not help:

http://lucene.472066.n3.nabble.com/TIKA-OCR-not-working-td4201834.html
http://lucene.472066.n3.nabble.com/Fwd-configuring-Solr-with-Tesseract-td4361908.html

My question is: how can I get OCR working with Solr?

Kind Regards,
Furkan KAMACI

Re: Solr OCR Support

Tim Allison
OCR'ing of PDFs is fiddly at the moment because of Tika, not Solr!  We
have an open ticket to make it "just work", but we aren't there yet
(TIKA-2749).

You have to tell Tika how you want to process images from PDFs via the
tika-config.xml file.

You've seen this link in the links you mentioned:
https://wiki.apache.org/tika/TikaOCR

This one is key for PDFs:
https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29#OCR

In reply to Furkan KAMACI's post of Fri, Nov 2, 2018, 10:30 AM.

RE: Solr OCR Support

Davis, Daniel (NIH/NLM) [C]
I think that you also have to process a PDF pretty deeply to decide whether it should be OCR'd. I have worked on projects where all of the PDFs are really like faxes: the images are encoded in JBIG2 black and white or similar, and there is really one image per page and no text. I have also worked on projects where the PDFs really are unstructured data, but even there, if a PDF has one image per page and no text, it should be OCR'd.

I've had problems, not with Tesseract, but even with the Nuance OCR OEM libraries, where text was missed because one image held the top half of the letters and the image on the next line held the bottom half. I don't mean to ding Nuance (or Tesseract); I just wish to point out that choosing what to OCR is important, because OCR works well only when it has good input.
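
The "one image per page and no text" rule above could be sketched roughly as follows. This is a minimal illustration, not code from any of the projects mentioned. `PageStats` is a hypothetical type, and the per-page image and character counts are assumed to come from whatever PDF library you already use.

```java
import java.util.List;

class OcrHeuristic {
    /** Per-page facts gathered by a PDF library of your choice (hypothetical type). */
    static final class PageStats {
        final int imageCount;
        final int textChars;

        PageStats(int imageCount, int textChars) {
            this.imageCount = imageCount;
            this.textChars = textChars;
        }
    }

    /** True when every page is exactly one image with no extractable text. */
    static boolean looksScanned(List<PageStats> pages) {
        if (pages.isEmpty()) {
            return false; // nothing to judge
        }
        for (PageStats p : pages) {
            if (p.imageCount != 1 || p.textChars > 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<PageStats> fax = List.of(new PageStats(1, 0), new PageStats(1, 0));
        List<PageStats> report = List.of(new PageStats(2, 1800), new PageStats(0, 2100));
        System.out.println(looksScanned(fax));    // prints true
        System.out.println(looksScanned(report)); // prints false
    }
}
```

A rule this strict will not catch the split-letter case described above, where a page is built from several image strips. It only separates the obvious fax-style files from everything else.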

In reply to Tim Allison's post of Friday, November 2, 2018, 11:03 AM.

Re: Solr OCR Support

Tim Allison
+1 Thank you, Daniel.  If you have any interest in helping out on
TIKA-2749, please join the fun. :D

In reply to the post by Davis, Daniel (NIH/NLM) [C] of Fri, Nov 2, 2018, 12:12 PM.

RE: Solr OCR Support

Phil Scadden
In reply to this post by kamaci
I would strongly consider running OCR offline, BEFORE loading the documents into Solr. The advantage is that you convert each scanned PDF into a searchable PDF. Consider someone who uses Solr and finds a document that matches their search criteria. If the OCR happened only at index time, then once they retrieve the document they will discover that it has not been OCRed and that they cannot run a text search within it. If the document that you are feeding Solr is large, this is a major pain. Setting up Tesseract (or whatever engine; Tesseract involves a bit of a tool chain) to OCR and save as searchable PDF means you can provide a much more useful document as the result of a Solr search. Feed that searchable PDF to SolrJ with OCR turned off.

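    // Assumes an existing Tika Parser `parser` and ParseContext `context`
    // (org.apache.tika.parser); PDFParserConfig is in org.apache.tika.parser.pdf.
    PDFParserConfig pdfConfig = new PDFParserConfig();
    pdfConfig.setExtractInlineImages(false);  // do not pull embedded images out for OCR
    pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.NO_OCR);  // use the text layer only
    context.set(PDFParserConfig.class, pdfConfig);
    context.set(Parser.class, parser);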
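
For the offline step itself, a rough sketch of driving Tesseract from Java might look like the following. It relies on Tesseract's `pdf` output config, which writes a PDF with a text layer to `<outputBase>.pdf`. The class name and file names are hypothetical, and the input is assumed to be a page image: Tesseract does not read PDFs directly, so scanned PDFs would first need to be rasterized by another tool, which is part of the tool chain mentioned above.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

class OfflineOcr {
    /** Builds: tesseract <image> <outputBase> -l <lang> pdf  (writes <outputBase>.pdf). */
    static List<String> searchablePdfCommand(Path image, Path outputBase, String lang) {
        return List.of("tesseract", image.toString(), outputBase.toString(), "-l", lang, "pdf");
    }

    /** Runs the command with inherited stdio and returns Tesseract's exit code. */
    static int run(List<String> command) throws IOException, InterruptedException {
        return new ProcessBuilder(command).inheritIO().start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical file names; the input must be a page image, not a PDF.
        List<String> cmd = searchablePdfCommand(Path.of("page-001.tif"), Path.of("page-001"), "eng");
        System.out.println(String.join(" ", cmd));
        // run(cmd);  // uncomment once tesseract is installed and the image exists
    }
}
```

Keeping command construction separate from execution makes it easy to log or dry-run a batch before spending hours on OCR.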


RE: Solr OCR Support

Terry Steichen
+1
My experience is that you can't easily tell ahead of time whether your PDF is searchable or not. If it isn't, you may never even retrieve it, because there's no text to index. Also, if you blindly OCR a file that has already been OCR'd, it can create a mess. Most higher-end PDF editors have a batch mode for OCR processing, if that works better for you.

In reply to Phil Scadden's post of November 4, 2018, 5:20:41 PM EST.