Support Tesseract in Apache Solr

Karan Jain
Hi All,

Solr 7.6.0 is running on my local machine. I installed Tesseract with the
following steps:
yum install tesseract
echo export PATH=$PATH:/usr/share/tesseract >>~/.bash_profile
echo export TESSDATA_PREFIX=/usr/share/tesseract >>~/.bash_profile

The deployed Solr now supports Tesseract. I searched for TESSDATA_PREFIX
in https://github.com/apache/lucene-solr and found no reference to it, so
I do not understand how Solr found out about the installed Tesseract.
If possible, please point me to the specific Java class in Solr.

Thanks for your time,
Best,
Karan

Re: Support Tesseract in Apache Solr

Jörn Franke
Honestly, I would not run Tesseract on the same server as Solr. It takes a lot of resources and may negatively impact Solr. Just write a small program using Tika + Tesseract that runs on a different server or container and posts the results to Solr.

About your question: probably Tika (a dependency of Solr) detected it or, depending on your format, PDFBox (which Tika uses).
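
One plausible mechanism for that detection, sketched in Python rather than Tika's Java: a library that wraps a command-line tool can locate it simply by searching the directories listed in PATH. The `find_executable` helper is hypothetical and only illustrates the lookup; it is not Tika's actual code.

```python
import shutil


def find_executable(name, path=None):
    """Return the full path of `name` if it is found on PATH, else None.

    `path` overrides the search path; by default the PATH of the
    current process is used.
    """
    return shutil.which(name, path=path)


if __name__ == "__main__":
    location = find_executable("tesseract")
    if location is None:
        print("tesseract is not on PATH")
    else:
        print(f"tesseract found at {location}")
```

If the lookup works like this, no Tesseract-specific setting is needed on the Solr side: installing the binary into a directory that is already on PATH would be enough.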

(In reply to Karan Jain <[hidden email]>, 11.02.2020, 19:15.)

Re: Support Tesseract in Apache Solr

Edward Ribeiro
I second Jörn: don't deploy Tesseract + Tika on the same server as Solr.
Tesseract, especially with OCR enabled, will drain machine resources
that could be used for indexing and searching. In addition, any
malformed PDF could potentially shut down the Solr server. Your best bet is
to run tika-server + Tesseract on a dedicated server or container, use it
to extract the text (including OCR) from the documents, and then send that
text to Solr.
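
A minimal sketch of that setup, assuming tika-server listens on its default port 9998 and Solr has a core named "docs" with a text field "content_txt" (the core name, the field name, and all function names are assumptions for illustration):

```python
import json
import urllib.request

# Assumed endpoints; adjust host, port, and core name to your setup.
TIKA_URL = "http://localhost:9998/tika"
SOLR_UPDATE_URL = "http://localhost:8983/solr/docs/update?commit=true"


def build_tika_request(data, tika_url=TIKA_URL):
    """PUT the raw document bytes to tika-server and ask for plain text."""
    return urllib.request.Request(
        tika_url,
        data=data,
        method="PUT",
        headers={
            "Accept": "text/plain",
            # Generic type so that Tika detects the real format itself.
            "Content-Type": "application/octet-stream",
        },
    )


def build_solr_request(doc_id, text, solr_url=SOLR_UPDATE_URL):
    """POST a single document to Solr's JSON update handler."""
    body = json.dumps([{"id": doc_id, "content_txt": text}]).encode("utf-8")
    return urllib.request.Request(
        solr_url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def index_file(path):
    """Extract text with tika-server, then send it to Solr."""
    with open(path, "rb") as f:
        raw = f.read()
    with urllib.request.urlopen(build_tika_request(raw)) as resp:
        text = resp.read().decode("utf-8")
    with urllib.request.urlopen(build_solr_request(path, text)) as resp:
        return resp.status
```

Calling `index_file("scan.pdf")` needs both services running. The point of the split is that the OCR work, and any crash caused by a bad file, stays on the Tika side.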

But to answer your question: Solr embeds Tika, which can be configured to
use Tesseract. It is Tika that knows about Tesseract. See
https://cwiki.apache.org/confluence/display/TIKA/TikaOCR for more
information.
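
As for why TESSDATA_PREFIX does not show up in the Solr sources: a variable like this is normally read by the external program itself, and a child process inherits the environment of whatever launches it. The sketch below demonstrates only that inheritance, with a Python child process standing in for the tesseract binary; `child_sees` is a hypothetical helper, not part of Solr or Tika.

```python
import os
import subprocess
import sys


def child_sees(variable, value):
    """Start a child process with `variable` set and return what it reads."""
    env = dict(os.environ, **{variable: value})
    # The child prints the value it finds in its own environment.
    probe = "import os, sys; sys.stdout.write(os.environ.get(sys.argv[1], ''))"
    result = subprocess.run(
        [sys.executable, "-c", probe, variable],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    print(child_sees("TESSDATA_PREFIX", "/usr/share/tesseract"))
```

So the parent (here Solr with its embedded Tika) never has to mention the variable; it only has to be set in the environment Solr was started from.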

Best regards,
Edward

(In reply to Jörn Franke <[hidden email]>, Tue, Feb 11, 2020, 3:26 PM.)