regarding Extracting text from Images

regarding Extracting text from Images

sureshpendap
Hello,
I am reading the Solr documentation about integration with Tika and the Solr
Cell framework here:
https://lucene.apache.org/solr/guide/6_6/uploading-data-with-solr-cell-using-apache-tika.html

I would like to know whether the Solr Cell framework can also be used to
extract text from image files.

Regards
Suresh

Re: regarding Extracting text from Images

Alexandre Rafalovitch
I believe Tika, which powers this, can do so with extra libraries (Tesseract?),
but Solr does not bundle those extras.

In any case, you may want to run Tika externally so that the
conversion/extraction process does not become a burden on Solr itself.

Regards,
     Alex


Re: regarding Extracting text from Images

sureshpendap
Hi Alex,
Thanks for your reply. How do we integrate Tesseract with Solr? Do we have
to implement a custom update processor or extend the
ExtractingRequestProcessor?

Regards
Suresh


Re: regarding Extracting text from Images

Alexandre Rafalovitch
Again, I think you are best off doing it outside of Solr.

But even if you want to get it to work in Solr, I think you should start by
getting it to work directly in Tika. Then get the missing libraries and
configuration into Solr.

Regards,
    Alex
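
Editor's sketch for the "get it working directly in Tika first" step: before touching Solr, it may help to confirm that the Tesseract binary is visible to the account that will run Tika. This is only a minimal Python sketch, not something shipped with Tika or Solr. The helper names `find_tesseract` and `tesseract_version` are invented for illustration, and it assumes Tika looks for Tesseract on the PATH unless a path is configured explicitly.

```python
import shutil
import subprocess
from typing import Optional


def find_tesseract(executable: str = "tesseract") -> Optional[str]:
    """Return the full path of the Tesseract binary on PATH, or None if absent."""
    return shutil.which(executable)


def tesseract_version(executable: str = "tesseract") -> Optional[str]:
    """Return the first line printed by `tesseract --version`, or None if it cannot run."""
    path = find_tesseract(executable)
    if path is None:
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    # Depending on the release, the version banner may land on stdout or stderr.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None
```

Running both helpers as the same operating-system user that starts Solr is the point of the check: a binary that is on your own PATH is not necessarily on the service account's PATH.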


Re: regarding Extracting text from Images

Erick Erickson
Here’s a blog about why and how to use Tika outside Solr (and an RDBMS too, but you can pull that part out pretty easily):
https://lucidworks.com/post/indexing-with-solrj/
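
Editor's sketch of the "extract outside Solr, then send finished docs" idea. The linked post uses SolrJ; what follows is a different and much smaller variant in Python that talks to a standalone Tika Server over HTTP and then posts plain JSON to Solr. The host names, the core name `docs`, and the field name `content_txt` are assumptions made for illustration, so adjust them to your own setup.

```python
import json
import urllib.request

TIKA_URL = "http://localhost:9998/tika"  # assumed Tika Server address
SOLR_UPDATE_URL = "http://localhost:8983/solr/docs/update?commit=true"  # assumed core "docs"


def build_tika_request(data: bytes) -> urllib.request.Request:
    """PUT the raw file bytes to Tika Server and ask for plain text back."""
    return urllib.request.Request(
        TIKA_URL, data=data, method="PUT", headers={"Accept": "text/plain"}
    )


def build_solr_request(doc_id: str, text: str) -> urllib.request.Request:
    """POST one already-extracted document to Solr's JSON update handler."""
    # "content_txt" is a hypothetical field name; use one defined in your schema.
    body = json.dumps([{"id": doc_id, "content_txt": text}]).encode("utf-8")
    return urllib.request.Request(
        SOLR_UPDATE_URL,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def index_file(path: str) -> None:
    """Extract text outside Solr, then send only the finished document to Solr."""
    with open(path, "rb") as f:
        data = f.read()
    with urllib.request.urlopen(build_tika_request(data)) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    with urllib.request.urlopen(build_solr_request(path, text)) as resp:
        resp.read()
```

With this split, Solr only ever receives small text documents, and a crash or slowdown in extraction stays on the extraction side.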





Re: regarding Extracting text from Images

Eric Pugh-4
Just to stir the pot on this topic, here is an article about why and how to use Tika inside of Solr:

https://opensourceconnections.com/blog/2019/10/24/it-s-okay-to-run-tika-inside-of-solr-if-and-only-if/


_______________________
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com <http://www.opensourceconnections.com/> | My Free/Busy <http://tinyurl.com/eric-cal>  
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>


Re: regarding Extracting text from Images

Edward Ribeiro
In reply to this post by sureshpendap
No. You should install tesseract-ocr on the same box as your Solr instance
and configure Solr so that the embedded Tika is able to use Tesseract to OCR
the images.

Best,
Edward


Re: regarding Extracting text from Images

Erick Erickson
I would do neither. I’d put it all on an external server and use _that_, then send
the finished docs to Solr.

The problem with putting this all on Solr is at least three-fold:
1> you’re talking heavy-duty work here to do the OCR, which takes away from the available resources for searching and indexing
2> any problems with either one will potentially blow up Solr
3> If you’re processing very many docs, you’ll have to parallelize somehow

Here’s the long form:
https://lucidworks.com/post/indexing-with-solrj/

Best,
Erick
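
Editor's sketch for point 3> above (parallelizing when there are many docs) on an external machine. It assumes the real work happens in some `extract` callable you supply, for example a call to a Tika server or to the tesseract CLI, so plain threads are enough to keep several extractions in flight. The function name `extract_all` and its parameters are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Tuple


def extract_all(
    paths: Iterable[str],
    extract: Callable[[str], str],
    workers: int = 4,
) -> Tuple[Dict[str, str], Dict[str, Exception]]:
    """Run `extract` over many files in parallel; collect results and failures separately."""
    results: Dict[str, str] = {}
    failures: Dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {path: pool.submit(extract, path) for path in paths}
        for path, future in futures.items():
            try:
                results[path] = future.result()
            except Exception as exc:  # one bad document must not stop the batch
                failures[path] = exc
    return results, failures
```

Keeping failures in their own dictionary also speaks to point 2>: a document that breaks the extractor is recorded and skipped, and Solr never sees it.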



Re: regarding Extracting text from Images

Jörn Franke
Maybe some additional considerations:
If you need to upgrade Solr, then eventually you need to reindex.
If you change or add fields, then you need to reindex.
Both are much faster if you have an external program that converts rich documents (PDF, Word, OCR) to text once, and you then use that text (or hypertext, if you need to keep headings etc.) for reindexing. This will save you a lot of time, especially for large collections.
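
Editor's sketch of the "convert once, reindex from text" idea. It caches the extracted text on disk, keyed by a hash of the file contents, so a later reindex reads the cached text and skips Tika/OCR. The function name `cached_text` and the cache layout are assumptions made for illustration, and `extract` stands for whatever extraction routine you already have.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_text(source: Path, cache_dir: Path, extract: Callable[[Path], str]) -> str:
    """Return extracted text for `source`, running `extract` only on a cache miss."""
    digest = hashlib.sha256(source.read_bytes()).hexdigest()
    cache_file = cache_dir / f"{digest}.txt"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    text = extract(source)  # the expensive Tika/OCR step
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(text, encoding="utf-8")
    return text
```

Hashing the content, not the file name, means a renamed file still hits the cache, while a file whose bytes changed is extracted again.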


Re: regarding Extracting text from Images

Retro-2
In reply to this post by Edward Ribeiro
Hello, can you please advise me how to configure Solr so that the embedded Tika
is able to use Tesseract to OCR images? I have installed the following
software:
Solr      - 7.4.0
Tesseract - 4.1.1-rc2-20-g01fb
Tika      - 1.18
Tesseract is installed in the following directory:
/usr/share/tesseract/4/tessdata/
echo $TESSDATA_PREFIX -> /usr/share/tesseract/4/tessdata/
tesseract -v
tesseract 4.1.1-rc2-20-g01fb
leptonica-1.76.0

The command "tesseract test.jpg test.txt" produces an accurate text file with
the OCRed content of test.jpg.
The current setup lets us index attachments such as structured text files
(txt, Word, PDF, etc.), but it does not react in any way to attachments such as
PNG or JPG. Nor does it work if they are uploaded directly to Solr through its
web interface.

The necessary modifications were made to the following files:
solrconfig.xml; TesseractOCRConfig.properties; parseContext.xml;
PDFParser.properties.

I would appreciate it if someone could help me with this configuration.
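
Editor's sketch for narrowing down a setup like this: ask Solr Cell to extract an image without indexing it, and look at whether any OCR text comes back. It assumes the extracting handler is registered at `/update/extract` and that the core is called `mycore`; both are placeholders, as is the helper name `build_extract_only_request`.

```python
import urllib.parse
import urllib.request


def build_extract_only_request(
    image_bytes: bytes,
    content_type: str = "image/jpeg",
    base_url: str = "http://localhost:8983/solr/mycore",  # assumed host and core name
) -> urllib.request.Request:
    """Build a Solr Cell request that returns extracted text without indexing it."""
    params = urllib.parse.urlencode(
        {"extractOnly": "true", "extractFormat": "text", "wt": "json"}
    )
    return urllib.request.Request(
        f"{base_url}/update/extract?{params}",
        data=image_bytes,
        method="POST",
        headers={"Content-Type": content_type},
    )
```

Passing the request to `urllib.request.urlopen` and printing the response body shows what the embedded Tika produced. An empty body for an image that the tesseract command line handles fine would point at the Tika/Solr configuration, not at Tesseract itself.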



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: regarding Extracting text from Images

Jörn Franke
Have you checked this?

https://cwiki.apache.org/confluence/display/TIKA/TikaOCR


Re: regarding Extracting text from Images

Marco Reis-2
Do you intend to use the solution in production? If so, combining Tika
and Tesseract on the same server may not be a good choice.
Tika and Tesseract are heavy consumers of processing power and would harm the
main service of the solution, in your case the Solr service.
I had the same situation here: the Tika/Tesseract combination on the
production server did not scale, since I have many text documents and
images.
An alternative is to use one microservice for text preprocessing and another
for OCR. You can take some ideas from https://github.com/tleyden/open-ocr.
I have a separate Kubernetes cluster just for this, to extract and OCR
text from binary documents. Now I can scale to a world-class solution.

Marco Reis
Software Engineer
http://marcoreis.net
+55 61 981194620




Re: regarding Extracting text from Images

Retro-2
In reply to this post by Jörn Franke
Yes, I did. That manual refers to the standalone version of Tika, while I
have the built-in version.

Re: regarding Extracting text from Images

Retro-2
In reply to this post by Marco Reis-2
Hello, thank you for the info, I will look into this as well. Yes, we plan to
use it in production, but in the longer run. For the moment I just need to
make it work as a test case.

Re: regarding Extracting text from Images

Retro-2
In reply to this post by Retro-2
Good day,
We solved the situation. Here is what was used and changed:
In our installation we used Tesseract version 3.05, Tika version 1.17, and Solr
version 7.4. (We actually had Tika version 1.17, not 1.18.)
1. Changed the output type from HOCR to TXT in the file parseContext.xml:
   <property name="outputType" value="TXT"/>
2. Had to start Solr as the root user.
Tesseract version 4.1.1 is not compatible with Tika 1.17, so we will upgrade
to Solr version 7.7 and Tika version 1.19, and will try to install Tesseract 4.1.1.
<https://lucene.472066.n3.nabble.com/file/t495209/Capture.png>
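
Editor's sketch for checking the first change: a few lines of Python that read a parseContext.xml and report the configured `outputType`. Only the `<property>` line comes from the post above. The surrounding `<entries>`/`<entry>` layout and the `TesseractOCRConfig` class attribute in `SAMPLE` are my assumption of what such a file looks like, so compare them against your own file before relying on this.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Assumed layout of parseContext.xml; only the <property> line is taken from the post.
SAMPLE = """\
<entries>
  <entry class="org.apache.tika.parser.ocr.TesseractOCRConfig"
         impl="org.apache.tika.parser.ocr.TesseractOCRConfig">
    <property name="outputType" value="TXT"/>
  </entry>
</entries>
"""


def ocr_output_type(xml_text: str) -> Optional[str]:
    """Return the value of the first outputType property found, or None."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.get("name") == "outputType":
            return prop.get("value")
    return None
```

Calling `ocr_output_type(Path("parseContext.xml").read_text())` (with `from pathlib import Path`) on the file in your core's conf directory is a quick way to confirm that the edit actually landed in the file Solr reads.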

Re: regarding Extracting text from Images

s_ge
In reply to this post by sureshpendap
In my experience, enabling Tika at the server level can exhaust heap space under a high volume of extraction and bring down Solr entirely. This is likely because the garbage collector cannot keep up with the load; even tuning the garbage collector didn't resolve the problem completely. Not recommended.
Steve  
 