Fwd: configuring Solr with Tesseract

Admin eLawJournal
Hi,
I have read that we can use tesseract with solr to index image files. I
would like some guidance on setting this up.

Currently, I am using solr for searching my wordpress installation via the
WPSOLR plugin.

I have Solr 6.6 installed on ubuntu 14.04 which is working fine with
wordpress.

I have also installed tesseract but have no clue on configuring it.


I am new to Solr, so I would greatly appreciate detailed step-by-step
instructions.

Thank you very much

Re: Fwd: configuring Solr with Tesseract

Charlie Hull-3
On 03/11/2017 15:32, Admin eLawJournal wrote:

> Hi,
> I have read that we can use tesseract with solr to index image files. I
> would like some guidance on setting this up.
>
> Currently, I am using solr for searching my wordpress installation via the
> WPSOLR plugin.
>
> I have Solr 6.6 installed on ubuntu 14.04 which is working fine with
> wordpress.
>
> I have also installed tesseract but have no clue on configuring it.
>
>
> I am new to solr so will greatly appreciate a detailed step by step
> instruction.

Hi,

I'm guessing if you're using a preconfigured Solr plugin for WP you
probably haven't got your hands properly dirty with Solr yet.

One way to use Tesseract would be via Apache Tika
https://wiki.apache.org/tika/TikaOCR which is an awesome library for
extracting plain text from many different document formats and types.
There's a direct way to use Tesseract from within Solr (the
ExtractingRequestHandler
https://lucene.apache.org/solr/guide/6_6/uploading-data-with-solr-cell-using-apache-tika.html#uploading-data-with-solr-cell-using-apache-tika)
but we don't generally recommend this, as dodgy files can sometimes eat
all your resources during parsing, and if Tika dies then so does Solr. We
usually process the files externally and then feed them to Solr using its
HTTP API.
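
For comparison, the direct route looks roughly like the Python sketch below, which posts a file straight to Solr's ExtractingRequestHandler. The `/update/extract` path and the `literal.id` and `commit` parameters are from the Solr Cell documentation linked above; the host, the core name `wordpress` and the file name are assumptions for illustration only.

```python
import urllib.parse
import urllib.request

SOLR_URL = "http://localhost:8983/solr/wordpress"  # assumed host and core name


def extract_url(doc_id, solr_url=SOLR_URL):
    """Build the Solr Cell URL that indexes one uploaded file under doc_id."""
    params = urllib.parse.urlencode({"literal.id": doc_id, "commit": "true"})
    return f"{solr_url}/update/extract?{params}"


def index_file_directly(path, doc_id, timeout=120):
    """Send the raw file to Solr, which then runs Tika inside its own process."""
    with open(path, "rb") as handle:
        request = urllib.request.Request(
            extract_url(doc_id),
            data=handle.read(),
            headers={"Content-Type": "application/octet-stream"},
            method="POST",
        )
    # The timeout only protects this client; a bad file can still tie up Solr.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    print(index_file_directly("scan.png", "scan-1"))  # hypothetical file and id
```

Note that the parsing happens inside Solr here, which is exactly the risk described above.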

Here's one way to do it - a simple server wrapper around Tika
https://github.com/mattflax/dropwizard-tika-server written by my
colleague Matt Pearce.

So you're going to need to do some coding I think - Python would be a
good choice - to feed your source files to Tika for OCR and extraction,
and then the resulting text to Solr for indexing.
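
As a rough idea of what that coding involves, here is a minimal Python sketch. It assumes the stock Apache Tika Server (not the dropwizard wrapper above, whose API may differ) listening on its default port 9998, a Solr core called `wordpress`, and a text field called `content_txt`; all three are assumptions that would need adjusting to the real setup.

```python
import json
import urllib.request

TIKA_URL = "http://localhost:9998/tika"  # assumed: stock Tika Server, default port
SOLR_UPDATE_URL = "http://localhost:8983/solr/wordpress/update?commit=true"  # assumed core


def extract_text(path, tika_url=TIKA_URL, timeout=300):
    """PUT the file to Tika Server and return the extracted plain text."""
    with open(path, "rb") as handle:
        request = urllib.request.Request(
            tika_url,
            data=handle.read(),
            headers={"Accept": "text/plain"},
            method="PUT",
        )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def build_update_body(doc_id, text):
    """Serialise one document as the JSON array that Solr's /update accepts."""
    # "content_txt" is a hypothetical field name; use one defined in your schema.
    return json.dumps([{"id": doc_id, "content_txt": text}]).encode("utf-8")


def index_text(doc_id, text, solr_url=SOLR_UPDATE_URL, timeout=60):
    """POST the extracted text to Solr over its HTTP API."""
    request = urllib.request.Request(
        solr_url,
        data=build_update_body(doc_id, text),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    text = extract_text("scan.png")  # hypothetical input file
    print(index_text("scan-1", text))
```

Because Tika runs in its own process here, a file that crashes or stalls the extraction does not take Solr down with it.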

Cheers

Charlie

>
> Thank you very much
>


--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk

Re: Fwd: configuring Solr with Tesseract

Admin eLawJournal
Hi Charlie,

Thanks for the reply. You're right, I haven't got my hands dirty with Solr
yet. I am not from an IT background and learnt everything I know through
lots of reading online. However, all the documentation on Solr assumes that
the reader has advanced IT knowledge. In fact, it took me a week to learn
how to install Solr and configure its index to work with WordPress.

Getting Solr to do OCR appears to be beyond me, and I can't code.

*Would you consider setting this up for me for a fee?*

I would also need a step-by-step guide for dummies in case I want to
upgrade in the future.

I also noticed that Tika 1.14 is capable of OCR by itself. I would be okay
with a setup of Solr using Tika 1.14 to OCR the PDFs, if that is possible.

Best regards,
Anand



Re: Fwd: configuring Solr with Tesseract

Rick Leir-2
In reply to this post by Charlie Hull-3
Anand,
As Charlie says, you should have a separate process for this. Also, if you go back about ten months in this mailing list, you will see some discussion about how OCR can take minutes of CPU time per page and needs some preprocessing with ImageMagick or GraphicsMagick. You will want to do some fine-tuning with this, then save your OCR output in a database or on the filesystem. That way you can re-index Solr easily as you fine-tune Solr itself.
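
A minimal Python sketch of that idea: preprocess with ImageMagick, OCR with Tesseract, and keep the text on the filesystem so a Solr re-index never repeats the OCR. The `ocr_cache` directory and the particular ImageMagick options (greyscale plus deskew) are only assumptions to illustrate the shape; the right preprocessing depends on your scans.

```python
import hashlib
import subprocess
from pathlib import Path

CACHE_DIR = Path("ocr_cache")  # hypothetical location for saved OCR text


def preprocess_command(source, cleaned):
    """ImageMagick call: convert to greyscale and straighten skewed scans."""
    return ["convert", str(source), "-colorspace", "Gray", "-deskew", "40%", str(cleaned)]


def ocr_command(image):
    """Tesseract call that writes the recognised text to standard output."""
    return ["tesseract", str(image), "stdout"]


def cache_path(source, cache_dir=CACHE_DIR):
    """Name the cached text after the file's content hash, so edits invalidate it."""
    digest = hashlib.sha256(Path(source).read_bytes()).hexdigest()
    return Path(cache_dir) / f"{digest}.txt"


def ocr_with_cache(source, cache_dir=CACHE_DIR):
    """Return OCR text, running ImageMagick and Tesseract only on a cache miss."""
    target = cache_path(source, cache_dir)
    if target.exists():
        return target.read_text(encoding="utf-8")
    target.parent.mkdir(parents=True, exist_ok=True)
    cleaned = target.with_suffix(".png")  # preprocessed image kept beside the text
    subprocess.run(preprocess_command(source, cleaned), check=True)
    result = subprocess.run(ocr_command(cleaned), check=True, capture_output=True)
    text = result.stdout.decode("utf-8")
    target.write_text(text, encoding="utf-8")
    return text


if __name__ == "__main__":
    print(ocr_with_cache("scan.tif"))  # hypothetical input file
```

With the text cached like this, re-indexing only has to read the saved `.txt` files and post them to Solr, which takes seconds rather than minutes per page.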

Yes, use Python or your preferred scripting language.
Cheers -- Rick


--
Sorry for being brief. Alternate email is rickleir at yahoo dot com

Re: Fwd: configuring Solr with Tesseract

Admin eLawJournal
Thanks Rick, minutes of CPU is definitely going to break my site. I'm
looking for someone to hire as I have no coding knowledge. Please let me
know if you are up for it.


Re: Fwd: configuring Solr with Tesseract

lala
In reply to this post by Rick Leir-2
Hi, can you please point me to "the discussion about how OCR can take
minutes of CPU per page"? I really need to understand more about how Tika
OCR behaves with Solr.



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html