TIKA OCR not working


trung.ht
Hi,

I want to use Solr to index some scanned documents. After setting up a Solr document with two fields, "content" and "filename", I tried to upload the attached file, but the extracted content of the file is only "\n \n \n....".
But when I run tesseract from the command line, I get the correct result.

The log when Solr receives my request:
-----------
INFO  - 2015-04-23 03:49:25.941; org.apache.solr.update.processor.LogUpdateProcessor; [collection1] webapp=/solr path=/update/extract params={literal.groupid=2&json.nl=flat&resource.name=phplNiPrs&literal.id=4&commit=true&extractOnly=false&literal.historyid=4&omitHeader=true&literal.userid=3&literal.createddate=2015-04-22T15:00:00Z&fmap.content=content&wt=json&literal.filename=\\trunght\test\tesseract_3.png}
------------
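
For reference, a request like the one in the log above can be rebuilt with a short script. This is only a minimal sketch, not the client code actually used in this thread: it reconstructs the /update/extract URL from a few of the logged parameters. The host, port, and the helper name build_extract_url are assumptions for illustration; the image itself would be sent as the body of a POST to this URL.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_extract_url(base_url, params):
    """Return the /update/extract URL for a Solr core URL and request parameters."""
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)


# A subset of the parameters visible in the log line above.
params = {
    "literal.id": "4",
    "literal.filename": r"\\trunght\test\tesseract_3.png",
    "fmap.content": "content",  # map Tika's extracted text into the "content" field
    "commit": "true",
    "wt": "json",
}

# Hypothetical host and port; the core name "collection1" is taken from the log.
url = build_extract_url("http://localhost:8983/solr/collection1", params)
print(url)
```

Building the query string with urlencode keeps the backslashes in the filename correctly escaped, which is easy to get wrong when the URL is assembled by hand.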

The document when I check it on the Solr admin page:
-------------
{ "groupid": 2, "id": "4", "historyid": 4, "userid": 3, "createddate": "2015-04-22T15:00:00Z", "filename": "\\\\trunght\\test\\tesseract_3.png", "autocomplete_text": [ "\\\\trunght\\test\\tesseract_3.png" ], "content": " \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n ", "_version_": 1499213034586898400 }
-----------
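
The symptom shown above, a "content" field that holds nothing but spaces and newlines, can be checked for in a script after indexing. The following is a minimal sketch; the helper name is_blank_extraction is hypothetical, and the sample document is a shortened copy of the one above.

```python
def is_blank_extraction(doc, field="content"):
    """Return True if the extracted text field is missing or whitespace only."""
    value = doc.get(field)
    if value is None:
        return True
    if isinstance(value, list):  # a multi-valued field comes back as a list
        value = " ".join(value)
    return value.strip() == ""


# Shortened version of the indexed document shown above.
doc = {
    "id": "4",
    "filename": r"\\trunght\test\tesseract_3.png",
    "content": " \n \n \n \n ",
}
print(is_blank_extraction(doc))  # prints True
```

A blank result for an image file suggests that no OCR step ran at all, rather than that OCR ran and produced poor text.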

Since I am a Solr newbie, I do not know where to look. Can anyone advise me on where to look for errors, or which settings to change to make it work?
Thanks in advance.

Trung.

Re: TIKA OCR not working

Ahmet Arslan
Hi Trung,

solr-cell (tika) does not do OCR. It cannot extract text from image-based PDFs.

Ahmet




Re: TIKA OCR not working

trung.ht
Hi Ahmet,

I used a PNG file, not a PDF file. From the documentation, I understand that
Solr will post the file to Tika, and that OCR has been included since Tika 1.7.
Is there something I misunderstood?

Trung.


Re: TIKA OCR not working

Ahmet Arslan
Hi Trung,

I didn't know about the OCR capabilities of Tika.
Someone who is familiar with solr-cell can tell us whether this functionality has been added to Solr or not.

Ahmet




Re: TIKA OCR not working

Alexandre Rafalovitch
I think OCR is in Tika 1.8, so might be in Solr 5.?. But I haven't seen it
in use yet.

Regards,
    Alex

Re: TIKA OCR not working

Jack Krupansky-3
It's not clear if OCR would happen automatically in Solr Cell, or if
changes to Solr would be needed.

For Tika OCR info, see:

https://issues.apache.org/jira/browse/TIKA-93
https://wiki.apache.org/tika/TikaOCR



-- Jack Krupansky


Re: TIKA OCR not working

trung.ht
Hi Jack, Alexandre,

Thanks for answering.
I read the Tika documentation. Tika 1.7 supports OCR and Solr 5.0 uses Tika 1.7,
but it looks like it does not work. Does anyone know whether Tika OCR works
automatically with Solr, or do I have to change some settings?

Trung.





Re: TIKA OCR not working

trung.ht
Hi everyone,

Does anyone have an answer for this problem? :)

> I read the Tika documentation. Tika 1.7 supports OCR and Solr 5.0 uses Tika 1.7,
> but it looks like it does not work. Does anyone know whether Tika OCR works
> automatically with Solr, or do I have to change some settings?

Trung.

FW: TIKA OCR not working

Allison, Timothy B.
Trung,

I haven't experimented with our OCR parser yet, but this should give a good start: https://wiki.apache.org/tika/TikaOCR .

Have you installed tesseract?

Tika colleagues,
  Any other tips?  What else has to be configured and how?


Re: TIKA OCR not working

Mattmann, Chris A (3010)
It should work out of the box in Solr as long as Tesseract is
installed and on the path. Solr had an issue with it because
Tika sends two startDocument calls, but I fixed that with Uwe, and
the fix shipped in 4.10.4 and in 5.x, I think.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [hidden email]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++







RE: TIKA OCR not working

Uwe Schindler
In reply to this post by Allison, Timothy B.
Hi,
Tika OCR definitely works automatically with Solr 5.x.

The important part is to install Tesseract OCR, the native tool that does the actual work, so that it is on the path. On Ubuntu Linux this should be quite simple ("apt-get install tesseract-ocr" or similar). You may also need to install additional language packs for better results.

As long as the native tools are installed and you are not on a Turkish-localized machine (which triggers a bug in the JDK when spawning external processes), it should work out of the box with no configuration needed. Please also check the log files.
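
One quick way to check the "installed and on the path" condition is a small script like the one below. It is a sketch under stated assumptions: the helper names find_tesseract and tesseract_version are made up for illustration, and the script only sees the PATH of the process it runs in. What matters for Solr is the environment of the Solr (JVM) process, so it is best run as the same user and from the same environment that starts Solr.

```python
import shutil
import subprocess


def find_tesseract(path=None):
    """Return the full path of the tesseract executable, or None if not found.

    `path` is an optional search path string; by default the PATH
    environment variable of the current process is used.
    """
    return shutil.which("tesseract", path=path)


def tesseract_version(executable):
    """Return the first line printed by `tesseract --version`."""
    result = subprocess.run(
        [executable, "--version"], capture_output=True, text=True
    )
    # Some builds print the version banner to stderr instead of stdout.
    output = result.stdout or result.stderr
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    exe = find_tesseract()
    if exe is None:
        print("tesseract not found on PATH")
    else:
        print(exe, "->", tesseract_version(exe))
```

If the script finds tesseract from an interactive shell but the indexed content is still blank, a difference between the shell's PATH and the PATH seen by the Solr service is one thing worth ruling out.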

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [hidden email]



RE: TIKA OCR not working

Uwe Schindler
In reply to this post by Mattmann, Chris A (3010)
Yes that is fixed.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [hidden email]


> -----Original Message-----
> From: Mattmann, Chris A (3980) [mailto:[hidden email]]
> Sent: Monday, April 27, 2015 4:29 PM
> To: [hidden email]
> Cc: [hidden email]; [hidden email]
> Subject: Re: TIKA OCR not working
>
> It should work out of the box in Solr as long as Tesseract is installed and on
> the class path. Solr had an issue with it since Tika sends 2 startDocument calls,
> but I fixed that with Uwe and it was shipped in 4.10.4 and in 5.x I think?

Re: TIKA OCR not working

Konstantin Gribov
JFYI, there are no tesseract or leptonica packages for CentOS 6/RHEL 6 (not
even in EPEL), so I have put spec files for building tesseract and leptonica
(its dependency) on GitHub (https://github.com/grossws/tesseract-ocr-specs).
Feel free to use them if you're on CentOS/RHEL.

Also, tesseract language packs are trained for one language each, so a
dual-language document will give quite poor OCR results even when both
languages use Latin characters. You can use
org.apache.tika.parser.ocr.TesseractOCRConfig.setLanguage(String) to set the
language pack for OCR.
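
For comparison with the command-line runs mentioned in this thread, the tesseract CLI selects a language pack with its -l option. The Python sketch below only builds and runs that command; the helper names, the file names and the "eng" language code are placeholders chosen for illustration, and it assumes the corresponding language pack is installed.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list for: tesseract <image> <output_base> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="eng"):
    """Run tesseract; it writes the recognised text to <output_base>.txt."""
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    return output_base + ".txt"
```

Running the same image once per candidate language and comparing the text files is a simple way to see how much the choice of pack matters for a given document.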

--
Best regards,
Konstantin Gribov


Re: TIKA OCR not working

Mattmann, Chris A (3010)
Thanks Konstantin!

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [hidden email]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++







Re: TIKA OCR not working

trung.ht
In reply to this post by Uwe Schindler
Hi Uwe,

Thanks for the answer, but it looks like it does not work on my machine.

I use Mac OS X 10.10.3. Tesseract is installed through Homebrew, and I tested it with the same file I posted to Solr.
I think tesseract is on the PATH, since this command runs successfully: "tesseract test_tesseract.png output"

On the command line I get the correct result (the output is the correct content of the image), but when I upload the file to Solr, the content is only some newline characters. (I used 

About the log file: I did not see anything abnormal in the Solr log (nothing abnormal after my POST request). Am I missing another log file?
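
The symptom described here (a content field holding nothing but newlines) can be checked mechanically once the document has been fetched from Solr. The Python sketch below assumes the document is available as a JSON string shaped like the one shown on the admin page earlier in this thread; the function name is hypothetical, and the field name "content" comes from the schema used here.

```python
import json


def is_blank_extraction(doc_json, field="content"):
    """Return True if the field is missing or contains only whitespace."""
    doc = json.loads(doc_json)
    value = doc.get(field, "")
    if isinstance(value, list):  # multi-valued fields come back as lists
        value = " ".join(value)
    return value.strip() == ""
```

A True result for an image that tesseract reads correctly on the command line suggests the OCR step never ran or its output was dropped, rather than that the image is unreadable.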

With best regards,
Trung.



Re: TIKA OCR not working

trung.ht
Hi Uwe,

Today I downloaded Solr 5.1 and it worked fine. It seems that the bug fix
SOLR-7139 is included only in 5.1, not in 5.0.

Thanks, everyone, for your support.

Trung.
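
P.S. For anyone who lands on this thread with the same symptom (a "content" field holding only whitespace), here is a minimal Python sketch of two checks that can help narrow things down. It is not part of Solr or Tika, and the function names are made up for illustration. The first check confirms that a tesseract executable is visible on the PATH of the current process; run it from the same environment that starts Solr, on the assumption that its PATH may differ from an interactive shell's. The second flags an extracted value that contains nothing but whitespace.

```python
import shutil


def find_tesseract(executable="tesseract"):
    """Return the full path to the tesseract binary, or None if it is not on PATH."""
    return shutil.which(executable)


def is_blank_extraction(content):
    """Return True when the extracted text has no non-whitespace characters."""
    return content is None or content.strip() == ""


if __name__ == "__main__":
    path = find_tesseract()
    print("tesseract found at:", path if path else "not on PATH")
    # Sample value shaped like the "content" field reported in this thread.
    print("blank extraction:", is_blank_extraction(" \n \n \n "))
```

If tesseract is found and the extraction is still blank, the problem is more likely on the Solr side, as it was here with 5.0.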

On Tue, Apr 28, 2015 at 10:21 AM, trung.ht <[hidden email]> wrote:

> Hi Uwe,
>
> Thanks for the answer, but it does not seem to work on my machine.
>
> I use Mac OS 10.10.3; tesseract is installed through Homebrew, and I tested it
> with the same file I posted to Solr.
> I think tesseract is on the path, since this command runs successfully: "tesseract
> test_tesseract.png output"
>
> On the command line I get the correct result (the output is the correct content of
> the image), but when I upload to Solr, the content is only some newline
> characters. (I used
>
> About the log file: I did not see anything abnormal in the Solr log file (nothing
> abnormal after my POST request). Am I missing another log file?
>
> With best regards,
> Trung.
>
>
> On Mon, Apr 27, 2015 at 9:34 PM, Uwe Schindler <[hidden email]> wrote:
>
>> Hi,
>> Tika OCR definitely works automatically with Solr 5.x.
>>
>> The important part is to install Tesseract OCR on the path (it is the native
>> tool that does the actual work). On Ubuntu Linux this should be quite
>> simple ("apt-get install tesseract-ocr" or similar). You may also need to
>> install additional languages for better results.
>>
>> As long as the native tools are installed and you are not on a
>> Turkish-localized machine (which triggers a bug in the JDK when spawning
>> external processes), it should work out of the box, with no configuration
>> needed. Please also check the log files.
>>
>> Uwe
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: [hidden email]

Re: TIKA OCR not working

Erick Erickson
Yes, the critical bit for knowing which release a JIRA is in is the
"Fix Version/s" entry. You have to be a little careful, though, to rely on it
only when the Resolution is "Fixed", as the fix version is sometimes set
while the JIRA is still open.
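
As a rough illustration of that rule, here is a small Python sketch. It assumes the issue has already been fetched as a dictionary shaped like JIRA's REST JSON (a "fields" object holding "resolution" and "fixVersions"); the function name and the sample data are made up for the example and are not taken from any real issue.

```python
def released_fix_versions(issue):
    """Return fix-version names, but only when the issue is resolved as Fixed."""
    fields = issue.get("fields", {})
    # "resolution" is null/None while the issue is still open.
    resolution = fields.get("resolution") or {}
    if resolution.get("name") != "Fixed":
        return []
    return [version["name"] for version in fields.get("fixVersions", [])]


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    still_open = {"fields": {"resolution": None,
                             "fixVersions": [{"name": "5.1"}]}}
    fixed = {"fields": {"resolution": {"name": "Fixed"},
                        "fixVersions": [{"name": "5.1"}]}}
    print(released_fix_versions(still_open))  # []
    print(released_fix_versions(fixed))       # ['5.1']
```

The open issue returns an empty list even though a fix version is set, which is the case to watch out for.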
