Improving Tika OCR

Improving Tika OCR

Kranthi Kiran G V
Hello Tim Allison,

I am currently working on improving Tika's OCR capabilities.
Following a suggestion from Thamme Gowda (@thammegowda
<https://issues.apache.org/jira/secure/ViewProfile.jspa?name=thammegowda>),
I have started comparing Tesseract 4.0's neural network
<https://github.com/tesseract-ocr/tesseract/wiki/NeuralNetsInTesseract4.00>
subsystem and Visual Geometry Group's (VGG) models
<http://www.robots.ox.ac.uk/~vgg/research/text/>.

It would be great if you could provide the dataset for testing the OCR that
you mentioned in one of the issues.

I will compare their running time, accuracy, memory consumption, and
invariance to lighting, orientation, etc., and then integrate the
appropriate models into Tika's OCR.
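
The thread does not say how accuracy would be scored. One common choice, assumed here purely for illustration, is the character error rate: the edit distance between the OCR output and a ground-truth transcription, divided by the length of the transcription. The function names below are hypothetical, and the sketch is in Python rather than Tika's Java.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalised by the reference length (lower is better)."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, character_error_rate("Tika OCR", "Tika 0CR") gives 0.125: one substituted character out of eight.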

Thank you,
Kranthi Kiran GV,
CS 3/4 Undergrad,
NIT Warangal

Re: Improving Tika OCR

Thamme Gowda
Thanks, Kranthi, for volunteering to do this evaluation :-)

Best,
Thamme


--
Thamme Gowda
TG | @thammegowda
~Sent via somebody's IMAP server


On Apr 17, 2017 4:46 AM, "Kranthi Kiran G V" <[hidden email]>
wrote:


Re: Improving Tika OCR

Luís Filipe Nassif
In reply to this post by Kranthi Kiran G V
Hi Kranthi,

That is an interesting comparison! But I think Tesseract 4.0 is still
alpha? And do you know the VGG software license?

Best,
Luis

On Apr 17, 2017 8:46 AM, "Kranthi Kiran G V" <
[hidden email]> wrote:


Re: Improving Tika OCR

Kranthi Kiran G V
Hello Luis,
Yes, Tesseract 4.0 is not yet a stable release. The VGG group's model has a
3-clause BSD license.

I see this as a long-term effort that would give the Tika community
near state-of-the-art OCR.

For now, this is an investigation to see whether this direction is worth
pursuing. Thanks for expressing your views.

Thank you,
Kranthi Kiran GV

On Apr 18, 2017 2:44 AM, "Luís Filipe Nassif" <[hidden email]> wrote:


Re: Improving Tika OCR

Kranthi Kiran G V
Hello community,
I have successfully tested Tesseract 4.0 on various images of different
sizes, orientations and lighting conditions. In the next few days I will
publish the results on a blog for you to look at.

Although I can reliably measure wall-clock time, accuracy, etc., I have
not yet found a way to reliably measure the memory consumed. Any pointers
on this from the developer community would be appreciated.
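
One possible approach on Linux or macOS, offered as a sketch rather than the method used in this evaluation: when the OCR engine is run as an external command, the operating system can report that child's peak resident set size when it exits. The helper name measure_command and the image file name in the comment are hypothetical; os.wait4 and ru_maxrss are standard Python/POSIX facilities. Note that ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.

```python
import os
import subprocess
import sys
import time


def measure_command(cmd):
    """Run cmd and return (wall_seconds, peak_rss, exit_code) for that child."""
    start = time.perf_counter()
    proc = subprocess.Popen(cmd,
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    # os.wait4 reaps the child and returns the resource usage of that
    # specific process, including its peak resident set size.
    _, status, usage = os.wait4(proc.pid, 0)
    elapsed = time.perf_counter() - start
    # Record the exit code on the Popen object since we reaped the child
    # ourselves (os.waitstatus_to_exitcode needs Python 3.9 or newer).
    proc.returncode = os.waitstatus_to_exitcode(status)
    return elapsed, usage.ru_maxrss, proc.returncode


# Hypothetical usage, assuming the tesseract CLI is installed and
# sample.png exists:
#   measure_command(["tesseract", "sample.png", "stdout"])
```

This only covers memory held by the child process itself; anything the engine offloads to a GPU or to further subprocesses would need separate accounting.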

The VGG group has released two models
<http://www.robots.ox.ac.uk/~vgg/research/text/#sec-models>. I am not able
to test either of them yet because of backward-compatibility problems with
the MatConvNet they use; I am running a recent version of MATLAB. For now,
I am trying to work around this by updating
parts of the code. I am also contacting the maintainers of the repository
for help with these issues.
I am hopeful that I can get them running.

Addressing Luis' concern, we won't be building VGG's models into Tika's
source. We would only help the user deploy a REST API to which Tika's OCR
subsystem passes the images and from which it retrieves the recognized
text as a string.
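
The arrangement described above (image bytes go to a separately deployed service, a string comes back) can be sketched as follows. Everything here is an assumption for illustration: the /ocr path, the raw-bytes request body, the plain-text response, and the names StubOcrHandler, ocr_via_rest and demo. The stub server stands in for the real model and only reports how many bytes it received. Python is used for brevity; Tika itself is Java.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class StubOcrHandler(BaseHTTPRequestHandler):
    """Stand-in for the external OCR service (does no real recognition)."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        image_bytes = self.rfile.read(length)
        # A real service would run the model on image_bytes here.
        body = f"received {len(image_bytes)} bytes".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the demo quiet


def ocr_via_rest(image_bytes, url, timeout=30):
    """POST raw image bytes to the service and return its reply as a string."""
    request = urllib.request.Request(
        url,
        data=image_bytes,
        method="POST",
        headers={"Content-Type": "application/octet-stream"},
    )
    # Bypass any configured proxy; the service is assumed to be local.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def demo(payload=b"abc"):
    """Start the stub on a free local port, send one request, shut down."""
    server = HTTPServer(("127.0.0.1", 0), StubOcrHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/ocr"
        return ocr_via_rest(payload, url)
    finally:
        server.shutdown()
        server.server_close()
```

Keeping the model behind an HTTP boundary like this is what lets it stay out of Tika's source tree: the caller depends only on the URL and the string it gets back.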

Thank you,
Kranthi Kiran GV,
CS 3/4 Undergrad,
NIT Warangal

On Tue, Apr 18, 2017 at 8:43 AM, Kranthi Kiran G V <
[hidden email]> wrote:


Re: Improving Tika OCR

Thamme Gowda
Hi Kranthi,

Thanks for updating us.
I believe that in the long run these two models may co-exist (Tesseract
for flatbed scanner images with perfect lighting conditions, VGG models
for natural images taken by cellphone/digital cameras with unusual
orientations and lighting conditions).

I agree with you: we can offer VGG OCR as an optional REST API and let
users agree to its license if they want to use it. Thanks, Luis, for the
feedback :-)

Keep up the good work and keep this email thread updated with your findings.

Thanks,
TG

*--*
*Thamme Gowda*
TG | @thammegowda <https://twitter.com/thammegowda>
~Sent via somebody's Webmail server!

On Wed, Apr 19, 2017 at 6:12 AM, Kranthi Kiran G V <
[hidden email]> wrote:


Re: Improving Tika OCR

Kranthi Kiran G V
Hello Thamme,

Agreed. Looking at the paper [1], it seems to me that Tesseract and the VGG
models can co-exist
in Tika to serve all kinds of input images.

I am able to run one of the models, Deep Features for Text Spotting [2], by
disabling the GPU.
However, it does not generate any text, only features. My
initial assumption that
the MATLAB version was causing the issue has thus been proven wrong.
The problem lies with the MatConvNet that is bundled with the models. It is
a very old version
that doesn't even resemble the current structure. I'm having problems
building it on my system
for the other model, Synthetic Data and Artificial Neural Networks for
Natural Scene Text Recognition [1].
Note that both models are supplied with custom versions of MatConvNet.

Nevertheless, we could make the system use the latest version of MatConvNet
by rebuilding the network layer
by layer from the MAT file [3]. I would like to hear your views on whether
or not I should attempt it.


Thank you,
Kranthi Kiran GV,
CS 3/4 Undergrad,
NIT Warangal



[1]
http://www.robots.ox.ac.uk/~vgg/publications/2014/Jaderberg14c/jaderberg14c.pdf
[2]
http://www.robots.ox.ac.uk/~vgg/publications/2014/Jaderberg14/jaderberg14.pdf.pdf
[3] https://github.com/vlfeat/matconvnet/issues/239

On Wed, Apr 19, 2017 at 10:42 PM, Thamme Gowda <[hidden email]>
wrote:


Re: Improving Tika OCR

Thamme Gowda
Thanks, Kranthi.

Keep us informed about how it goes.

Cheers,
TG

On Thu, Apr 20, 2017 at 1:01 PM, Kranthi Kiran G V <
[hidden email]> wrote:
