Wrong language detection in tika server 1.22

Juan Elosua
Hi all,

Since this is my first email, allow me to give some context: my name is Juan
Elosua, and I came across Tika while looking for a document parser for an
information security project we are working on.

First of all, sorry if this is not the right way to report potential issues;
I was unsure how to communicate them.

The potential issue I found concerns tika-server version 1.22, more
precisely the language detection endpoint.

If I send a PDF document to that endpoint, it returns 'th' (Thai) as the
detected language, although the document is in Spanish. After converting the
PDF to a plain text file (using pdftotext) and rerunning the test, the
language is detected correctly as 'es':

$ curl -X PUT --data-binary @BOE-A-2019-9455.pdf http://localhost:9998/language/stream
th
$ curl -X PUT --data-binary @BOE-A-2019-9455.txt http://localhost:9998/language/stream
es

I used a publicly available PDF file to make this easy to reproduce; you can
find the original document here:
https://www.boe.es/boe/dias/2019/06/24/pdfs/BOE-A-2019-9455.pdf

Please let me know the best way to report issues.

I saw the "reporting issues" docs for Tika, but should I create an account in
order to report issues, or is that something internal to the core team?

Thanks in advance

Juan

Re: Wrong language detection in tika server 1.22

Tim Allison
Looking at the source code for this (for the first time?)... it looks
like that endpoint expects UTF-8 text.  It does not parse the file and then
run language identification on the parsed text.
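
Given that, one possible workaround is to let tika-server extract the text first and only then pass that text to the language endpoint. The following is a minimal sketch, assuming the server from the report is listening on localhost:9998 and that its /tika endpoint returns plain text when the request carries an "Accept: text/plain" header. The TIKA_URL variable and the detect_file_language name are made up for illustration.

```shell
#!/bin/sh
# Base URL of the running tika-server (assumption: same host/port as in the report).
TIKA_URL="${TIKA_URL:-http://localhost:9998}"

# Hypothetical helper: print the detected language code for a document file.
# $1 = path to the document (PDF, DOCX, ...).
detect_file_language() {
    # Step 1: upload the file to /tika and ask for the extracted plain text.
    # Step 2: pipe that text into /language/stream, which expects UTF-8 text.
    curl -s -T "$1" -H "Accept: text/plain" "$TIKA_URL/tika" |
        curl -s -X PUT --data-binary @- "$TIKA_URL/language/stream"
}
```

Called as `detect_file_language BOE-A-2019-9455.pdf`, this would be expected to behave like the pdftotext route from the original report, although that has not been verified against 1.22 here.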

On Thu, Dec 5, 2019 at 6:43 AM Juan Elosua <[hidden email]> wrote:

> [...]

Re: Wrong language detection in tika server 1.22

Tim Allison
I just updated our wiki.  Please let me know if we can improve it further.

https://cwiki.apache.org/confluence/display/TIKA/TikaJAXRS#TikaJAXRS-LanguageResource

On Thu, Dec 5, 2019 at 10:44 AM Tim Allison <[hidden email]> wrote:

> [...]

Re: Wrong language detection in tika server 1.22

Juan Elosua
Hi Tim,

Understood, so the only difference between the /stream and /string endpoints
is the byte-stream-to-UTF-8 conversion.
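
One way to check that understanding is to send the same UTF-8 text to both endpoints and compare the answers. This is only a sketch under the assumptions already used in this thread (tika-server on localhost:9998, PUT requests as in the original report). The TIKA_URL variable and the detect_text_language name are hypothetical.

```shell
#!/bin/sh
# Base URL of the running tika-server (assumption: same host/port as in the report).
TIKA_URL="${TIKA_URL:-http://localhost:9998}"

# Hypothetical helper: send a piece of UTF-8 text to one language endpoint.
# $1 = endpoint name ("stream" or "string"), $2 = text to classify.
detect_text_language() {
    # printf '%s' writes the text without adding a trailing newline;
    # --data-binary @- makes curl forward stdin unmodified as the request body.
    printf '%s' "$2" |
        curl -s -X PUT --data-binary @- "$TIKA_URL/language/$1"
}
```

If the two endpoints really differ only in how the body is decoded, `detect_text_language stream "..."` and `detect_text_language string "..."` should print the same language code for the same text.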

With the change to the wiki, it is clearer that the file handling is limited
to that conversion.

Thank you

Cheers

Juan

On Thu, Dec 5, 2019, 17:21 Tim Allison <[hidden email]> wrote:

> [...]