Usage of Tika LanguageIdentifier in language-identifier plugin

Usage of Tika LanguageIdentifier in language-identifier plugin

Yossi Tamari
Hi

 

The language-identifier plugin uses
org.apache.tika.language.LanguageIdentifier for extracting the language from
the document text. There are two issues with that:

1. LanguageIdentifier is deprecated in Tika.
2. It does not support CJK languages (and I suspect a lot of other languages -
https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes),
and it doesn't even fail gracefully with them - in my experience Chinese was
recognized as Italian.

 

Since LanguageIdentifier was superseded in Tika by
org.apache.tika.language.detect.LanguageDetector, the obvious change would be to
switch the plugin to that class as well. However, the design of LanguageDetector
is terrible: it makes the implementation non-reentrant, meaning the full language
model would have to be reloaded on each call to the detector.

 

For my needs, I have modified the plugin to use
com.optimaize.langdetect.LanguageDetector directly, which is what Tika's
LanguageDetector uses internally (at least by default). My question is
whether that is a change that should be made to the official plugin.
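
For the record, the change boils down to something like the sketch below (the class name and wiring are illustrative, not the actual plugin code; the optimaize calls are from the langdetect 0.6 API and worth double-checking against the version Tika pulls in):

import java.io.IOException;
import java.util.List;

import com.google.common.base.Optional;
import com.optimaize.langdetect.LanguageDetector;
import com.optimaize.langdetect.LanguageDetectorBuilder;
import com.optimaize.langdetect.i18n.LdLocale;
import com.optimaize.langdetect.ngram.NgramExtractors;
import com.optimaize.langdetect.profiles.LanguageProfile;
import com.optimaize.langdetect.profiles.LanguageProfileReader;
import com.optimaize.langdetect.text.CommonTextObjectFactories;
import com.optimaize.langdetect.text.TextObjectFactory;

public class OptimaizeLangIdSketch {

  // Built once (e.g. in setConf()); the detector itself holds no per-request
  // state, everything request-specific lives in the TextObject passed to detect().
  private final LanguageDetector detector;
  private final TextObjectFactory textFactory =
      CommonTextObjectFactories.forDetectingOnLargeText();

  public OptimaizeLangIdSketch() throws IOException {
    List<LanguageProfile> profiles = new LanguageProfileReader().readAllBuiltIn();
    detector = LanguageDetectorBuilder.create(NgramExtractors.standard())
        .withProfiles(profiles)
        .build();
  }

  // Returns an ISO 639 code such as "en" or "zh", or null if undetermined.
  public String identify(String text) {
    Optional<LdLocale> lang = detector.detect(textFactory.forText(text));
    return lang.isPresent() ? lang.get().getLanguage() : null;
  }
}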

 

Thanks,

               Yossi.

Re: Usage of Tika LanguageIdentifier in language-identifier plugin

Sebastian Nagel
Hi Yossi,

why not port it to use
   http://tika.apache.org/1.16/api/org/apache/tika/language/detect/LanguageDetector.html

The upgrade to Tika 1.16 is already in progress (NUTCH-2439).

Sebastian

RE: Usage of Tika LanguageIdentifier in language-identifier plugin

Yossi Tamari
Hi Sebastian,

Please reread the second paragraph of my email 😊.
In short, it is not possible to initialize the detector in setConf and then reuse it, and initializing it per call would be extremely slow.

        Yossi.


Re: Usage of Tika LanguageIdentifier in language-identifier plugin

Sebastian Nagel
Hi Yossi,

Sorry, while fast-reading I thought it was about the old LanguageIdentifier.

> it is not possible to initialize the detector in setConf and then reuse it

Could you explain why? The API/interface should allow getting an instance and calling loadModels(), shouldn't it?

>>> For my needs, I have modified the plugin to use
>>> com.optimaize.langdetect.LanguageDetector directly, which is what

Of course, that's also possible. Or just add a plugin language-identifier-optimaize.

Btw., I recently had a look at various open source language identifier implementations and would prefer
langid (a port from Python/C) because it's faster and has better precision:
  https://github.com/carrotsearch/langid-java.git
  https://github.com/saffsd/langid.c.git
  https://github.com/saffsd/langid.py.git
Of course, CLD2 (https://github.com/CLD2Owners/cld2.git) is unbeaten (but it's C++).

Thanks,
Sebastian

RE: Usage of Tika LanguageIdentifier in language-identifier plugin

Yossi Tamari
Why not LanguageDetector: the API does not separate the detector object, which contains the model and should be reused, from the text writer object, which should be request-specific. The same API object instance contains references to both. In code terms, both loadModels() and addText() are non-static members of LanguageDetector.
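
To illustrate the coupling (a rough sketch using the Tika 1.16 method names, not the actual plugin code):

import java.io.IOException;

import org.apache.tika.language.detect.LanguageDetector;
import org.apache.tika.language.detect.LanguageResult;

public class TikaDetectorSketch {

  // A single instance carries both the loaded models and the accumulated text.
  public static String detect(String docText) throws IOException {
    LanguageDetector detector = LanguageDetector.getDefaultLanguageDetector();
    detector.loadModels();   // expensive: loads the full set of language models

    // The text goes into the *same* object that holds the models, so a shared
    // instance would have to be reset between documents and cannot be fed from
    // several parse threads at once.
    detector.addText(docText);
    LanguageResult result = detector.detect();
    return result.getLanguage();
  }
}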

Developing another language-identifier-optimaize is basically what I have done locally, but it seems to me having both in the Nutch repository would just be confusing for users. 99% of the code would also be duplicated (the relevant code is about 5 lines).

I chose optimaize mainly because Tika did. Using langid instead should be very simple, but the fact that the project has not seen a single commit in the last 4 years, and the usage numbers are also quite low, gives me pause...


> -----Original Message-----
> From: Sebastian Nagel [mailto:[hidden email]]
> Sent: 24 October 2017 13:18
> To: [hidden email]
> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin
>
> Hi Yossi,
>
> sorry while fast-reading I've thought it's about the old LanguageIdentifier.
>
> > it is not possible to initialize the detector in setConf and then reuse it
>
> Could explain why? The API/interface should allow to get an instance and call
> loadModels() or not?
>
> >>> For my needs, I have modified the plugin to use
> >>> com.optimaize.langdetect.LanguageDetector directly, which is what
>
> Of course, that's also possible. Or just add a plugin language-identifier-
> optimaize.
>
> Btw., I recently had a look on various open source language identifier
> implementations would prefer
> langid (a port from Python/C) because it's faster and has a better precision:
>   https://github.com/carrotsearch/langid-java.git
>   https://github.com/saffsd/langid.c.git
>   https://github.com/saffsd/langid.py.git
> Of course, CLD2 (https://github.com/CLD2Owners/cld2.git) is unbeaten (but it's
> C++).
>
> Thanks,
> Sebastian
>
> On 10/24/2017 11:46 AM, Yossi Tamari wrote:
> > Hi Sebastian,
> >
> > Please reread the second paragraph of my email 😊.
> > In short, it is not possible to initialize the detector in setConf and then reuse it,
> and initializing it per call would be extremely slow.
> >
> > Yossi.
> >
> >
> >> -----Original Message-----
> >> From: Sebastian Nagel [mailto:[hidden email]]
> >> Sent: 24 October 2017 12:41
> >> To: [hidden email]
> >> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin
> >>
> >> Hi Yossi,
> >>
> >> why not port it to use
> >>
> >>
> http://tika.apache.org/1.16/api/org/apache/tika/language/detect/LanguageDe
> >> tector.html
> >>
> >> The upgrade to Tika 1.16 is already in progress (NUTCH-2439).
> >>
> >> Sebastian
> >>
> >> On 10/24/2017 11:26 AM, Yossi Tamari wrote:
> >>> Hi
> >>>
> >>>
> >>>
> >>> The language-identifier plugin uses
> >>> org.apache.tika.language.LanguageIdentifier for extracting the
> >>> language from the document text. There are two issues with that:
> >>>
> >>> 1. LanguageIdentifier is deprecated in Tika.
> >>> 2. It does not support CJK language (and I suspect a lot of other
> >>> languages -
> >>> https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Lan
> >>> guages _and_their_ISO_636_Codes), and it doesn't even fail gracefully
> >>> with them - in my experience Chinese was recognized as Italian.
> >>>
> >>>
> >>>
> >>> Since in Tika LanguageIdentifier was superseded by
> >>> org.apache.tika.language.detect.LanguageDetector, it seems obvious to
> >>> make that change in the plugin as well. However, because the design of
> >>> LanguageDetector is terrible, it makes the implementation not
> >>> reentrant, meaning the full language model would have to be reloaded
> >>> on each call to the detector.
> >>>
> >>>
> >>>
> >>> For my needs, I have modified the plugin to use
> >>> com.optimaize.langdetect.LanguageDetector directly, which is what
> >>> Tika's LanguageDetector uses internally (at least by default). My
> >>> question is whether that is a change that should be made to the official
> plugin.
> >>>
> >>>
> >>>
> >>> Thanks,
> >>>
> >>>                Yossi.
> >>>
> >>>
> >
> >


Reply | Threaded
Open this post in threaded view
|

Re: Usage of Tika LanguageIdentifier in language-identifier plugin

Sebastian Nagel
Hi Yossi,

> does not separate the Detector object, which contains the model and should be reused, from the
> text writer object, which should be request specific.

But shouldn't a call to reset() make it ready for re-use (the detector object, including the writer)?

But I agree that a reentrant function may be easier to integrate. Nutch plugins also need to be
thread-safe, especially parsers and parse filters when running in a multi-threaded parsing fetcher.
Without a reentrant function and without a 100% stateless detector, the only way is to use a
ThreadLocal instance of the detector. At first glance, the optimaize detector seems to be stateless.
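
Something along these lines (just a sketch of the ThreadLocal idea, untested; the models would still be loaded once per parsing thread):

import java.io.IOException;
import java.io.UncheckedIOException;

import org.apache.tika.language.detect.LanguageDetector;

public class ThreadLocalDetectorSketch {

  // One detector (with its models) per parsing thread, so addText()/detect()
  // only ever see text from the thread that owns the instance.
  private static final ThreadLocal<LanguageDetector> DETECTOR =
      ThreadLocal.withInitial(() -> {
        try {
          return LanguageDetector.getDefaultLanguageDetector().loadModels();
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      });

  public static String detect(String text) {
    LanguageDetector detector = DETECTOR.get();
    detector.reset();        // drop the text of the previous document
    detector.addText(text);
    return detector.detect().getLanguage();
  }
}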

> I chose optimaize mainly because Tika did. Using langid instead should be very simple, but the
> fact that the project has not seen a single commit in the last 4 years, and the usage numbers are
> also quite low, gives me pause...

Of course, the maintenance and community around a project is an important factor. CLD2 is also not really
maintained, plus the models are fixed and no code is available to retrain them.

> what I have done locally

In any case, it would be great if you could open an issue on Jira and a pull request on GitHub.
Which way to go can be discussed further there.

Thanks,
Sebastian


RE: Usage of Tika LanguageIdentifier in language-identifier plugin

Markus Jelsma-2
In reply to this post by Yossi Tamari
Hello,

Not sure what the problem is, but buried deep in our parser we also use Optimaize (previously lang-detect). We load the models once, inside a static block, and create a new Detector instance for every record we parse. This is very fast.
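
Roughly like this (not our actual code, just the shape of the pattern with the optimaize API):

import java.io.IOException;
import java.util.List;

import com.google.common.base.Optional;
import com.optimaize.langdetect.LanguageDetector;
import com.optimaize.langdetect.LanguageDetectorBuilder;
import com.optimaize.langdetect.i18n.LdLocale;
import com.optimaize.langdetect.ngram.NgramExtractors;
import com.optimaize.langdetect.profiles.LanguageProfile;
import com.optimaize.langdetect.profiles.LanguageProfileReader;

public class PerRecordDetectorSketch {

  // The expensive part happens once, in a static block: the language profiles
  // are read from the classpath.
  private static final List<LanguageProfile> PROFILES;
  static {
    try {
      PROFILES = new LanguageProfileReader().readAllBuiltIn();
    } catch (IOException e) {
      throw new ExceptionInInitializerError(e);
    }
  }

  // The cheap part happens per record: a fresh detector built from the shared
  // profiles, so no state is shared between records or threads.
  public static String detect(String text) {
    LanguageDetector detector = LanguageDetectorBuilder
        .create(NgramExtractors.standard())
        .withProfiles(PROFILES)
        .build();
    Optional<LdLocale> lang = detector.detect(text);
    return lang.isPresent() ? lang.get().getLanguage() : "unknown";
  }
}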

Regards,
Markus
 
RE: Usage of Tika LanguageIdentifier in language-identifier plugin

Yossi Tamari
Hi Markus,

Can you please explain what you mean by "our parser"? I'm pretty sure the language-identifier plugin is not using Optimaize.

Thanks,
        Yossi.

RE: Usage of Tika LanguageIdentifier in language-identifier plugin

Markus Jelsma-2
In reply to this post by Yossi Tamari
Hello,

Sorry, I didn't say that as a Nutch committer. Our parser at Openindex has Optimaize deep under the hood, and it is fast!

Regards,
Markus
 