Chinese and Korean being detected as Lithuanian by LanguageDetector

Mike Thomsen
I wrote a Groovy script (attached) to test a bunch of languages against the LanguageDetector class, and these were the results:

input detected
ar    fa
de    de
en    en
es    es
fr    fr
gr    el
it    it
ko    lt
nl    nl
ru    ru
zh    lt

Is there something that needs to be done to enable the detection of Asian languages or should I file this as a bug report?

Thanks,

Mike
Re: Chinese and Korean being detected as Lithuanian by LanguageDetector

kkrugler
Hi Mike,

I don’t see the script - did it get stripped?

Below is a list of the language profiles that I believe are bundled with the language-detector jar that’s pulled in by Tika.

I don’t see “gr” - note that Greek is “el”.

And there’s “zh-CN” and “zh-TW” vs. just “zh”, but otherwise I’d expect detection to work for your test cases.

— Ken

af
an
ar
ast
be
bg
bn
br
ca
cs
cy
da
de
el
en
es
et
eu
fa
fi
fr
ga
gl
gu
he
hi
hr
ht
hu
id
is
it
ja
km
kn
ko
lt
lv
mk
ml
mr
ms
mt
ne
nl
no
oc
pa
pl
pt
ro
ru
sk
sl
so
sq
sr
sv
sw
ta
te
th
tl
tr
uk
ur
vi
yi
zh-CN
zh-TW
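As a quick cross-check, the labels from the first post can be compared against the profile codes listed above to see which ones have no exact match. This is only a sketch under assumptions: it takes the list above at face value, the class and method names (`ProfileCheck`, `labelsWithoutProfile`) are made up for illustration, and nothing here calls Tika or the language-detector library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Tika: plain set membership on language codes.
final class ProfileCheck {
    // Profile codes copied from the list above.
    static final Set<String> PROFILES = new LinkedHashSet<>(Arrays.asList(
        "af", "an", "ar", "ast", "be", "bg", "bn", "br", "ca", "cs",
        "cy", "da", "de", "el", "en", "es", "et", "eu", "fa", "fi",
        "fr", "ga", "gl", "gu", "he", "hi", "hr", "ht", "hu", "id",
        "is", "it", "ja", "km", "kn", "ko", "lt", "lv", "mk", "ml",
        "mr", "ms", "mt", "ne", "nl", "no", "oc", "pa", "pl", "pt",
        "ro", "ru", "sk", "sl", "so", "sq", "sr", "sv", "sw", "ta",
        "te", "th", "tl", "tr", "uk", "ur", "vi", "yi", "zh-CN", "zh-TW"));

    // Returns the labels that have no profile with exactly that code.
    static List<String> labelsWithoutProfile(List<String> labels) {
        List<String> missing = new ArrayList<>();
        for (String label : labels) {
            if (!PROFILES.contains(label)) {
                missing.add(label);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Labels used in the test results from the first post.
        List<String> tested = Arrays.asList(
            "ar", "de", "en", "es", "fr", "gr", "it", "ko", "nl", "ru", "zh");
        System.out.println(labelsWithoutProfile(tested)); // prints [gr, zh]
    }
}
```

Only "gr" and "zh" lack an exact profile code, which matches the two naming differences noted above. "ar" and "ko" both have profiles, so a missing profile would not explain those two results.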


> On Jan 17, 2019, at 9:39 AM, Mike Thomsen <[hidden email]> wrote:
>
> Is there something that needs to be done to enable the detection of Asian languages or should I file this as a bug report?

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra

Re: Chinese and Korean being detected as Lithuanian by LanguageDetector

Mike Thomsen
Ken,

Here's a Gist version of it:

https://gist.github.com/MikeThomsen/84abb89aab903a8b21d64af532cc369b

Thanks,

Mike

On Thu, Jan 17, 2019 at 2:25 PM Ken Krugler <[hidden email]>
wrote:

> Hi Mike,
>
> I don’t see the script - did it get stripped?

Re: Chinese and Korean being detected as Lithuanian by LanguageDetector

kkrugler
Hi Mike,

So the issues are Arabic, Korean and Chinese, right?
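That reading of the results can be reproduced with a small sketch that compares each test label with the detected code after normalizing the two naming differences mentioned earlier in the thread ("gr" vs. "el", and "zh" vs. "zh-CN"/"zh-TW"). The class and method names (`DetectionReport`, `canonical`, `mismatches`) are hypothetical, the alias rules are assumptions made for this comparison, and the code works only on the posted results rather than calling Tika.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Tika: compares label/detected pairs.
final class DetectionReport {
    // Normalizes a code so equivalent labels compare equal (assumed aliases).
    static String canonical(String code) {
        if (code.equals("gr")) {
            return "el"; // the test used "gr" for Greek; the profile code is "el"
        }
        if (code.equals("zh-CN") || code.equals("zh-TW")) {
            return "zh"; // treat both Chinese profiles as plain "zh"
        }
        return code;
    }

    // Returns the test labels whose detected code differs after normalization.
    static List<String> mismatches(Map<String, String> results) {
        List<String> bad = new ArrayList<>();
        for (Map.Entry<String, String> entry : results.entrySet()) {
            String expected = canonical(entry.getKey());
            String detected = canonical(entry.getValue());
            if (!expected.equals(detected)) {
                bad.add(entry.getKey());
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // Test label -> detected code, copied from the results in the first post.
        Map<String, String> results = new LinkedHashMap<>();
        results.put("ar", "fa");
        results.put("de", "de");
        results.put("en", "en");
        results.put("es", "es");
        results.put("fr", "fr");
        results.put("gr", "el");
        results.put("it", "it");
        results.put("ko", "lt");
        results.put("nl", "nl");
        results.put("ru", "ru");
        results.put("zh", "lt");
        System.out.println(mismatches(results)); // prints [ar, ko, zh]
    }
}
```

With "gr" treated as an alias for "el", the Greek result counts as correct, leaving Arabic, Korean and Chinese as the three misdetections.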

I’d suggest filing an issue for Tika, so at least we can track it, though likely the issue is with the language-detector project we’re using for detection.

I’m leaving on a trip this evening, but back next week, so will try to look at it then.

Regards,

— Ken


> On Jan 17, 2019, at 1:48 PM, Mike Thomsen <[hidden email]> wrote:
>
> Ken,
>
> Here's a Gist version of it:
>
> https://gist.github.com/MikeThomsen/84abb89aab903a8b21d64af532cc369b
>
> Thanks,
>
> Mike