Japanese Query Unexpectedly Misses


Stephen Lewis Bianamara
Hi SOLR Community,

I have an example of a basic Japanese indexing/recall scenario which I am trying to support, but cannot get to work.

The scenario is: I would like for 日本人 (Japanese person) to be matched by either 日本 (Japan) or 人 (person). Currently, I am not seeing this work. My Japanese text field has the tokenizer
<tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
What is most surprising to me is that I thought this is what mode="search" was made for. From the docs, I see:
Use search mode to get a noun-decompounding effect useful for search. search mode improves segmentation for search at the expense of part-of-speech accuracy

I analyzed the breakdown, and I can see that the tokenizer is not generating three tokens (one for Japan, one for person, and one for Japanese person) as I would have expected. Interestingly, the tokenizer does recognize that 日本人 is a compound noun, so it seems it should decompound it (see image below).

Can you help me figure out if my configuration is incorrect, or if there is some way to fix this scenario?

Thanks!
Stephen

[image: image.png]

Re: Japanese Query Unexpectedly Misses

Yasufumi Mizoguchi
Hi,

There are two solutions as far as I know.

1. Use the userDictionary attribute
This is a common and safe way, I think.
Add the userDictionary attribute to your tokenizer configuration and define
the userDictionary file as follows.

Tokenizer:
<tokenizer class="solr.JapaneseTokenizerFactory" mode="search"
userDictionary="lang/userdict_ja.txt"/>

userDictionary file (lang/userdict_ja.txt in the above setting):
日本人,日本 人,ニッポン ジン,カスタム名詞

This gives you the result you want.
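For reference, each line of userdict_ja.txt has four comma-separated fields: the surface form, the space-separated segmentation, the space-separated readings, and a part-of-speech tag. Here is a small sketch that checks an entry is well-formed (the helper name parse_userdict_entry is hypothetical, not part of Solr or Kuromoji):

```python
def parse_userdict_entry(line: str):
    """Split a Kuromoji user-dictionary line into its four fields and
    sanity-check that the segmentation matches the surface form."""
    surface, segmentation, readings, pos = [f.strip() for f in line.split(",")]
    segments = segmentation.split()
    reading_list = readings.split()
    # Each segment needs a matching reading, and the segments should
    # concatenate back to the surface form.
    if len(segments) != len(reading_list):
        raise ValueError("segment/reading count mismatch")
    if "".join(segments) != surface:
        raise ValueError("segments do not rebuild the surface form")
    return surface, segments, reading_list, pos

surface, segments, readings, pos = parse_userdict_entry(
    "日本人,日本 人,ニッポン ジン,カスタム名詞"
)
# segments is ["日本", "人"]; readings is ["ニッポン", "ジン"]
```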

However, "カスタム名詞" (customized noun) might not be an appropriate part of speech
for your service, so I think you should change "カスタム名詞" to another part of speech.
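For completeness, the tokenizer line above sits inside a field type definition in the schema. This is only a sketch; the field type name text_ja and the filter chain are assumptions, not necessarily your actual schema:

```xml
<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- search mode plus the user dictionary from step 1 -->
    <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"
               userDictionary="lang/userdict_ja.txt"/>
    <filter class="solr.JapaneseBaseFormFilterFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```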


2. Use the nBest attribute
If you are using Solr 6.0 or higher (maybe...), this is worth trying.
Add the nBestExamples attribute to the tokenizer configuration as follows.

Tokenizer:
<tokenizer class="solr.JapaneseTokenizerFactory" mode="search"
nBestExamples="日本人-日本/日本人-日本人"/>

When tokenizing a sentence, JapaneseTokenizer considers various candidate
segmentations and calculates a cost for each, then returns the result with
the lowest cost. With nBest, JapaneseTokenizer returns the lowest-cost
result plus some other candidates.
However, this can affect not only the case you want to solve,
but also others.
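As a related option, the tokenizer factory also accepts an nBestCost attribute (in recent Lucene/Solr versions) that sets the extra cost budget directly instead of deriving it from examples. The value below is only an illustration, not a tuned recommendation:

```xml
<tokenizer class="solr.JapaneseTokenizerFactory" mode="search"
           nBestCost="2000"/>
```

Larger values admit more alternative segmentations, so start small and check the results in the Analysis screen.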

Also, both ways require you to re-index all documents that use the Japanese
field type.


Thanks,
Yasufumi

On Fri, Oct 18, 2019 at 2:44, Stephen Lewis Bianamara <[hidden email]> wrote:


Re: Japanese Query Unexpectedly Misses

Stephen Lewis Bianamara
Thank you Yasufumi!

It looks like the userdict_ja.txt could be a good way for us to go.

I wonder, though, if there is a more generic solution to this problem? E.g.,
has anyone done research into a list of commonly desired
decompoundings which the Kuromoji statistics miss? I tried searching online
for a comprehensive userdict_ja.txt or, more generally, a list of common
Japanese decompoundings, but wasn't able to find one. Do you have any
resources you know of which can help find generic solutions for commonly
desired decompoundings?

Thanks,
Stephen

On Fri, Oct 18, 2019 at 3:31 AM Yasufumi Mizoguchi <[hidden email]>
wrote:
