Hunspell stemmer generates multiple tokens

Luca Cavanna
Hi,
I just noticed that the HunspellStemmer outputs more than one token: the
original word plus the stems, as far as I understand.

This is not quite what I would expect, and it becomes tricky especially at
query time. For instance, when Elasticsearch queries a stemmed field, it
generates a boolean query containing multiple clauses (one for each token
produced by the stemmer) instead of a single clause with the stem that we
expect to find in the index (assuming we indexed with stemming, of course).
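
To make the query-time effect concrete, here is a minimal sketch in plain
Python. It is not Lucene or Elasticsearch code: the stem table, both stemmer
functions, and build_query are hypothetical stand-ins, and the multi-token
stemmer simply assumes the behaviour described above (original word first,
then its stems).

```python
# Toy stem table; a real Hunspell dictionary is far richer than this.
TOY_STEMS = {"lights": ["light"]}


def multi_token_stemmer(word):
    """Emit the original word followed by its stems (assumed behaviour)."""
    tokens = [word]
    for stem in TOY_STEMS.get(word, []):
        if stem not in tokens:
            tokens.append(stem)
    return tokens


def single_token_stemmer(word):
    """Emit exactly one token per input word."""
    return [TOY_STEMS.get(word, [word])[0]]


def build_query(field, tokens):
    """One term clause per token; several tokens become an OR-ed boolean query."""
    clauses = [f"{field}:{token}" for token in tokens]
    if len(clauses) == 1:
        return clauses[0]
    return "(" + " OR ".join(clauses) + ")"


multi_query = build_query("body", multi_token_stemmer("lights"))
single_query = build_query("body", single_token_stemmer("lights"))
print(multi_query)   # (body:lights OR body:light)
print(single_query)  # body:light
```

With the multi-token stemmer the query carries one clause per emitted token,
while the single-token variant yields only the stem clause.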

I would like to know if you think this is the correct behaviour and if this
is something you are aware of. If I look at snowball for example, I see
that only one token is generated.


Thanks,
Luca
Re: Hunspell stemmer generates multiple tokens

Oren Bochman
Emitting multiple tokens seems to be the more flexible contract.

You might want to match just the stem, both the exact token and the stemmed token, or just the exact term. So putting both in the index may be expedient, depending on the language.

Also, there are a number of common situations where document text can be stemmed more accurately than query text. In such cases you might want to boost the stemmed token adaptively.
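
As a rough illustration of keeping both forms in the index, here is a toy sketch in plain Python. It is not Lucene code: the stem table, analyze, build_index, search, and the boost values are all hypothetical, and the boosts are fixed constants rather than adaptive.

```python
from collections import defaultdict

# Toy stem table, purely illustrative.
TOY_STEMS = {"running": "run", "runs": "run"}


def analyze(word):
    """Return (token, kind) pairs: the exact form, plus its stem when they differ."""
    word = word.lower()
    stem = TOY_STEMS.get(word, word)
    pairs = [(word, "exact")]
    if stem != word:
        pairs.append((stem, "stem"))
    return pairs


def build_index(docs):
    """Map every emitted token (exact or stem) to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            for token, _kind in analyze(word):
                index[token].add(doc_id)
    return index


def search(index, word, exact_boost=2.0, stem_boost=1.0):
    """Score documents, weighting exact-token matches above stem matches."""
    scores = defaultdict(float)
    for token, kind in analyze(word):
        boost = exact_boost if kind == "exact" else stem_boost
        for doc_id in index.get(token, ()):
            scores[doc_id] += boost
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


docs = {1: "she runs daily", 2: "running shoes", 3: "a long run"}
index = build_index(docs)
results = search(index, "running")
print(results)  # [(2, 3.0), (1, 1.0), (3, 1.0)]
```

Document 2 ranks first because it matches both the exact token and the stem, while the other two match on the stem only. Swapping the two boost values would favour the stemmed token instead.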

On Jun 7, 2013, at 16:16, Luca Cavanna <[hidden email]> wrote:

> Hi,
> I just noticed that the HunspellStemmer outputs more than one tokens, the
> original word plus the stems as far as I understood.
>
> This is not quite what I would expect and becomes tricky especially at
> query time. Using for instance elasticsearch to query a stemmed field, a
> boolean query would be generated, containing multiple clauses (one for each
> token generated by the stemmer) instead of just a clause with the stem that
> we expect to find in the index (if we indexed using stemming of course).
>
> I would like to know if you think this is the correct behaviour and if this
> is something you are aware of. If I look at snowball for example, I see
> that only one token is generated.
>
>
> Thanks,
> Luca