KStem custom lexicons configuration possible?


Lukáš Vlček
Hi,

Is there any API in the KStem filter for configuring lexicons?

As far as I understand, the original code loads its lexicons from files at startup (see http://lexicalresearch.com/kstem-doc.txt). The author (Robert Krovetz) lists the ability to modify the lexicons among KStem's advantages over other stemmers.

Do people not need it? Would it be a useful addition for the KStem filter to allow custom lexicon configuration in its API?

Regards,
Lukas

Note: Big kudos to all who participated in bringing KStem into Lucene!
Re: KStem custom lexicons configuration possible?

Lukáš Vlček
Maybe I should show where I think custom configuration can be useful. Let me give you two examples:

1) As of now, KStem conflates both "connector" and "connected" to the same term, "connect".
2) Conversely, it does not conflate "transaction" and "transactions" to the same term.

With an option to modify the internal lexicons, I would be able to adapt KStem to work better for specific text corpora.

What do you think?

Regards,
Lukas

On Mon, Jun 20, 2011 at 12:55 PM, Lukáš Vlček <[hidden email]> wrote:
> Is there any API in the KStem filter for configuring lexicons?

Re: KStem custom lexicons configuration possible?

Robert Muir
On Mon, Jun 20, 2011 at 7:19 AM, Lukáš Vlček <[hidden email]> wrote:
> Having an option to modify internal lexicons I would be able to adapt the
> KStem to work better for specific text corpora.
> What do you think?

please use StemmerOverrideFilter for this! it works with all stemmers,
including this one.
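
To illustrate the idea, here is a minimal sketch in plain Java; it models the behaviour and does not use the Lucene classes. An override map is consulted before the stemmer, and a token found in the map bypasses stemming, which is roughly what StemmerOverrideFilter does by marking overridden tokens as keywords. The `toyStem` function is a hypothetical stand-in for a real stemmer (it is not KStem), and the two map entries are simply the examples from earlier in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

final class OverrideBeforeStemming {

    // Hypothetical stand-in for a real stemmer: strips a trailing "ed" or "or".
    // It is NOT KStem; it only mimics the two behaviours reported in the thread.
    static String toyStem(String token) {
        if (token.endsWith("ed") || token.endsWith("or")) {
            return token.substring(0, token.length() - 2);
        }
        return token;
    }

    // Overrides are keyed on the surface form; a hit bypasses the stemmer.
    static String analyze(String token, Map<String, String> overrides,
                          UnaryOperator<String> stemmer) {
        String forced = overrides.get(token);
        return forced != null ? forced : stemmer.apply(token);
    }

    public static void main(String[] args) {
        Map<String, String> overrides = new LinkedHashMap<>();
        overrides.put("connector", "connector");      // keep apart from "connect"
        overrides.put("transactions", "transaction"); // force the singular

        for (String t : new String[] {"connector", "connected", "transactions"}) {
            System.out.println(t + " -> "
                    + analyze(t, overrides, OverrideBeforeStemming::toyStem));
        }
    }
}
```

Because the keys are the words as they appear in the text, the map in this sketch does not depend on what the stemmer would have produced for them.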

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: KStem custom lexicons configuration possible?

Lukáš Vlček
Hi Robert,

This sounds interesting; I will look at it in more detail.

However, I do not think this is really a general solution. If I understand StemmerOverrideFilter correctly (from a quick glance), it relies on you *knowing* the exact term (the key in the map) in advance. In other words, if I wanted to "fix" some term produced by the KStem filter, I would have to know the output of the stemming in advance. This means that if I switch from KStem to Snowball, Porter, or another stemmer, or simply change something else in the filter chain, I am in trouble. Also, if I understand correctly, the original KStem implementation can still receive updates to its lexicons, so once those updates are ported to the Java implementation they can again cause problems for an existing override filter setup.

More generally, is there any reason why lexicons are not configurable in the KStem filter?

Regards,
Lukas

On Mon, Jun 20, 2011 at 1:38 PM, Robert Muir <[hidden email]> wrote:
> please use StemmerOverrideFilter for this! it works with all stemmers,
> including this one.

Re: KStem custom lexicons configuration possible?

Robert Muir
On Mon, Jun 20, 2011 at 8:23 AM, Lukáš Vlček <[hidden email]> wrote:

> Hi Robert,
> This sounds interesting; I will look at it in more detail.
> However, I do not think this is really a general solution. If I understand
> StemmerOverrideFilter correctly (from a quick glance), it relies on you
> *knowing* the exact term (the key in the map) in advance. In other words,
> if I wanted to "fix" some term produced by the KStem filter, I would have
> to know the output of the stemming in advance. This means that if I switch
> from KStem to Snowball, Porter, or another stemmer, or simply change
> something else in the filter chain, I am in trouble. Also, if I understand
> correctly, the original KStem implementation can still receive updates to
> its lexicons, so once those updates are ported to the Java implementation
> they can again cause problems for an existing override filter setup.
> More generally, is there any reason why lexicons are not configurable in
> the KStem filter?

Because we have StemmerOverrideFilter and KeywordMarkerFilter.

Look at the source code of KStem: it uses maps and sets of exceptions,
and this is what these filters provide in a general way
(StemmerOverrideFilter being the map, and KeywordMarkerFilter being
the set).

We added these to work across the board with all Lucene stemmers for
this reason.

To be honest, I don't understand your concerns at all; they make no
sense to me. If we "updated" KStem or any other algorithm, it would
break whatever you are doing either way. A hashmap is a hashmap.
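
A rough plain-Java sketch of the map/set distinction (it models the idea only and is not the Lucene API): a protected-word set leaves a token untouched, in the spirit of KeywordMarkerFilter, while an override map substitutes a chosen stem, in the spirit of StemmerOverrideFilter. The class name, `stripSuffix`, and both toy stemmers are hypothetical and are not real Lucene stemmers; they are there only to show the same exceptions applied in front of two different algorithms.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

final class StemmerExceptions {
    private final Set<String> protectedWords;    // the "set": tokens left unstemmed
    private final Map<String, String> overrides; // the "map": token -> forced stem

    StemmerExceptions(Set<String> protectedWords, Map<String, String> overrides) {
        this.protectedWords = protectedWords;
        this.overrides = overrides;
    }

    // Exceptions are checked first; only unlisted tokens reach the stemmer.
    String apply(String token, UnaryOperator<String> stemmer) {
        if (protectedWords.contains(token)) {
            return token;
        }
        String forced = overrides.get(token);
        return forced != null ? forced : stemmer.apply(token);
    }

    // Hypothetical toy stemmer: removes the first matching suffix, if any.
    static String stripSuffix(String token, String... suffixes) {
        for (String suffix : suffixes) {
            if (token.length() > suffix.length() && token.endsWith(suffix)) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    public static void main(String[] args) {
        StemmerExceptions exceptions = new StemmerExceptions(
                Collections.singleton("connector"),
                Collections.singletonMap("transactions", "transaction"));

        // Two made-up stemmers of different strength.
        UnaryOperator<String> light = t -> stripSuffix(t, "ed", "or");
        UnaryOperator<String> aggressive = t -> stripSuffix(t, "ions", "ed", "or", "s");

        for (String t : new String[] {"connector", "connected", "transactions"}) {
            System.out.println(t + " -> " + exceptions.apply(t, light)
                    + " / " + exceptions.apply(t, aggressive));
        }
    }
}
```

In this sketch the listed words come out the same under either stemmer; only unlisted words are affected by swapping the algorithm.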

Re: KStem custom lexicons configuration possible?

Lukáš Vlček
Hi Robert,

I think the difference between KStem and other stemmers (at least those that I am aware of, like Snowball or Porter) is that KStem is expected to produce real, valid words, so other filtering (for example, synonym expansion) can be applied to the tokens after stemming more easily. I am not sure whether this is the case with the other stemmers available in Lucene.

Also, my impression from reading the original paper by Robert Krovetz was that the ability to fine-tune the lexicons is of practical value. That is why I was expecting the KStem API to support this as well.

Well, maybe a combination of KStem with the override filter (but applied AFTER stemming) would work too in this case :-)
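
A minimal plain-Java sketch of that last idea (a model only, not the Lucene API): the correction map is applied to the stemmer's output, so its keys are whatever the stemmer emits rather than the original words. The class name and `toyStem` are hypothetical; `toyStem` is not KStem and only mimics the two behaviours described earlier in this thread.

```java
import java.util.Collections;
import java.util.Map;

final class OverrideAfterStemming {

    // Hypothetical stand-in for a stemmer: strips a trailing "ed" or "or".
    static String toyStem(String token) {
        if (token.endsWith("ed") || token.endsWith("or")) {
            return token.substring(0, token.length() - 2);
        }
        return token;
    }

    // The fix-up map is keyed on the stemmer's OUTPUT, not on the input token.
    static String analyze(String token, Map<String, String> postStemFixes) {
        String stem = toyStem(token);
        return postStemFixes.getOrDefault(stem, stem);
    }

    public static void main(String[] args) {
        // The toy stemmer leaves "transactions" alone, so that is the key to fix.
        Map<String, String> fixes =
                Collections.singletonMap("transactions", "transaction");

        for (String t : new String[] {"transactions", "connector", "connected"}) {
            System.out.println(t + " -> " + analyze(t, fixes));
        }
    }
}
```

In this sketch a map applied after stemming can merge terms the stemmer kept apart, but it cannot separate "connector" from "connected" once both have become "connect"; that case would still need an exception applied before the stemmer.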

Regards,
Lukas

On Mon, Jun 20, 2011 at 2:32 PM, Robert Muir <[hidden email]> wrote:
> Because we have StemmerOverrideFilter and KeywordMarkerFilter.