Limitations of StempelStemmer

Maciej Gawinecki
Hi,

I have just checked out the latest version of Lucene from Git master branch.

I have tried to stem a few words using StempelStemmer for Polish.
However, it looks like it cannot handle some words properly, e.g.

joyce -> ąć
wielce -> ąć
piwko -> ąć
royce -> ąć
pip -> ąć
xyz -> xyz

1. I am surprised it cannot handle Polish words like wielce, piwko and
royce. Is this a limitation of the stemming algorithm, of the data it
was trained on, or something else? If it is the training, that at least
could be improved. How can I improve that behaviour?
2. I am surprised that for non-Polish words it returns "ąć". I would
expect that for words it has not been trained on it would return their
original forms, as happens, for instance, when stemming words like
"xyz".

With kind regards,
Maciej Gawinecki

Here's a minimal example to reproduce the issue:

package org.apache.lucene.analysis;

import java.io.InputStream;
import org.apache.lucene.analysis.stempel.StempelStemmer;

public class Try {

  public static void main(String[] args) throws Exception {
    InputStream stemmerTable = ClassLoader.getSystemClassLoader()
        .getResourceAsStream("org/apache/lucene/analysis/pl/stemmer_20000.tbl");
    StempelStemmer stemmer = new StempelStemmer(stemmerTable);
    String[] words = {"joyce", "wielce", "piwko", "royce", "pip", "xyz"};
    for (String word : words) {
      System.out.println(String.format("%s -> %s", word,
          stemmer.stem("piwko")));
    }
  }
}

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Limitations of StempelStemmer

Dawid Weiss-2
Hi Maciej,

Stempel uses a pretrained heuristic. You can find a longer description
at [1] and [2]. The specific reason for the problems you mentioned may
be the smaller training dictionary used for the version embedded in
Lucene; I honestly don't know. If you need exact stemming or
lemmatization, then take a look at dictionary methods -- Morfologik or
the tools listed at [3].

Dawid

[1] http://www.getopt.org/stempel/
[2] https://lucene.apache.org/core/8_2_0/analyzers-stempel/index.html
[3] http://zil.ipipan.waw.pl/

Re: Limitations of StempelStemmer

Martin Grigorov
In reply to this post by Maciej Gawinecki
Hi,

On Tue, Sep 10, 2019, 22:31 Maciej Gawinecki <[hidden email]> wrote:

>     for (String word : words) {
>       System.out.println(String.format("%s -> %s", word,
> stemmer.stem("piwko")));
>

You always pass "piwko" for stemming.

Re: Limitations of StempelStemmer

Dawid Weiss-2
> You always pass "piwko" for stemming.

I'm afraid that's not correct? You should *never* pass on piwko when
stemming. :)

Dawid

Re: Limitations of StempelStemmer

Maciej Gawinecki
In reply to this post by Martin Grigorov
> You always pass "piwko" for stemming.

Right, I spotted my mistake right after posting my question, but I
didn't want to spam the list with too many posts (there's no way to
edit an already posted question on a mailing list :-)). Anyway, the
issue still persists. Here's the corrected version to reproduce it:

import java.io.InputStream;
import org.apache.lucene.analysis.stempel.StempelStemmer;

public class Try {

  public static void main(String[] args) throws Exception {
    // Load the Polish stemmer table from the classpath.
    InputStream stemmerTable = ClassLoader.getSystemClassLoader()
        .getResourceAsStream("org/apache/lucene/analysis/pl/stemmer_20000.tbl");
    StempelStemmer stemmer = new StempelStemmer(stemmerTable);
    String[] words = {"joyce", "wielce", "piwko", "royce", "pip", "xyz"};
    for (String word : words) {
      // Stem each word in turn (the first version always stemmed "piwko").
      System.out.println(String.format("%s -> %s", word,
          stemmer.stem(word)));
    }
  }
}
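
Until the stemmer itself behaves better, one possible workaround for
unexpected stems such as "ąć" is a small guard around the stemmer call.
The sketch below is only an illustration: GuardedStemmer and
stemOrOriginal are hypothetical names, not part of Lucene, and the
"same first character" check is an assumed heuristic, not something
Stempel provides. The stemmer is passed in as a plain function so the
example runs without Lucene on the classpath; with Lucene it could
presumably be supplied as w -> stemmer.stem(w).

```java
import java.util.function.Function;

/** Hypothetical guard around a stemmer; not part of Lucene. */
final class GuardedStemmer {

  /**
   * Returns the stem produced by rawStemmer, or the original word when the
   * result is null, empty, or does not start with the same character as the
   * word (an assumed plausibility heuristic).
   */
  static String stemOrOriginal(String word, Function<String, CharSequence> rawStemmer) {
    CharSequence stem = rawStemmer.apply(word);
    if (stem == null || stem.length() == 0) {
      return word; // the stemmer produced nothing usable: keep the input
    }
    if (word.isEmpty() || stem.charAt(0) != word.charAt(0)) {
      return word; // stem looks unrelated to the input: keep the input
    }
    return stem.toString();
  }

  public static void main(String[] args) {
    // Stand-in stemmer (an assumption for this demo): it returns an
    // unrelated two-letter stem for everything except "xyz".
    Function<String, CharSequence> fake = w -> w.equals("xyz") ? "xyz" : "\u0105\u0107";
    for (String w : new String[] {"joyce", "piwko", "xyz"}) {
      System.out.println(w + " -> " + stemOrOriginal(w, fake));
    }
  }
}
```

The heuristic is deliberately crude: it would also reject a correct stem
whose first letter legitimately differs from the inflected form, so it
hides the symptom rather than fixing the trained table.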

Re: Limitations of StempelStemmer

Maciej Gawinecki
In reply to this post by Dawid Weiss-2
>
> > You always pass "piwko" for stemming.
>
> I'm afraid that's not correct? You should *never* pass on piwko when
> stemming. :)

Haha, right, one should not mix both.

Anyway, thank you for your original suggestions. Training it with a
bigger corpus of inflected forms seems like a great idea. We now have
many more corpora available (e.g., the SGJP [1] and Polimorf [2]
morphological dictionaries from Morfeusz) than Andrzej Białecki, the
original author, had when training the stemmer. I might give it a try,
I just need to find some spare time :-)

[1]: http://download.sgjp.pl/morfeusz/20190925/sgjp-20190925.tab.gz
[2]: http://download.sgjp.pl/morfeusz/20190925/polimorf-20190925.tab.gz
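
A first step towards such retraining would probably be to regroup the
dictionary by lemma. The sketch below rests on two assumptions that are
not verified here: that each dictionary line is tab-separated with the
inflected form in the first column and the lemma in the second, and that
the training input wants one whitespace-separated line per lemma
followed by its forms. LemmaGrouper and group are hypothetical names
used only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper; the input and output layouts are assumptions. */
final class LemmaGrouper {

  /** Groups inflected forms by lemma and renders "lemma form1 form2 ..." lines. */
  static List<String> group(List<String> dictionaryLines) {
    Map<String, Set<String>> formsByLemma = new LinkedHashMap<>();
    for (String line : dictionaryLines) {
      if (line.trim().isEmpty() || line.startsWith("#")) {
        continue; // skip blank lines and (assumed) comment lines
      }
      String[] cols = line.split("\t");
      if (cols.length < 2) {
        continue; // not a form/lemma row
      }
      String form = cols[0].trim();
      String lemma = cols[1].trim();
      if (form.isEmpty() || lemma.isEmpty() || form.contains(" ") || lemma.contains(" ")) {
        continue; // multi-word entries would break a whitespace-separated output
      }
      Set<String> forms = formsByLemma.computeIfAbsent(lemma, k -> new LinkedHashSet<>());
      if (!form.equals(lemma)) {
        forms.add(form); // the lemma itself already leads the output line
      }
    }
    List<String> out = new ArrayList<>();
    for (Map.Entry<String, Set<String>> e : formsByLemma.entrySet()) {
      StringBuilder sb = new StringBuilder(e.getKey());
      for (String f : e.getValue()) {
        sb.append(' ').append(f);
      }
      out.add(sb.toString());
    }
    return out;
  }

  public static void main(String[] args) {
    // Tiny hand-written sample in the assumed layout, not real dictionary data.
    List<String> sample = List.of("piwka\tpiwko", "piwkiem\tpiwko", "piwko\tpiwko");
    group(sample).forEach(System.out::println);
  }
}
```

Real dictionaries would likely need extra cleanup (for example lemma
annotations or proper-name filtering) before the output is usable for
training.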
