[jira] Created: (LUCENE-2503) light/minimal stemming for euro languages


[jira] Created: (LUCENE-2503) light/minimal stemming for euro languages

JIRA jira@apache.org
light/minimal stemming for euro languages
-----------------------------------------

                 Key: LUCENE-2503
                 URL: https://issues.apache.org/jira/browse/LUCENE-2503
             Project: Lucene - Java
          Issue Type: New Feature
          Components: contrib/analyzers
    Affects Versions: 3.1, 4.0
            Reporter: Robert Muir
            Assignee: Robert Muir
            Priority: Minor
             Fix For: 3.1, 4.0


The snowball stemmers are very aggressive and it would be nice if there were lighter alternatives.

Some applications may want to perform less aggressive stemming, for example:
http://www.lucidimagination.com/search/document/5d16391e21ca6faf/plural_only_stemmer

Good, relevance tested algorithms exist and I think we should provide these alternatives.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

[jira] Updated: (LUCENE-2503) light/minimal stemming for euro languages

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-2503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2503:
--------------------------------

    Attachment: LUCENE-2503.patch

Patch attached, not ready for committing. Only some of these stemmers are ready; the others need tests (I intentionally put a fail() placeholder in those to mark them as still untested).

Also, I didn't implement the Finnish one yet, but the patch contains various implementations for 9 euro languages.



[jira] Commented: (LUCENE-2503) light/minimal stemming for euro languages

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879936#action_12879936 ]

Otis Gospodnetic commented on LUCENE-2503:
------------------------------------------

Man are you fast!
Does the English one deal with women/woman and foci/focus type stuff?



[jira] Commented: (LUCENE-2503) light/minimal stemming for euro languages

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879941#action_12879941 ]

Robert Muir commented on LUCENE-2503:
-------------------------------------

bq. Man are you fast!

Not really, I've been working on it for a while, but since someone asked I figured I would create the issue.
Testing isn't done, but English, French, and Portuguese are OK, I think.
The others need a lot of tests and probably have bugs.

bq. Does the English one deal with women/woman and foci/focus type stuff?

Nope, the English one is the Harman "s-stemming" algorithm.

It's very simple:
{noformat}
if final is '-ies' but not '-eies' or '-aies' then
replace '-ies' by '-y', return;
if final is '-es' but not '-aes', '-ees' or '-oes' then
replace '-es' by '-e', return;
if final is '-s' but not '-us' or '-ss' then
remove '-s';
return.
{noformat}
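
As a rough illustration, the rules above might be written in Java as follows. This is only a sketch of a literal reading of the pseudocode: the rules are tried in order, the first one whose condition holds is applied, and a word that hits an exception falls through to the next rule. The class and method names are hypothetical, and the stemmer in the patch may resolve the exception cases differently.

```java
// Sketch of the quoted s-stemming rules. SStemmerSketch and stem() are
// hypothetical names, not the classes from the patch.
final class SStemmerSketch {
    private SStemmerSketch() {}

    /** Applies the first rule whose condition holds, then stops. */
    static String stem(String word) {
        // Rule 1: '-ies' (but not '-eies' or '-aies') becomes '-y'.
        if (word.endsWith("ies") && !word.endsWith("eies") && !word.endsWith("aies")) {
            return word.substring(0, word.length() - 3) + "y";
        }
        // Rule 2: '-es' (but not '-aes', '-ees' or '-oes') becomes '-e'.
        if (word.endsWith("es") && !word.endsWith("aes")
                && !word.endsWith("ees") && !word.endsWith("oes")) {
            return word.substring(0, word.length() - 1);
        }
        // Rule 3: '-s' (but not '-us' or '-ss') is removed.
        if (word.endsWith("s") && !word.endsWith("us") && !word.endsWith("ss")) {
            return word.substring(0, word.length() - 1);
        }
        // No rule applies: irregular plurals are left untouched.
        return word;
    }
}
```

Under this reading, "ponies" becomes "pony" and "horses" becomes "horse", while "focus", "women" and "foci" come back unchanged, which is why irregular plurals need separate handling.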

For special cases like the ones you mentioned (if you want them), I would recommend adding these customizations yourself,
as documented here: http://wiki.apache.org/solr/LanguageAnalysis#Customizing_Stemming

Just make a tab-separated file of word/stem pairs and put a StemmerOverrideFilter(Factory) before the stemmer in the stream.

I think this alone provides a lot of flexibility. If it isn't enough, these stemmers are much simpler to modify, if you wanted to go that route also :)
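
To make the override idea concrete, here is a minimal plain-Java sketch of the behavior, not the actual StemmerOverrideFilter API: word/stem pairs are read from tab-separated lines, and a word found in that table bypasses the stemmer. The class name, constructor, and the fallback stemmer parameter are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical illustration of "overrides first, stemmer second".
// This is not the Lucene/Solr filter, only the idea behind it.
final class StemOverrideSketch {
    private final Map<String, String> overrides = new HashMap<>();
    private final UnaryOperator<String> fallbackStemmer;

    /** Each input line is expected to look like "word<TAB>stem". */
    StemOverrideSketch(Iterable<String> tabSeparatedLines,
                       UnaryOperator<String> fallbackStemmer) {
        this.fallbackStemmer = fallbackStemmer;
        for (String line : tabSeparatedLines) {
            String[] parts = line.split("\t", 2);
            // Lines without a tab, or with an empty word, are ignored.
            if (parts.length == 2 && !parts[0].isEmpty()) {
                overrides.put(parts[0], parts[1]);
            }
        }
    }

    /** Returns the override if one exists, otherwise the stemmer's result. */
    String stem(String word) {
        String override = overrides.get(word);
        return override != null ? override : fallbackStemmer.apply(word);
    }
}
```

With the lines "women<TAB>woman" and "foci<TAB>focus" loaded, those two words map to the listed stems and everything else still goes through whichever stemmer is passed in.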



[jira] Updated: (LUCENE-2503) light/minimal stemming for euro languages

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-2503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2503:
--------------------------------

    Attachment: LUCENE-2503.patch

I updated the patch; I think this is ready to go:

* Added Finnish.
* Created vocabulary tests from the reference implementations (C, Perl, etc.), and found/fixed bugs in every language but en, pt, fr (as promised in my last comment).
* Created a VocabularyAssert JUnit util class, and refactored the existing Snowball, Porter, German, and Russian tests to use it, too.
* Refactored a bunch of utility stuff that was duplicated everywhere, such as endsWith()/delete(), and put it in StemmerUtil.
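
For readers unfamiliar with this kind of helper, the sketch below shows what shared endsWith()/delete() utilities over a char[] buffer could look like. Only the two method names come from the list above; the class name, parameter lists, and return values are assumptions and may not match the StemmerUtil in the patch.

```java
// Hypothetical helpers in the spirit of the shared utilities mentioned above.
// Signatures are assumptions, not copied from the patch.
final class StemmerUtilSketch {
    private StemmerUtilSketch() {}

    /** True if the first len chars of s end with the given suffix. */
    static boolean endsWith(char[] s, int len, String suffix) {
        int suffixLen = suffix.length();
        if (suffixLen > len) {
            return false;
        }
        for (int i = 0; i < suffixLen; i++) {
            if (s[len - suffixLen + i] != suffix.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    /** Removes the char at pos from the first len chars of s; returns the new length. */
    static int delete(char[] s, int pos, int len) {
        if (pos < 0 || pos >= len) {
            throw new IllegalArgumentException("pos out of range: " + pos);
        }
        // Shift the tail one position to the left, overwriting s[pos].
        System.arraycopy(s, pos + 1, s, pos, len - pos - 1);
        return len - 1;
    }
}
```

Working on a char[] plus a length, rather than on String, lets a stemmer edit a term in place without allocating a new object per token.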

To apply, first apply the patch itself, then unzip the zip file containing the vocabulary tests (LUCENE-2503_modules_analysis_testdata.zip) from the modules/analysis/common dir.

If no one objects, I'll commit in a few days.



[jira] Updated: (LUCENE-2503) light/minimal stemming for euro languages

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-2503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2503:
--------------------------------

    Attachment: LUCENE-2503_modules_analysis_testdata.zip

Zip file containing the vocabulary test zip files, relevant to modules/analysis.


[jira] Resolved: (LUCENE-2503) light/minimal stemming for euro languages

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-2503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-2503.
---------------------------------

    Resolution: Fixed

Committed revision 964019 (trunk) / 964034 (3x)
