[jira] [Commented] (LUCENE-8876) EnglishMinimalStemmer does not implement s-stemmer paper correctly?


Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-8876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876507#comment-16876507 ]

Mark Harwood commented on LUCENE-8876:

I reached out to the paper's author, Donna Harman, a while ago, and she just replied as follows:
{quote}It has been a very long time since I have thought about S-stemmers. But looking at your examples of bees and employees, it seems to me that rule 3 is the correct one because rule 2 would be prevented from firing.{quote}

Given her assertion that rule 3 should apply to "bees", it looks like this would make rule 2 entirely redundant.

> EnglishMinimalStemmer does not implement s-stemmer paper correctly?
> -------------------------------------------------------------------
>                 Key: LUCENE-8876
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8876
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>            Reporter: Mark Harwood
>            Priority: Minor
> The EnglishMinimalStemmer fails to stem words with an {{ees}} suffix, such as bees, trees and employees.
> The [original paper|[http://citeseerx.ist.psu.edu/viewdoc/download?doi=]] has this table of rules:
> !https://user-images.githubusercontent.com/170925/59616454-5dc7d580-911c-11e9-80b0-c7a59458c5a7.png!
> The notes accompanying the table state:
> {quote}"the first applicable rule encountered is the only one used"
> {quote}
> For the {{ees}} and {{oes}} suffixes I think EnglishMinimalStemmer misinterpreted the rule logic and consequently {{bees != bee}} and {{tomatoes != tomato}}. The {{oes}} and {{ees}} suffixes are left intact.
> "The first applicable rule" for {{ees}} could be interpreted as rule 2 or rule 3 in the table, depending on whether you take {{applicable}} to mean "the THEN part of the rule has fired" or just that the suffix was referenced in the rule. EnglishMinimalStemmer has assumed the latter, and I think it should be the former. We should fall through into rule 3 for {{ees}} and {{oes}} (remove any trailing S). That's certainly the conclusion I came to independently when testing on real data.
> There are some additional changes I'd like to see in a plural stemmer but I won't list them here - the focus should be making the code here match the original paper it references.
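To make the two readings of "first applicable rule" concrete, here is a minimal Java sketch. It is not Lucene's EnglishMinimalStemmer code. The rule table in the issue is only available as an image, so the exact conditions below are an assumption based on the commonly cited form of the S-stemmer rules (rule 1: "ies" to "y" unless "eies"/"aies"; rule 2: "es" to "e" unless "aes"/"ees"/"oes"; rule 3: drop "s" unless "us"/"ss"). The class and method names are hypothetical.

```java
// Hypothetical sketch of the two interpretations discussed above.
// Rule conditions are assumed, not copied from the paper or from Lucene.
final class SStemmerSketch {

    // Rule 1 (assumed): "ies" -> "y", unless the word ends in "eies" or "aies".
    static boolean rule1Fires(String w) {
        return w.endsWith("ies") && !w.endsWith("eies") && !w.endsWith("aies");
    }

    // Rule 2 (assumed): "es" -> "e", unless the word ends in "aes", "ees" or "oes".
    static boolean rule2Fires(String w) {
        return w.endsWith("es") && !w.endsWith("aes")
                && !w.endsWith("ees") && !w.endsWith("oes");
    }

    // Rule 3 (assumed): drop a trailing "s", unless the word ends in "us" or "ss".
    static boolean rule3Fires(String w) {
        return w.endsWith("s") && !w.endsWith("us") && !w.endsWith("ss");
    }

    static String dropLast(String w) {
        return w.substring(0, w.length() - 1);
    }

    // Reading A: a rule is "applicable" only if its THEN part fires,
    // so a blocked rule 2 falls through to rule 3.
    static String stemFallThrough(String w) {
        if (rule1Fires(w)) return w.substring(0, w.length() - 3) + "y";
        if (rule2Fires(w)) return dropLast(w); // "es" -> "e"
        if (rule3Fires(w)) return dropLast(w);
        return w;
    }

    // Reading B: a rule is "applicable" as soon as its suffix matches,
    // so an exception in rule 1 or rule 2 also stops the later rules.
    static String stemSuffixMatchStops(String w) {
        if (w.endsWith("ies")) {
            return rule1Fires(w) ? w.substring(0, w.length() - 3) + "y" : w;
        }
        if (w.endsWith("es")) {
            return rule2Fires(w) ? dropLast(w) : w;
        }
        if (rule3Fires(w)) return dropLast(w);
        return w;
    }

    // Reading A with rule 2 deleted, to check whether rule 2 changes anything.
    static String stemWithoutRule2(String w) {
        if (rule1Fires(w)) return w.substring(0, w.length() - 3) + "y";
        if (rule3Fires(w)) return dropLast(w);
        return w;
    }

    public static void main(String[] args) {
        String[] words = {"bees", "trees", "employees", "horses", "ponies", "glass", "bus"};
        for (String w : words) {
            System.out.println(w + ": fall-through=" + stemFallThrough(w)
                    + ", suffix-match-stops=" + stemSuffixMatchStops(w)
                    + ", without-rule-2=" + stemWithoutRule2(w));
        }
    }
}
```

Under these assumed rules, reading A turns "bees" into "bee" while reading B leaves it as "bees", which matches the behaviour the issue describes. Any word on which rule 2 fires ends in "es", so rule 3 would strip the same trailing "s"; that is why dropping rule 2 does not change the output of reading A in this sketch. Note that reading A only removes the trailing "s" from {{oes}} words, so "tomatoes" becomes "tomatoe" here rather than "tomato".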
