[jira] [Commented] (LUCENE-7444) Remove English stopwords default from StandardAnalyzer in Lucene-Core

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-7444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16510955#comment-16510955 ]

ASF subversion and git services commented on LUCENE-7444:
---------------------------------------------------------

Commit 5ae716c412d705570b2dafd423755eb58142212e in lucene-solr's branch refs/heads/master from [~romseygeek]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5ae716c ]

LUCENE-7444: StandardAnalyzer no longer uses English stopwords by default


> Remove English stopwords default from StandardAnalyzer in Lucene-Core
> ---------------------------------------------------------------------
>
>                 Key: LUCENE-7444
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7444
>             Project: Lucene - Core
>          Issue Type: Task
>          Components: core/other, modules/analysis
>    Affects Versions: 6.2
>            Reporter: Uwe Schindler
>            Priority: Major
>             Fix For: master (8.0)
>
>         Attachments: LUCENE-7444.patch
>
>
> Yonik said on LUCENE-7318:
> {quote}
> bq. I think it would make a good default for most Lucene users, and we should graduate it from the analyzers module into core, and make it the default for IndexWriter.
> This "StandardAnalyzer" is specific to English, as it removes English stopwords.
> That seems to be an odd choice now for a few reasons:
> - It was argued in the past (rather vehemently) that Solr should not prefer English in its default "text" field
> - AFAIK, removing stopwords is no longer considered best practice.
> Given that removal of English stopwords is the only thing that really makes this analyzer English-centric (and given the negative impact that can have on other languages), it seems like the stopword filter should be removed from StandardAnalyzer.
> {quote}
> When trying to fix the backwards incompatibility issues in LUCENE-7318, it looks like most of the unrelated code that was moved from the analysis module to core (and had its package names changed!!!! :( ) was related to word list loading, CharArraySets, and superclasses of StopFilter. If we follow Yonik's suggestion, we can revert all those changes. I agree with him: a "universal" analyzer should not have any language-specific stopwords.
> The other thing is LowercaseFilter, but I'd suggest simply adding a clone of it to Lucene core and leaving the analysis module self-contained.
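
To make the concern about other languages concrete, here is a minimal, Lucene-free sketch in plain Java. It is not Lucene's implementation: the class name StopwordSketch, the analyze helper, and the short word list are all hypothetical and exist only for illustration. The word list is a small hand-picked set of common English stopwords, not the set Lucene ships. The sketch lowercases and splits text, then optionally drops stopwords, which is roughly the behavioural difference between the old default and the new one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopwordSketch {
    // Illustrative subset of English stopwords (an assumption for this
    // sketch; it is not loaded from Lucene).
    static final Set<String> ENGLISH_STOP_WORDS = Set.of(
        "a", "an", "and", "be", "in", "is", "it",
        "not", "of", "or", "the", "to", "was", "will");

    // Lowercase the text, split on runs of non-letter/non-digit characters,
    // and drop every token contained in stopWords.
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        String[] parts = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        for (String token : parts) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // German "Was will er" ("What does he want"): "was" and "will"
        // collide with English stopwords and disappear.
        System.out.println(analyze("Was will er", ENGLISH_STOP_WORDS)); // [er]

        // With an empty stop set every token survives.
        System.out.println(analyze("Was will er", Set.of())); // [was, will, er]

        // Even for English, a phrase made only of stopwords vanishes.
        System.out.println(analyze("To be or not to be", ENGLISH_STOP_WORDS)); // []
    }
}
```

With an empty stop set nothing language-specific happens, which matches the "universal" analyzer argued for above. Code that still wants English stopword removal after this change would presumably have to pass a stop set to StandardAnalyzer explicitly rather than rely on the default.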


