Default (no-args) behavior for JapanesePartOfSpeechStopFilterFactory

Michael Froh
I am currently working on migrating a project from an old version of Solr to Elasticsearch, and came across a funny (to me at least) difference in the "default" behavior of JapanesePartOfSpeechStopFilterFactory.

If JapanesePartOfSpeechStopFilterFactory is given empty args, it does nothing: it doesn't load any stop tags, and create() simply returns the TokenStream it was passed, unchanged. (By comparison, the Elasticsearch filter defaults to loading the stop tags shipped in the Kuromoji analyzer JAR.) So, for many years, my project was not using JapanesePartOfSpeechStopFilter when I thought that it was.
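To make the trap concrete, here is a minimal stdlib-only model of that behavior. It is not the Lucene source: the class name, the method names, the `String[]{surface, tag}` token representation, and the English tag values are all made up for illustration, and the `"stoptags.txt"` resource name is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SilentNoOpSketch {

    // Placeholder for the stop tags bundled with the analyzer (values are invented).
    static final Set<String> BUNDLED_STOP_TAGS =
        new HashSet<>(Arrays.asList("particle", "auxiliary-verb"));

    /** Models the legacy factory: stop tags are loaded only if "tags" is given. */
    static Set<String> loadStopTags(Map<String, String> args) {
        String tagsResource = args.get("tags");
        if (tagsResource == null) {
            return null; // nothing configured, so nothing is loaded
        }
        return BUNDLED_STOP_TAGS; // pretend we read the named resource
    }

    /** Models create(): without stop tags, the input comes back untouched. */
    static List<String[]> create(List<String[]> tokens, Set<String> stopTags) {
        if (stopTags == null) {
            return tokens; // the silent no-op
        }
        List<String[]> kept = new ArrayList<>();
        for (String[] token : tokens) {
            if (!stopTags.contains(token[1])) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] unused) {
        // Each token is {surface form, part-of-speech tag}.
        List<String[]> tokens = Arrays.asList(
            new String[] {"neko", "noun"},
            new String[] {"ga", "particle"},
            new String[] {"iru", "verb"});

        Map<String, String> emptyArgs = new HashMap<>();
        Map<String, String> withTags = new HashMap<>();
        withTags.put("tags", "stoptags.txt"); // assumed resource name

        System.out.println(create(tokens, loadStopTags(emptyArgs)).size()); // prints 3
        System.out.println(create(tokens, loadStopTags(withTags)).size());  // prints 2
    }
}
```

With empty args nothing is filtered and nothing is logged, which is why the misconfiguration can go unnoticed.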

I would like to create an issue and submit a patch, in case other users out there are failing to use the filter factory correctly, but I'm not sure what the best approach is, between:

1. If someone doesn't specify the tags argument, then throw an exception (because the user probably doesn't know what they're doing).
2. If someone doesn't specify the tags argument, then load the default stop tags (like JapaneseAnalyzer does).

I would lean more toward 1, to avoid a silent change in behavior.
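For comparison, the two options could be sketched roughly as follows. This is again a simplified stdlib-only model rather than a patch: `resolveStrict`, `resolveWithDefaults`, `loadTags`, and the tag values are hypothetical names chosen for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class MissingTagsOptionsSketch {

    // Placeholder for the default stop tags shipped with the analyzer (values are invented).
    static final Set<String> DEFAULT_STOP_TAGS =
        Collections.unmodifiableSet(new HashSet<>(Arrays.asList("particle", "auxiliary-verb")));

    /** Hypothetical loader; a real one would read the named resource. */
    static Set<String> loadTags(String resourceName) {
        return Collections.singleton("tag-from-" + resourceName);
    }

    /** Option 1: reject a configuration that omits "tags". */
    static Set<String> resolveStrict(Map<String, String> args) {
        String tags = args.get("tags");
        if (tags == null) {
            throw new IllegalArgumentException("Missing required argument: tags");
        }
        return loadTags(tags);
    }

    /** Option 2: fall back to the default stop tags when "tags" is omitted. */
    static Set<String> resolveWithDefaults(Map<String, String> args) {
        String tags = args.get("tags");
        return tags == null ? DEFAULT_STOP_TAGS : loadTags(tags);
    }

    public static void main(String[] unused) {
        Map<String, String> noArgs = new HashMap<>();

        // Option 2 quietly changes what an existing no-args config does.
        System.out.println(resolveWithDefaults(noArgs).equals(DEFAULT_STOP_TAGS)); // prints true

        // Option 1 fails loudly instead.
        try {
            resolveStrict(noArgs);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The trade-off is visible in the two methods: option 1 breaks existing no-args configurations at startup, while option 2 keeps them loading but changes which tokens they emit.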

Re: Default (no-args) behavior for JapanesePartOfSpeechStopFilterFactory

Michael McCandless-2
+1 to make this less trappy.

It looks like KoreanPartOfSpeechStopFilterFactory will fall back to default stop tags if no args are provided. I think we should indeed make JapanesePartOfSpeechStopFilterFactory consistent.

Maybe we fix this only in the next major release (9.0), add an entry to MIGRATE.txt explaining that, and go with option 2? And possibly option 1 for 8.x releases? (Or maybe don't fix it in 8.x releases... not sure.)

(In reply to Michael Froh, Fri, Oct 2, 2020 at 12:10 PM)

Re: Default (no-args) behavior for JapanesePartOfSpeechStopFilterFactory

Michael Froh
Thanks!

I created an issue (https://issues.apache.org/jira/browse/LUCENE-9567) and PR (https://github.com/apache/lucene-solr/pull/1961), and followed your suggestion of using the default stop tags and modifying MIGRATE.md.

Given that the "do nothing" behavior has been around for years, I don't see much need to change it in 8.x (though I'm happy to do that if someone asks).

(In reply to Michael McCandless, Fri, Oct 2, 2020 at 9:49 AM)