KeywordRepeat, stemming, (single term) synonyms and minimum should match (edismax)


Markus Jelsma-2
Hello, apologies for this long-winded e-mail.

Our fields use KeywordRepeat and language-specific filters such as a stemmer; the final filter at query time is SynonymGraph. For those of you wondering when you see the parsed queries below: we do not use RemoveDuplicatesFilter, because of [1].

We use a custom QParser extending edismax and also extend ExtendedSolrQueryParser, so we are able to override newFieldQuery in case we have to. The problem also directly applies to Solr's vanilla edismax. The file synonyms.txt contains the stemmed versions of the original terms.

Consider this example synonym set [bier,brouw] where bier means beer and brouw is the stemmed version of brouwsel (brewage, concoction), and consider these parameters on /select: qf=content_nl&defType=edismax&mm=2<-1 5<-2 6<90%25.
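
To make the mm value above concrete (the %25 is just a URL-encoded %), here is a minimal Python sketch of how a conditional mm spec such as 2<-1 5<-2 6<90% can be evaluated for a given number of optional clauses. It is modelled on the documented mm semantics, not copied from Solr; the function name min_should_match is hypothetical.

```python
def min_should_match(clause_count: int, spec: str) -> int:
    """Evaluate a Solr-style mm spec for a number of optional clauses.

    Simplified sketch: supports plain integers, percentages, negative
    values, and whitespace-separated conditional parts of the form n<value.
    """
    spec = spec.strip()
    result = clause_count

    if "<" in spec:
        # Conditional spec: each part applies only when the clause count
        # is larger than its bound; stop at the first bound that is not exceeded.
        for part in spec.split():
            bound, _, value = part.partition("<")
            if clause_count <= int(bound):
                return result
            result = min_should_match(clause_count, value)
        return result

    if spec.endswith("%"):
        calc = clause_count * int(spec[:-1]) / 100
        # Negative percentages say how many clauses may be missing.
        result = clause_count + int(calc) if calc < 0 else int(calc)
    else:
        calc = int(spec)
        # Negative integers say how many clauses may be missing.
        result = clause_count + calc if calc < 0 else calc

    return max(result, 0)


spec = "2<-1 5<-2 6<90%"
for n in (1, 2, 3, 6, 7):
    print(n, min_should_match(n, spec))
```

Under this model a query with two optional clauses gets a minimum of 2, which corresponds to the ~2 in the parsed queries below.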

The queries q=bier and q=brouw both parse to the following query and give the desired results (notice the missing RemoveDuplicates here):
+(((Synonym(content_nl:bier content_nl:brouw) Synonym(content_nl:bier content_nl:brouw))~2))

However, for q=brouwsel something (partially) unexpected happens:
+(((content_nl:brouwsel Synonym(content_nl:bier content_nl:brouw))~2))

This results in a BooleanQuery where, due to mm=2, both clauses need to match, giving very few matches. Removing KeywordRepeat or setting mm=1 of course fixes the problem, but that is not what we want.
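
Why the second parse is so restrictive can be shown with a toy model of a BooleanQuery with a minimum-should-match count. This is a sketch in plain Python, not Lucene code: each clause is modelled as a set of acceptable terms (a Synonym clause accepts any of its terms), the helper name matches is hypothetical, and the example documents are assumed to be indexed with both the original and the stemmed token.

```python
def matches(doc_terms: set, clauses: list, mm: int) -> bool:
    """Return True if at least mm clauses share a term with the document."""
    satisfied = sum(1 for clause in clauses if clause & doc_terms)
    return satisfied >= mm


# q=bier or q=brouw: two identical synonym clauses.
bier_query = [{"bier", "brouw"}, {"bier", "brouw"}]
# q=brouwsel: the unstemmed keyword plus one synonym clause.
brouwsel_query = [{"brouwsel"}, {"bier", "brouw"}]

# Assumed index contents (illustrative only).
doc_with_bier = {"lekker", "bier"}
doc_with_brouwsel = {"brouwsel", "brouw"}

print(matches(doc_with_bier, bier_query, mm=2))          # True
print(matches(doc_with_bier, brouwsel_query, mm=2))      # False
print(matches(doc_with_brouwsel, brouwsel_query, mm=2))  # True
print(matches(doc_with_bier, brouwsel_query, mm=1))      # True
```

In this model a document that only contains bier satisfies both clauses of the first query, but only one clause of the brouwsel query, so it drops out when mm=2.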

What is also unexpected, and may be related to the problem, is that when checking the analyzer output via the GUI, we see the position incrementing when KeywordRepeat and SynonymGraph are combined. When these filters are not combined, the positions are always 1, as expected. When combined we get this for 'brouw':
term: bier  brouw  bier  brouw
pos:  1     1      2     2

or for 'brouwsel':
term: brouwsel  bier  brouw
pos:  1         2     2
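
The link between these positions and the parsed queries can be illustrated with a simplified model in which every distinct position becomes one optional clause, and several terms at the same position become a single Synonym clause. This is only a sketch of that assumption, not Lucene's QueryBuilder; the function name build_clauses is hypothetical.

```python
from itertools import groupby


def build_clauses(tokens, field="content_nl"):
    """Group (term, position) pairs by position into query clause strings.

    A position with one term becomes a plain term clause; a position with
    several terms becomes a Synonym(...) clause.
    """
    clauses = []
    by_position = sorted(tokens, key=lambda t: t[1])  # stable: keeps term order
    for _, group in groupby(by_position, key=lambda t: t[1]):
        terms = [f"{field}:{term}" for term, _ in group]
        if len(terms) == 1:
            clauses.append(terms[0])
        else:
            clauses.append("Synonym(" + " ".join(terms) + ")")
    return clauses


# Analyzer output as reported above.
brouw = [("bier", 1), ("brouw", 1), ("bier", 2), ("brouw", 2)]
brouwsel = [("brouwsel", 1), ("bier", 2), ("brouw", 2)]

print(" ".join(build_clauses(brouw)))
print(" ".join(build_clauses(brouwsel)))
```

With the reported positions this model yields two clauses for both inputs, which matches the shape of the two parsed queries shown earlier.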

ExtendedSolrQueryParser, and everything underneath, is a complicated piece of code. In the end it extends Lucene's QueryBuilder, but it does not always seem to rely on its results. Edismax, for example, 'resets' minShouldMatch in SolrPluginUtils.setMinShouldMatch(). This is a complicated web of code, I am a bit too deep in this unfamiliar area, and I need help here.

So, my questions are: how do I solve this problem, or how should I approach it? What is the actual problem? How can I get the same stable results for both queries? Does the odd position increment have anything to do with it (it seems Lucene's QueryBuilder does something with it)? What do I need to do?

Many thanks,
Markus

PS: this is on Solr 7.2.1 and 7.5.0.

[1] http://lucene.472066.n3.nabble.com/Multiple-languages-boosting-and-stemming-and-KeywordRepeat-td4389086.html

RE: KeywordRepeat, stemming, (single term) synonyms and minimum should match (edismax)

Markus Jelsma-2
Hello,

Apologies for bothering you all again, but I really need some help in this matter. How can we resolve this issue? Are we dealing with a bug here (then I'll open a ticket), or am I doing something wrong?

Is there anyone who has had the same issue or understands the problem?

Many thanks,
Markus

 
 
-----Original message-----

> From:Markus Jelsma <[hidden email]>
> Sent: Tuesday 13th November 2018 9:52
> To: solr-user <[hidden email]>
> Subject: KeywordRepeat, stemming, (single term) synonyms and minimum should match (edismax)

RE: KeywordRepeat, stemming, (single term) synonyms and minimum should match (edismax)

Markus Jelsma-2
Hello,

I have opened SOLR-13009, which describes the problem. The attached patch contains a unit test that demonstrates it, i.e. the test fails. Any help would be greatly appreciated.

Many thanks,
Markus

https://issues.apache.org/jira/browse/SOLR-13009

 
 
-----Original message-----

> From:Markus Jelsma <[hidden email]>
> Sent: Sunday 18th November 2018 23:21
> To: [hidden email]; solr-user <[hidden email]>
> Subject: RE: KeywordRepeat, stemming, (single term) synonyms and minimum should match (edismax)