Weird behavioural differences between pf in dismax and edismax

Weird behavioural differences between pf in dismax and edismax

Sambhav Kothari
Hello,

I ran into some odd behaviour with the dismax and edismax query parsers.
Dismax includes pf boosts when the query consists of just a single word,
whereas edismax does not.

As a result, a dismax handler and an edismax handler with the same set of
defaults return different results for single-word queries (e.g. "Hello")
but the same results for multi-word queries (e.g. "Hello World").
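
One way to reproduce the comparison is to send the same query to both
parsers with debugQuery enabled and compare the parsed queries. The
following is only a minimal sketch: the field names, the boost and the
/select path are placeholders, not taken from the setup described above.

```python
from urllib.parse import urlencode

# Hypothetical handler defaults; the field names and the boost are placeholders.
SHARED_DEFAULTS = {
    "qf": "title body",
    "pf": "title^5",
    "debugQuery": "true",
}


def build_params(query, parser):
    """Return the URL-encoded request parameters for one parser."""
    if parser not in ("dismax", "edismax"):
        raise ValueError("unknown parser: " + parser)
    params = {"q": query, "defType": parser}
    params.update(SHARED_DEFAULTS)
    return urlencode(params)


if __name__ == "__main__":
    # Print the four requests to compare; "/select" is an assumed handler path.
    for query in ("Hello", "Hello World"):
        for parser in ("dismax", "edismax"):
            print("/select?" + build_params(query, parser))
```

Only defType differs between the two requests for a given query, so any
difference in the parsed query in the debug output comes from the parser.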

Is this expected?

Regards,
Sam

Re: Weird behavioural differences between pf in dismax and edismax

Andrea Gazzarini-6
Hi Sam,
I noticed the same behaviour. Looking at the code, it seems to be
expected: the two classes (ExtendedDismaxQParser and DisMaxQParser)
don't have a direct inheritance relationship, and the methods that deal
with the pf parameter are different. Specifically,
DisMaxQParser.getPhraseQuery seems to produce the phrase query
regardless of the number of terms in the query (which matches the
observed behaviour), while ExtendedDismaxQParser seems to take this
into account.
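
The difference described above can be modelled with a toy sketch. This is
not the Solr code: the function names are made up for illustration, and
splitting on whitespace is a simplification of how the parsers build
their clauses.

```python
def split_clauses(user_query):
    """Simplification: treat whitespace-separated words as clauses."""
    return user_query.split()


def dismax_adds_phrase_boost(user_query):
    # Model of the behaviour reported in this thread: dismax builds the
    # pf phrase query whatever the number of terms.
    return len(split_clauses(user_query)) >= 1


def edismax_adds_phrase_boost(user_query):
    # Model of the behaviour reported in this thread: edismax builds the
    # pf phrase query only when there are at least two clauses.
    return len(split_clauses(user_query)) >= 2


if __name__ == "__main__":
    for q in ("Hello", "Hello World"):
        print(q, dismax_adds_phrase_boost(q), edismax_adds_phrase_boost(q))
```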

I agree with you: this results in different behaviour, even for those
single-word queries that produce more than one term (e.g. a pf clause
with q=hello-world and a field in qf whose analyzer uses a
StandardTokenizer or a WordDelimiterFilter).

As for the reason behind the different implementations, I don't know;
maybe someone else here can help you.

Best,
Andrea

On 27/05/18 15:14, Sambhav Kothari wrote:

> Hello,
>
> I experienced a weird behaviour with dismax and edismax query parsers.
> Dismax will include pf boosts when we query something that has just a
> single word, edismax on the other hand will not include pf boosts.
>
> The result is that a dismax and an edismax handler with the same set of
> defaults, return different results for single word queries (eg. "Hello")
> but the same results for multi word queries (eg. "Hello Wold")
>
> Is this expected?
>
> Regards,
> Sam
>


Re: Weird behavioural differences between pf in dismax and edismax

Sambhav Kothari
Hi Andrea,

Yes, on further investigation I found the relevant code:
https://github.com/apache/lucene-solr/blob/e2521b2a8baabdaf43b92192588f51e042d21e97/solr/core/src/java/org/apache/solr/search/ExtendedDismaxQParser.java#L574

I also created a ticket to resolve these differences and to make query
parsing uniform between dismax and edismax, but it was closed because it
needs further discussion on whether the behaviour is intended.

Personally, I find applying phrase boosts to single-token queries very
unintuitive, and IMO the dismax parser should be updated to apply phrase
boosts only to multi-token queries.

You can find further details about the problem here -
https://issues.apache.org/jira/browse/SOLR-12409

Regards,
Sam

On Mon, May 28, 2018 at 12:31 PM, Andrea Gazzarini <[hidden email]>
wrote:

> Hi Sam,
> I noticed the same behaviour. Looking at the code it seems that it is
> expected: the two classes (ExtendedDisMaxQParser and DisMaxQParser) don't
> have a direct inheritance relationships and the methods which deal with the
> PF parameter are different. Specifically, the DismaxQParser.getPhraseQuery
> seems to produce the query phrase regardless the number of terms that
> compose the query (and this matches with the observed behaviour), while the
> ExtendedDismax seems to take in account this aspect .
>
> I agree with you, it results in a different behaviour, even for those
> single-word queries that output more than one terms (e.g. putting a pf
> clause with a q=hello-world and a field in qf which uses a
> StandardTokenizer or a WordDelimiterFilter in the analyzer).
>
> About the reason of such different implementation, I don't know, maybe
> someone else here is able to help you.
>
> Best,
> Andrea

Re: Weird behavioural differences between pf in dismax and edismax

Alessandro Benedetti
In my opinion, given the definitions of the dismax and edismax query parsers,
they should behave the same way for the parameters they have in common.
To be a little extreme, I don't think we need the dismax query parser at
all anymore (in the end, edismax only offers more than dismax does).

Finally, I do believe that even if the query is a single term (before or
after analysis for a pf field), it should still boost the phrase.
A phrase of one word is still a phrase, isn't it?





-----
---------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Weird behavioural differences between pf in dismax and edismax

eaph
I disagree that a one-word phrase is still a phrase. That is the core
difference between the qf and pf clauses: qf collects the terms; pf
boosts the combinations.

For queries where the original query has only a single term, it might be
a moot point, unless the clauses point at different fields or carry
different boosts.

But for multi-term queries, pf (and pf2 and pf3) can be important
differentiators between documents that just happen to contain enough words
from the user's original query and documents that get closer to the user's
meaning. It balances documents that have enough terms to satisfy mm
against documents that have enough terms in one field.
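
To make "boosting the combinations" concrete, here is a rough sketch of
the word runs that pf2 (pairs) and pf3 (triples) would have to work with.
It only illustrates the idea; it is not how edismax builds its queries,
and the helper name is made up.

```python
def word_shingles(terms, size):
    """Return every run of `size` consecutive terms, in order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    # range() is empty when there are fewer terms than `size`,
    # so short queries simply yield no shingles.
    return [" ".join(terms[i:i + size]) for i in range(len(terms) - size + 1)]


if __name__ == "__main__":
    terms = ["apache", "solr", "query", "parser"]
    print(word_shingles(terms, 2))  # candidate phrases for pf2
    print(word_shingles(terms, 3))  # candidate phrases for pf3
    print(word_shingles(["hello"], 2))  # a single term has no pairs
```

With a single-term query there is no pair or triple to boost, which is
why the question in this thread only arises for pf itself.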

Elizabeth Haubert






On Tue, May 29, 2018 at 5:14 AM, Alessandro Benedetti <[hidden email]>
wrote:

> In my opinion, given the definition of dismax and edismax query parsers,
> they
> should behave the same for parameters in common.
> To be a little bit extreme I don't think we need the dismax query parser at
> all anymore ( in the the end edismax  is only offering more than the
> dismax)
>
> Finally, I do believe that even if the query is a single term ( before or
> after the analysis for a PF field) it should anyway boost the phrase.
> A phrase of 1 word is still a phrase, isn't it ?

Re: Weird behavioural differences between pf in dismax and edismax

Alessandro Benedetti
I don't have a hard position on this. It's OK not to build a phrase boost
if the input query is one term and it remains one term after analysis for
one of the pf fields.

But if the term produces multiple tokens after query-time analysis, I do
believe that building a phrase boost is the correct interpretation
(e.g. wi-fi with a query-time analyzer that splits on "-").
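
The distinction between counting terms before and after analysis can be
sketched as follows. The analyzer here is a stand-in written for this
example (lowercase, then split on anything that is not a letter or digit);
it is not a Solr analyzer, and the function names are made up.

```python
import re


def whitespace_clauses(user_query):
    """Terms as the user typed them, before any analysis."""
    return user_query.split()


def toy_analyze(text):
    """Stand-in analyzer: lowercase and split on non-alphanumeric characters."""
    return [token for token in re.split(r"[^0-9a-z]+", text.lower()) if token]


def phrase_boost_after_analysis(user_query):
    # The interpretation proposed above: build a phrase boost whenever
    # analysis yields two or more tokens.
    return len(toy_analyze(user_query)) >= 2


if __name__ == "__main__":
    for q in ("wi-fi", "hello"):
        print(q, whitespace_clauses(q), toy_analyze(q), phrase_boost_after_analysis(q))
```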

Cheers








Re: Weird behavioural differences between pf in dismax and edismax

eaph
That would make sense.
Multi-term synonyms get into a weird case too. Should single-term words
that have multi-term synonyms expand out? Or should multi-term synonyms
that have single-term synonyms contract down and count as only a single
clause for pf2 or pf3?



On Tue, May 29, 2018 at 1:37 PM, Alessandro Benedetti <[hidden email]>
wrote:

> I don't have any hard position on this, It's ok to not build a phrase boost
> if the input query is 1 term and it remains one term after the analysis for
> one of the pf fields.
>
> But if the term produces multiple tokens after query time analysis, I do
> believe that building a phrase boost should be the correct interpretation (
> e.g. wi-fi with a query time analiser which split by - ) .
>
> Cheers

Re: Weird behavioural differences between pf in dismax and edismax

Sambhav Kothari
Wouldn't all of this depend entirely on the tokenizers used? I was talking
about phrases in a multi-token sense.

Regardless, I still think dismax and edismax should behave consistently
for the parameters they have in common (either extend the edismax logic
to dismax or vice versa).

Regards,
Sam

On Tue, May 29, 2018, 23:16 Elizabeth Haubert <[hidden email]> wrote:

> That would make sense.
> Multi-term synonyms get into a weird case too.  Should the single-term
> words that have multi-term synonyms expand out? Or should the multi-term
> synonyms that have single-term synonyms contract down and count as only a
> single clause for pf2 or pf3.

Re: Weird behavioural differences between pf in dismax and edismax

Alessandro Benedetti
A general question for the community:
what can dismax do that edismax cannot?
Is it really necessary to keep both of them, or could dismax be
deprecated?

Cheers




Re: Weird behavioural differences between pf in dismax and edismax

Sambhav Kothari
Hi,

We use dismax as a more basic search endpoint, so that users who are not
familiar with Lucene syntax don't end up using special keywords or
characters that might affect their search queries.
The switch between dismax and edismax is triggered by an advanced GET
parameter.
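
Roughly, the switch looks like the sketch below. The parameter name
"advanced", the accepted values and the function names are illustrative
assumptions, not our actual code.

```python
def choose_parser(request_args):
    """Pick the Solr query parser from the incoming GET parameters."""
    # Assumption: the flag is called "advanced" and is truthy for 1/true/on.
    advanced = request_args.get("advanced", "").lower() in ("1", "true", "on")
    return "edismax" if advanced else "dismax"


def solr_params(request_args):
    """Build the parameters forwarded to Solr for one search request."""
    return {
        "q": request_args.get("q", ""),
        "defType": choose_parser(request_args),
    }


if __name__ == "__main__":
    print(solr_params({"q": "hello world"}))
    print(solr_params({"q": "title:hello", "advanced": "true"}))
```

Basic users get dismax by default, and Lucene syntax is only interpreted
once the flag is set.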

I imagine there might be others who use it for similar purposes.

Regards,
Sam

On Wed, May 30, 2018 at 7:29 PM, Alessandro Benedetti <[hidden email]>
wrote:

> Question in general for the community :
> what is the dismax capable of doing that the edismax is not ?
> Is it really necessary to keep both of them or the dismax could be
> deprecated ?
>
> Cheers