Inconsistent query results in Lucene 8.1.0

Fiona Hasanaj
Hello,

I’m Fiona with Basis Technology. We’re investigating what we believe is a bug that produces inconsistent query results. We bisected the issue and found that it first appears with the flattening of nested disjunctions introduced by the merge of LUCENE-7386. To reproduce it, I have attached a Lucene index built with Lucene 8.1.0 (names_index.tar.gz) and a Java class (LuceneSearchIndex.java). If you run the class several times against Lucene 8.0.0, max_score is the same on every run. Against Lucene 8.1.0, max_score varies between runs: within about 10 runs you should see it return 1.8651859 on some runs and 2.1415303 on others.

From debugging in Lucene 8.1.0, the query against the name index before flattening its nested disjunctions looks like below:

(((bt_rni_name_encoded_1:ALFR)^0.75 bt_rni_name_encoded_1:ALTR (bt_rni_name_encoded_1:ANTR)^0.75 (bt_rni_name_encoded_1:LTR)^0.6666666) ((bt_rni_name_encoded_1:ALTR)^0.75 (bt_rni_name_encoded_1:FLTMR)^0.75 (bt_rni_name_encoded_1:FLTRN)^0.75 (bt_rni_name_encoded_1:FLTS)^0.75 (bt_rni_name_encoded_1:FTR)^0.6666666 (bt_rni_name_encoded_1:LTR)^0.6666666)) | (((bt_rni_name_encoded_2:FLTR)^0.75) (bt_rni_name_encoded_2:FLTR (bt_rni_name_encoded_2:FLTRN)^0.75))

The term that causes the difference in the final score is bt_rni_name_encoded_1:ALTR. As the query above shows, it appears twice, nested under different clauses: in the first clause its docFreq is 3, and in the second clause its docFreq is 2. This happens in Lucene 8.0.0 as well. Is it expected behaviour for the same term to be read with different docFreq values?
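To see why a docFreq of 2 versus 3 can move the final score, here is a minimal sketch. It assumes the index is scored with a BM25-style idf (the formula used by Lucene's default BM25Similarity); the similarity actually in use and the document count of 10 are assumptions, not values taken from this thread.

```java
class IdfSketch {
    // BM25-style inverse document frequency:
    // ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
    static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        long docCount = 10; // hypothetical index size, chosen only for illustration
        // The same term gets a different idf depending on which docFreq is used.
        System.out.println("idf with docFreq=2: " + idf(2, docCount));
        System.out.println("idf with docFreq=3: " + idf(3, docCount));
    }
}
```

Under this assumption a lower docFreq yields a higher idf, so whichever docFreq the term ends up with changes its contribution to the score.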

After flattening the nested disjunctions (part of query rewrite process), the query looks like below:

((bt_rni_name_encoded_1:FTR)^0.6666666 (bt_rni_name_encoded_1:FLTRN)^0.75 (bt_rni_name_encoded_1:FLTMR)^0.75 (bt_rni_name_encoded_1:ALFR)^0.75 (bt_rni_name_encoded_1:FLTS)^0.75 (bt_rni_name_encoded_1:ANTR)^0.75 (bt_rni_name_encoded_1:LTR)^1.3333333 (bt_rni_name_encoded_1:ALTR)^1.75) | ((bt_rni_name_encoded_2:FLTRN)^0.75 (bt_rni_name_encoded_2:FLTR)^1.75)

As you can see, bt_rni_name_encoded_1:ALTR now appears only once, with the weights from the original query summed (1 + 0.75 = 1.75). This is the version of the query that actually gets executed. Here the docFreq for bt_rni_name_encoded_1:ALTR is sometimes 3 and sometimes 2 between runs, and the final score changes accordingly. Is this "coin toss" pick of docFreq for the same term expected behaviour?
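The summed weights can be checked with a small sketch. This is not Lucene's implementation, only an illustration of merging duplicate clauses by adding their boosts; the class and method names are hypothetical, and an unboosted clause is assumed to count as 1.0.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class BoostMergeSketch {
    // Sums the boosts of clauses that refer to the same term, keeping first-seen order.
    static Map<String, Double> mergeBoosts(String[] terms, double[] boosts) {
        Map<String, Double> merged = new LinkedHashMap<>();
        for (int i = 0; i < terms.length; i++) {
            merged.merge(terms[i], boosts[i], Double::sum);
        }
        return merged;
    }

    public static void main(String[] args) {
        // Clauses on bt_rni_name_encoded_1 from the query before flattening.
        String[] terms = {"ALFR", "ALTR", "ANTR", "LTR",
                          "ALTR", "FLTMR", "FLTRN", "FLTS", "FTR", "LTR"};
        double[] boosts = {0.75, 1.0, 0.75, 0.6666666,
                           0.75, 0.75, 0.75, 0.75, 0.6666666, 0.6666666};
        // ALTR and LTR each appear twice, so their boosts are added together.
        mergeBoosts(terms, boosts).forEach((term, boost) -> System.out.println(term + "^" + boost));
    }
}
```

The ten original clauses collapse to eight, with ALTR at 1.75 and LTR at roughly 1.3333333, matching the flattened query shown above.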

It looks like the issue stems from one of the behaviours described above (highlighted in bold in the original message).

Looking forward to hearing back from you.



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Attachments: names_index.tar.gz (4K), LuceneSearchIndex.java (2K)
Re: Inconsistent query results in Lucene 8.1.0

Michele Palmia
Hi all,

I looked into this today. I can reproduce it and I believe it's a bug. 
This is caused by the following working together:
- LUCENE-7386 Flatten nested disjunctions
- LUCENE-7925 Deduplicate SHOULD and MUST clauses in BooleanQuery

Blended term queries modify the df/ttf of their terms to make sure all terms produce identical scores. In this case, two blended term queries contain a few terms each, only some of which overlap. Each query calculates its own df/ttf for its terms, and the values differ because the two term sets differ. During the rewrite process,
  1. the two Blended queries get rewritten as Boolean queries themselves, with each (modified) TermQuery as a SHOULD clause
  2. the nested Boolean queries get flattened, since they are nested disjunctions
  3. the Term queries (some of which are actually Boost queries) are deduplicated, with one of the two TermQuery and its modified TermStates being picked at random (the randomness is due to the HashSet underlying Lucene's MultiSet).
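The three steps above can be sketched with plain Java collections. This is a simplified model, not Lucene code: StatTerm is a hypothetical stand-in for a TermQuery carrying modified statistics, the raw docFreq values are made up, and the blending rule (every term in a group takes the group's maximum docFreq) is an assumption about how the blended query behaves. The visit order is made explicit here so the effect is reproducible.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class DedupSketch {
    // Hypothetical stand-in for a term query with pre-computed statistics.
    // Equality looks only at the term text, so docFreq is invisible to hash-based collections.
    static final class StatTerm {
        final String text;
        final int docFreq;
        StatTerm(String text, int docFreq) { this.text = text; this.docFreq = docFreq; }
        @Override public boolean equals(Object o) {
            return o instanceof StatTerm && ((StatTerm) o).text.equals(text);
        }
        @Override public int hashCode() { return text.hashCode(); }
    }

    // Step 1: blend a group so every term shares one docFreq (assumed: the group maximum).
    static List<StatTerm> blend(String[] texts, int[] rawDocFreqs) {
        int blended = 0;
        for (int df : rawDocFreqs) blended = Math.max(blended, df);
        List<StatTerm> out = new ArrayList<>();
        for (String text : texts) out.add(new StatTerm(text, blended));
        return out;
    }

    // Steps 2 and 3: flatten the groups in visit order, then deduplicate equal terms.
    // Set.add leaves the set unchanged when an equal element is present, so the first-seen copy wins.
    static int survivingDocFreq(String text, List<List<StatTerm>> groupsInVisitOrder) {
        Set<StatTerm> deduped = new LinkedHashSet<>();
        for (List<StatTerm> group : groupsInVisitOrder) deduped.addAll(group);
        for (StatTerm t : deduped) if (t.text.equals(text)) return t.docFreq;
        throw new IllegalArgumentException("no such term: " + text);
    }

    public static void main(String[] args) {
        // Raw docFreq values are hypothetical; only the blended results (3 and 2) come from the thread.
        List<StatTerm> first = blend(new String[]{"ALFR", "ALTR", "ANTR", "LTR"}, new int[]{1, 2, 3, 1});
        List<StatTerm> second = blend(new String[]{"ALTR", "FLTMR", "FLTRN", "FLTS", "FTR", "LTR"},
                                      new int[]{2, 1, 1, 1, 1, 1});
        System.out.println(survivingDocFreq("ALTR", Arrays.asList(first, second))); // 3
        System.out.println(survivingDocFreq("ALTR", Arrays.asList(second, first))); // 2
    }
}
```

In this model the surviving docFreq for ALTR depends only on which copy the collection sees first, which is why an unstable iteration order upstream would surface as a score that changes between runs.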
I haven't managed to create a failing test yet; I'll share it when I have one ready.
If anybody has suggestions or pointers on how this should be fixed, I'm also happy to provide a patch. I'm just not sure what the right fix would be here: I have a feeling that step 2 should not happen for (rewritten) Blended queries?

Cheers,
Michele


Re: Inconsistent query results in Lucene 8.1.0

Michael Sokolov-4
So - I think you should open an issue. Can you determine whether
flattening on its own would result in a bug? If not, then perhaps
focus on the merging (deduplication) and whether it properly respects
boosting?


Re: Inconsistent query results in Lucene 8.1.0

Atri Sharma-3
In reply to this post by Michele Palmia
> the two Blended queries get rewritten as Boolean queries themselves, with each (modified) TermQuery as a SHOULD clause
> the nested Boolean queries get flattened, since they are nested disjunctions
> the Term queries (some of which are actually Boost queries) are deduplicated, with one of the two TermQuery and its modified TermStates being picked at random (the randomness is due to the HashSet underlying Lucene's MultiSet).

This seems a bit worrisome in itself -- the data structure supporting
the implementation should not affect the selection.

--
Regards,

Atri
Apache Concerted


RE: Inconsistent query results in Lucene 8.1.0

Staley, Phil R - DCF
In reply to this post by Michele Palmia

We recently upgraded our Drupal 8 sites to SOLR 8.3.1. We are now getting reports of certain patterns of search terms resulting in an error that reads, “The website encountered an unexpected error. Please try again later.”

Below is a list of example terms that always result in this error and a similar list that works fine. The problem pattern seems to be a search term that contains 2 or 3 characters followed by a space, followed by additional text.

To confirm that the problem is version 8 of SOLR, I updated our local and UAT sites with the latest Drupal updates, which included an update to the Search API Solr module, and tested the terms below under SOLR 7.7.2, 8.3.1, and 8.4.1. Under version 7.7.2 everything works fine. Under either version 8 release, the problem returns.

Thoughts?

Search terms that result in error

• w-2 agency directory
• agency w-2 directory
• w-2 agency
• w-2 directory
• w2 agency directory
• w2 agency
• w2 directory

Search terms that do not result in error

• w-22 agency directory
• agency directory w-2
• agency w-2directory
• agencyw-2 directory
• w-2
• w2
• agency directory
• agency
• directory
• -2 agency directory
• 2 agency directory
• w-2agency directory
• w2agency directory


Re: Inconsistent query results in Lucene 8.1.0

david.w.smiley@gmail.com
Hi Phil,

Please start new threads (emails) for new problems instead of replying to an existing one. The behavior discussed in this thread does not result in an error; yours does, so I think the two are entirely dissimilar. Also, you'll need to dig deeper to learn what the particular error was and report that. Go to Solr's logs.

~ David Smiley
Apache Lucene/Solr Search Developer



Re: Inconsistent query results in Lucene 8.1.0

Michele Palmia
In reply to this post by Michael Sokolov-4
Fiona - I opened a ticket for this. You can find some recommendations there that might help you fix your issue.

Re: Inconsistent query results in Lucene 8.1.0

Adrien Grand
Thanks for digging into this issue, Michele.


--
Adrien