[jira] Created: (LUCENE-2686) DisjunctionSumScorer should not call .score on sub scorers until consumer calls .score

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

[jira] Created: (LUCENE-2686) DisjunctionSumScorer should not call .score on sub scorers until consumer calls .score

JIRA jira@apache.org
DisjunctionSumScorer should not call .score on sub scorers until consumer calls .score
--------------------------------------------------------------------------------------

                 Key: LUCENE-2686
                 URL: https://issues.apache.org/jira/browse/LUCENE-2686
             Project: Lucene - Java
          Issue Type: Bug
          Components: Search
            Reporter: Michael McCandless
             Fix For: 3.1, 4.0


Spinoff from java-user thread "question about Scorer.freq()" from Koji...

BooleanScorer2 uses DisjunctionSumScorer to score only-SHOULD-clause boolean queries.

But, this scorer does too much work for collectors that never call .score, because it scores while it's matching.  It should only call .score on the subs when the caller calls its .score.

This also has the side effect of messing up advanced collectors that gather the freq() of the subs (using LUCENE-2590).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2686) DisjunctionSumScorer should not call .score on sub scorers until consumer calls .score

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917672#action_12917672 ]

Paul Elschot commented on LUCENE-2686:
--------------------------------------

For the record, the reason for calling score() in the current way is to avoid the housekeeping of recording which scorer(s) actually matched.

Instead one could perhaps walk the binary tree in the scorer queue heap and only add to the sum score and recurse into the tree when the current doc of the scorer equals the current doc. Maybe this can even done without recursive calls.


> DisjunctionSumScorer should not call .score on sub scorers until consumer calls .score
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2686
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2686
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>            Reporter: Michael McCandless
>             Fix For: 3.1, 4.0
>
>
> Spinoff from java-user thread "question about Scorer.freq()" from Koji...
> BooleanScorer2 uses DisjunctionSumScorer to score only-SHOULD-clause boolean queries.
> But, this scorer does too much work for collectors that never call .score, because it scores while it's matching.  It should only call .score on the subs when the caller calls its .score.
> This also has the side effect of messing up advanced collectors that gather the freq() of the subs (using LUCENE-2590).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (LUCENE-2686) DisjunctionSumScorer should not call .score on sub scorers until consumer calls .score

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-2686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-2686:
-----------------------------------

    Attachment: Test2LUCENE2590.java

Thanks Mike for opening this!

The attached program is what I want to do - I'd like to know which field a match occurred. TestSubScorerFreqs of LUCENE-2590 calls BooleanScorer2.freq() and it returns expected freq count. In my program, I get TermScorer from BooleanScorer2 via ScorerVisitor and try to call TermScorer.freq() in collect() method:

{code}
public void collect(int doc) throws IOException {
  int freq = 0;
  for( TermQueryScorer tqs : tqsSet ){
    Scorer scorer = tqs.scorer;
    int matchId = scorer.docID();    // matchId isn't expected
    if( matchId == doc ){
      freq += scorer.freq();              // this line is never executed
    }
  }
  docCounts.put(doc + docBase, freq);
  collector.collect(doc);
}
{code}

but TermScorer.docID() returns unexpected id and TermScorer.freq() isn't executed (even if I remove "matchId == doc" condition, TermScorer.freq() returns unexpected number anyway).

> DisjunctionSumScorer should not call .score on sub scorers until consumer calls .score
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2686
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2686
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>            Reporter: Michael McCandless
>             Fix For: 3.1, 4.0
>
>         Attachments: Test2LUCENE2590.java
>
>
> Spinoff from java-user thread "question about Scorer.freq()" from Koji...
> BooleanScorer2 uses DisjunctionSumScorer to score only-SHOULD-clause boolean queries.
> But, this scorer does too much work for collectors that never call .score, because it scores while it's matching.  It should only call .score on the subs when the caller calls its .score.
> This also has the side effect of messing up advanced collectors that gather the freq() of the subs (using LUCENE-2590).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Assigned: (LUCENE-2686) DisjunctionSumScorer should not call .score on sub scorers until consumer calls .score

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-2686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reassigned LUCENE-2686:
------------------------------------------

    Assignee: Michael McCandless

> DisjunctionSumScorer should not call .score on sub scorers until consumer calls .score
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2686
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2686
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 3.1, 4.0
>
>         Attachments: Test2LUCENE2590.java
>
>
> Spinoff from java-user thread "question about Scorer.freq()" from Koji...
> BooleanScorer2 uses DisjunctionSumScorer to score only-SHOULD-clause boolean queries.
> But, this scorer does too much work for collectors that never call .score, because it scores while it's matching.  It should only call .score on the subs when the caller calls its .score.
> This also has the side effect of messing up advanced collectors that gather the freq() of the subs (using LUCENE-2590).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (LUCENE-2686) DisjunctionSumScorer should not call .score on sub scorers until consumer calls .score

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-2686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-2686:
---------------------------------------

    Attachment: LUCENE-2686.patch

So... the good news is I made a new scorer (basically copied DisjunctionMaxScorer and then tweaked from there) that scores the OR-only case.  All tests pass w/ this new scorer.

And more good news is that if you don't score (I sort by doctitle to do that), you get a speedup -- 7.7% in my simplistic test (prefix query unit*, expands to 988 terms, but I force it to do a scoring BQ rewrite, plus force it to use BS2 not BS -- the nocommits in the patch).

But the bad news is with scoring on it's 22.7% slower!

And, the weird news is, I discovered accidentally that BS2 is much (> 2X) faster for this one query.  I think we need to modify the criteria that decides whether to use BS or BS2...  maybe when there are lots of lowish-docFreq terms, BS2 is better?

> DisjunctionSumScorer should not call .score on sub scorers until consumer calls .score
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2686
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2686
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2686.patch, Test2LUCENE2590.java
>
>
> Spinoff from java-user thread "question about Scorer.freq()" from Koji...
> BooleanScorer2 uses DisjunctionSumScorer to score only-SHOULD-clause boolean queries.
> But, this scorer does too much work for collectors that never call .score, because it scores while it's matching.  It should only call .score on the subs when the caller calls its .score.
> This also has the side effect of messing up advanced collectors that gather the freq() of the subs (using LUCENE-2590).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (LUCENE-2686) DisjunctionSumScorer should not call .score on sub scorers until consumer calls .score

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-2686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-2686:
---------------------------------------

    Attachment: LUCENE-2686.patch

New patch attached.

I added Koji's test as a unit test, that fails on trunk but passes now
with the patch.

The new scorer is definitely slower if you do want scoring, however,
it's actually uncommon for this scorer to be used... because BQ will
use BS when the query is all SHOULD clauses (plus up to 32 NOT).
Only if there is also 1 or more MUST clauses will this scorer be used,
or, if the collector does not support out-of-order scoring.  I had to
hack BQ to temporarily turn off BS to test this.

Insanely, the constant score BQ rewrite does use this scorer, because
when ConstantScorer invokes the sub-scorer it requires in-order
scoring.  So ConstantScoreQuery and constant score BQ rewrite for the
MTQs will always see the speedup from this patch.

So given that it's rare to actually use the scorer, I think ~8% gain
as seen by "default" usage makes it worthwhile net/net.

Longer term... we should probably change the Weight.scorer API so that
you notify it up-front if you need to do scoring.  This way
we can specialize (manually or automatically) the best class, instead
of waiting to see whether the .score() method is invoked per hit.


> DisjunctionSumScorer should not call .score on sub scorers until consumer calls .score
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2686
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2686
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2686.patch, LUCENE-2686.patch, Test2LUCENE2590.java
>
>
> Spinoff from java-user thread "question about Scorer.freq()" from Koji...
> BooleanScorer2 uses DisjunctionSumScorer to score only-SHOULD-clause boolean queries.
> But, this scorer does too much work for collectors that never call .score, because it scores while it's matching.  It should only call .score on the subs when the caller calls its .score.
> This also has the side effect of messing up advanced collectors that gather the freq() of the subs (using LUCENE-2590).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]