minShouldMatch can't leapfrog

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

minShouldMatch can't leapfrog

Mikhail Khludnev
Developers,

I want to discuss few points regarding disjunction form of BooleanQuery with minShouldMatch constraint. I'm talking about doc-at-time evaluation only (BooleanScorer2).
Look into conjunction query which has disjunction in one of its' clauses e.g. +foo +(bar baz …). if disjunction "(bar baz … )" has high minShouldMatch constraint even if conjunction clause +foo is highly selective this query performs quite bad. It also happens if instead of +foo you have a filter. Once again: it's reasonable that disjunction even with high minShouldMatch is expensive, if "core" disjunction with minShouldMatch=1 matches few millions of docs. The problem is that I can't speed it up by supplying highly selective filter.
From my POV there two points in Lucene API which make leapfrog impossible:
- advance() is obliged to return next matching doc that causes scan in nextDoc() loop. It's great to have something like advanceExact(), or return some magic value from advance says - "failed to advance, propose next for leapfrog";
- Scorer is obliged to jump on the first matching doc after it's created that leads to scan many docs in nextDoc() loop;
- ConjunctiveScorer can't know which of its legs are not able to leapfrog and prefer to decline advance()
- Stefan spotted one more gain for minShouldMatch in DisjunctionSumScorer: We don't need to walk through top of heap after top scorer is advanced. Instead of this, let's pop minShouldMatch scores from the heap, look at the top after this - top doc can be evaluated as potential match which exceeds minShouldMatch constraint. After that, let's push those guys back to the heap advancing them to that candidate docnum. It's not cohesioned with leapfrog problem, though.

This last one looks impressing, but I'm not smart enough to realize that it gives performance gain. Do you think it's valuable optimization for Lucene users? How minShouldMatch is popular,btw? To be honest I'm not really suffer form minShouldMatch itself, I have query with my own match verification logic and therefore lack of leapfrog bothers me much.

Looking forward for you feedback!
--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

[hidden email]


Reply | Threaded
Open this post in threaded view
|

Re: minShouldMatch can't leapfrog

Mikhail Khludnev
Hello Developers,

I appreciate your feedback. I raised the ticket for the last one which is more common and reachable gain. https://issues.apache.org/jira/browse/LUCENE-4571
Stefan kindly agreed to add more details into it. 

Happy Black Friday!


On Tue, Nov 13, 2012 at 11:59 PM, Mikhail Khludnev <[hidden email]> wrote:
Developers,

I want to discuss few points regarding disjunction form of BooleanQuery with minShouldMatch constraint. I'm talking about doc-at-time evaluation only (BooleanScorer2).
Look into conjunction query which has disjunction in one of its' clauses e.g. +foo +(bar baz …). if disjunction "(bar baz … )" has high minShouldMatch constraint even if conjunction clause +foo is highly selective this query performs quite bad. It also happens if instead of +foo you have a filter. Once again: it's reasonable that disjunction even with high minShouldMatch is expensive, if "core" disjunction with minShouldMatch=1 matches few millions of docs. The problem is that I can't speed it up by supplying highly selective filter.
From my POV there two points in Lucene API which make leapfrog impossible:
- advance() is obliged to return next matching doc that causes scan in nextDoc() loop. It's great to have something like advanceExact(), or return some magic value from advance says - "failed to advance, propose next for leapfrog";
- Scorer is obliged to jump on the first matching doc after it's created that leads to scan many docs in nextDoc() loop;
- ConjunctiveScorer can't know which of its legs are not able to leapfrog and prefer to decline advance()
- Stefan spotted one more gain for minShouldMatch in DisjunctionSumScorer: We don't need to walk through top of heap after top scorer is advanced. Instead of this, let's pop minShouldMatch scores from the heap, look at the top after this - top doc can be evaluated as potential match which exceeds minShouldMatch constraint. After that, let's push those guys back to the heap advancing them to that candidate docnum. It's not cohesioned with leapfrog problem, though.

This last one looks impressing, but I'm not smart enough to realize that it gives performance gain. Do you think it's valuable optimization for Lucene users? How minShouldMatch is popular,btw? To be honest I'm not really suffer form minShouldMatch itself, I have query with my own match verification logic and therefore lack of leapfrog bothers me much.

Looking forward for you feedback!
--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

[hidden email]





--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

[hidden email]