[jira] Updated: (LUCENE-2410) Optimize PhraseQuery

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view

[jira] Updated: (LUCENE-2410) Optimize PhraseQuery

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-2410:

    Attachment: LUCENE-2410.patch

Attached initial rough patch, doing the 1st and 3rd bullets above.
Still many nocommits, but all tests pass.

I only did this for the exact case (I don't understand the sloppy
case!), so I modified ExactPhraseScorer to no longer subclass
PhraseScorer and instead do everything on its own.

I tested on a 20M doc Wikipedia index, best of 10 runs:

||Query||No. hits||Trunk QPS||Patch QPS||Speedup||
|United States|314K|4.29|11.04|2.6X faster|
|United Kingdom Parliament|7K|20.33|58.57|2.9X faster|

The speedup is great :)

However, there's one problem w/ the patch that I must fix (and will
bring these gains down), which is it requires 2 int arrays sized to
the max position encountered during the search (which for a large doc
could be very large).  I think to make this committable I'd have to
switch to processing the positions in chunks (like BooleanScorer).

> Optimize PhraseQuery
> --------------------
>                 Key: LUCENE-2410
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2410
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Michael McCandless
>             Fix For: 4.0
>         Attachments: LUCENE-2410.patch, LUCENE-2410_rewrite.patch
> Looking the scorers for PhraseQuery, I think there are some speedups
> we could do:
>   * The AND part of the scorer (which advances to the next doc that
>     has all the terms), in PhraseScorer.doNext, should do the same
>     optimizing as BooleanQuery's ConjunctionScorer, ie sort terms from
>     rarest to most frequent.  I don't think it should use a linked
>     list/firstToLast() that it does today.
>   * We do way too much work now when .score() is not called, because
>     we go and find all occurrences of the phrase in the doc, whereas
>     we should stop only after finding the first and then go and count
>     the rest if .score() is called.
>   * For the exact case, I think we can use two int arrays to find the
>     matches.  The first array holds the count of how many times a term
>     in the phrase "matched" a phrase starting at that position.  When
>     that count == the number of terms in the phrase, it's a match.
>     The 2nd is a "gen" array (holds docID when that count was last
>     touched), to avoid clearing.  Ie when incrementing the count, if
>     the docID != gen, we reset count to 0.  I think this'd be faster
>     than the PQ we now use.  Downside of this is if you have immense
>     docs (position gets very large) we'd need 2 immense arrays.
> It'd be great to do LUCENE-1252 along with this, ie factor
> PhraseScorer into two AND'd sub-scorers (LUCENE-1252 is open for
> this).  The first one should be ConjunctionScorer, and the 2nd one
> checks the positions (ie, either the exact or sloppy scorers).  This
> would mean if the PhraseQuery is AND'd w/ other clauses (or, a filter
> is applied) we would save CPU by not checking the positions for a doc
> unless all other AND'd clauses accepted the doc.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]