[jira] [Updated] (LUCENE-4198) Allow codecs to index term impacts

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-4198:
---------------------------------
    Attachment: LUCENE-4198.patch

I have taken another approach. The issue with {{setMinCompetitiveScore}} is that it usually cannot be leveraged efficiently to speed up e.g. conjunctions. So I went with implementing ideas from the block-max WAND (BMW) paper (http://engineering.nyu.edu/~suel/papers/bmw.pdf): the patch introduces a new {{ImpactsEnum}}, which extends {{PostingsEnum}} and exposes two APIs instead of {{setMinCompetitiveScore}} (see the sketch after this list):
 - {{int advanceShallow(int target)}} to get scoring information for documents starting at {{target}}. The benefit compared to {{advance}} is that it only advances the skip-list reader, which is much cheaper: no postings decoding happens.
 - {{float getMaxScore(int upTo)}}, which returns the maximum score for doc ids between the last target passed to {{advanceShallow}} and {{upTo}}, both inclusive.
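
To make the intent concrete, here is an illustration-only sketch of how a scorer could combine these two methods to skip non-competitive ranges of doc ids, in the spirit of BMW. Only the two method signatures come from the patch description above; the class, the helper names, and the meaning of {{advanceShallow}}'s return value (assumed here to be the last doc id covered by the shallow-advanced block) are made up for this sketch.

{code:java}
import java.io.IOException;

// Illustration-only sketch, not code from the patch. The interface below only
// mirrors the two methods described above; everything else is assumed.
final class BlockMaxSkipSketch {

  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  interface ShallowAdvanceable {
    int advanceShallow(int target) throws IOException; // cheap: skip-list reader only
    float getMaxScore(int upTo) throws IOException;    // max score in [last target, upTo]
  }

  /** Returns the first doc id >= target whose block may hold a competitive hit. */
  static int skipNonCompetitiveBlocks(ShallowAdvanceable impacts, int target,
                                      float minCompetitiveScore) throws IOException {
    while (target < NO_MORE_DOCS) {
      // Assumption: the returned int is the last doc id of the shallow-advanced
      // block. No postings decoding happens here.
      int blockUpTo = impacts.advanceShallow(target);
      if (impacts.getMaxScore(blockUpTo) >= minCompetitiveScore) {
        return target; // this block might contain a competitive document
      }
      target = blockUpTo + 1; // the whole block is non-competitive: skip past it
    }
    return NO_MORE_DOCS;
  }
}
{code}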

Currently only TermScorer leverages this, but the benefit is that we could add these APIs to Scorer as well in a follow-up issue so that WANDScorer and ConjunctionScorer could leverage them too. I already built a prototype to make sure that there is an actual speedup for some queries, but I'm leaving it to a follow-up issue since indexing impacts is already challenging on its own.

One thing the prototype made me change is that the new patch also stores all impacts on the first level, which is written every 128 documents. This seemed important for conjunctions: unlike term queries, which match all documents in a block, a conjunction does not necessarily match the document that reaches the block's maximum score, so good score bounds matter more for conjunctions than for term queries. The disk overhead is still acceptable to me: the wikimedium10 index is only 1.4% larger overall, and the postings alone (the .doc file) are only 3.1% larger.
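
To make that trade-off concrete, here is a toy sketch of how the impacts stored for a block could be turned into a score upper bound. The representation of an impact as a freq/norm pair and the {{SimScorer}} hook are assumptions for illustration only, not names from the patch. The tighter this bound is, the more blocks a conjunction can rule out without decoding postings, which is why keeping all impacts on the first level seemed worth the small size overhead.

{code:java}
// Illustration-only sketch, not code from the patch: computing a per-block
// score upper bound from stored impacts (assumed here to be freq/norm pairs).
final class BlockMaxScoreSketch {

  /** Stand-in for whatever per-document scoring hook the Similarity exposes. */
  interface SimScorer {
    float score(float freq, long norm);
  }

  /** Upper bound on the score of any document in a block, given its impacts. */
  static float maxScore(int[] freqs, long[] norms, SimScorer scorer) {
    float max = 0f;
    for (int i = 0; i < freqs.length; ++i) {
      max = Math.max(max, scorer.score(freqs[i], norms[i]));
    }
    return max;
  }
}
{code}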

Here are the benchmark results:

{noformat}
                    TaskQPS baseline      StdDev   QPS patch      StdDev                Pct diff
              AndHighLow     1128.91      (3.5%)      875.48      (2.3%)  -22.4% ( -27% -  -17%)
              AndHighMed      409.67      (2.0%)      331.98      (1.7%)  -19.0% ( -22% -  -15%)
               OrHighMed      264.99      (3.5%)      229.15      (3.0%)  -13.5% ( -19% -   -7%)
               OrHighLow      111.47      (4.5%)       98.00      (3.2%)  -12.1% ( -18% -   -4%)
              OrHighHigh       34.88      (4.2%)       31.69      (4.0%)   -9.1% ( -16% -   -1%)
            OrNotHighLow     1373.74      (5.2%)     1291.72      (4.1%)   -6.0% ( -14% -    3%)
               LowPhrase       78.14      (1.6%)       75.28      (1.2%)   -3.7% (  -6% -    0%)
               MedPhrase       47.49      (1.6%)       45.92      (1.2%)   -3.3% (  -5% -    0%)
         LowSloppyPhrase      208.43      (2.8%)      202.37      (2.7%)   -2.9% (  -8% -    2%)
                  Fuzzy1      300.99      (7.7%)      292.78      (8.0%)   -2.7% ( -17% -   13%)
             LowSpanNear       62.73      (1.4%)       61.09      (1.3%)   -2.6% (  -5% -    0%)
                  Fuzzy2      188.37      (7.9%)      184.16      (6.7%)   -2.2% ( -15% -   13%)
             MedSpanNear       57.41      (1.8%)       56.17      (1.5%)   -2.2% (  -5% -    1%)
         MedSloppyPhrase       23.21      (2.3%)       22.73      (2.3%)   -2.1% (  -6% -    2%)
              HighPhrase       48.75      (3.2%)       47.80      (3.6%)   -1.9% (  -8% -    4%)
            HighSpanNear       40.04      (2.9%)       39.35      (2.7%)   -1.7% (  -7% -    4%)
       HighTermMonthSort      228.21      (8.4%)      224.66      (7.9%)   -1.6% ( -16% -   16%)
        HighSloppyPhrase       25.96      (2.8%)       25.61      (3.0%)   -1.4% (  -6% -    4%)
                 Respell      284.85      (3.7%)      282.42      (4.0%)   -0.9% (  -8% -    7%)
                  IntNRQ       18.87      (5.3%)       18.86      (6.8%)   -0.1% ( -11% -   12%)
                Wildcard       85.50      (5.0%)       86.79      (4.0%)    1.5% (  -7% -   11%)
                 Prefix3      137.41      (6.5%)      141.61      (4.9%)    3.1% (  -7% -   15%)
   HighTermDayOfYearSort      116.58      (6.3%)      121.38      (7.2%)    4.1% (  -8% -   18%)
             AndHighHigh       37.64      (1.5%)      118.12      (6.7%)  213.8% ( 202% -  225%)
                 LowTerm      909.13      (2.2%)     3379.38     (11.2%)  271.7% ( 252% -  291%)
            OrNotHighMed      196.21      (1.7%)     1509.92     (28.9%)  669.6% ( 627% -  712%)
                 MedTerm      305.82      (1.7%)     2897.01     (42.5%)  847.3% ( 789% -  907%)
                HighTerm      108.94      (1.7%)     1191.54     (61.3%)  993.8% ( 915% - 1075%)
            OrHighNotMed       81.83      (0.5%)     1082.94     (63.2%) 1223.5% (1153% - 1294%)
           OrHighNotHigh       63.16      (0.7%)      995.35     (72.0%) 1475.8% (1393% - 1559%)
            OrHighNotLow       34.53      (0.9%)      557.35     (94.8%) 1514.1% (1406% - 1623%)
           OrNotHighHigh       51.95      (0.9%)      943.81     (73.7%) 1716.6% (1626% - 1808%)
{noformat}

Results are a bit worse than previously, for two reasons:
 - The new API doesn't allow term queries to skip more than one block (128 docs) at a time. But I think the prospect of speeding up boolean queries too makes it a good trade-off, especially given that this is already quite a significant speed-up for term queries.
 - The new implementation is more complex since it also (optionally) gives access to positions, offsets and payloads. I did not specialize the case where only frequencies are requested (hopefully this won't be necessary). Information about positions could be useful in the future to speed up phrase queries.

I wouldn't worry too much about the slowdown for some boolean queries. First, the most affected queries are ones that are already quite fast. Second, I hope that follow-ups will allow us to get some of this performance back.

> Allow codecs to index term impacts
> ----------------------------------
>
>                 Key: LUCENE-4198
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4198
>             Project: Lucene - Core
>          Issue Type: Sub-task
>          Components: core/index
>            Reporter: Robert Muir
>         Attachments: LUCENE-4198.patch, LUCENE-4198.patch, LUCENE-4198.patch, LUCENE-4198.patch, LUCENE-4198_flush.patch
>
>
> Subtask of LUCENE-4100.
> That's an example of something similar to impact indexing (though his implementation currently stores a max for the entire term, the problem is the same).
> We can imagine other similar algorithms too: I think the codec API should be able to support these.
> Currently it really doesn't: Stefan worked around the problem by providing a tool to 'rewrite' your index, passing the IndexReader and Similarity to it. But it would be better if we fixed the codec API.
> One problem is that the Postings writer needs to have access to the Similarity. Another problem is that it needs access to the term and collection statistics up front, rather than after the fact.
> This might have some cost (hopefully minimal), so I'm thinking to experiment in a branch with these changes and see if we can make it work well.


