[jira] [Commented] (LUCENE-4198) Allow codecs to index term impacts

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16324101#comment-16324101 ]

Adrien Grand commented on LUCENE-4198:
--------------------------------------

I also tested wikibigall, whose document lengths are not artificially truncated the way they are in wikimedium:

{noformat}
                    Task QPS baseline      StdDev   QPS patch      StdDev                Pct diff
              AndHighLow     1440.24      (3.0%)      794.43      (2.9%)  -44.8% ( -49% -  -40%)
              AndHighMed      121.80      (1.4%)       94.75      (1.5%)  -22.2% ( -24% -  -19%)
             AndHighHigh       56.62      (1.2%)       45.26      (1.4%)  -20.1% ( -22% -  -17%)
               OrHighMed       93.16      (3.3%)       78.18      (3.1%)  -16.1% ( -21% -   -9%)
               OrHighLow      827.62      (2.6%)      748.49      (3.5%)   -9.6% ( -15% -   -3%)
              OrHighHigh       35.14      (4.4%)       32.25      (4.6%)   -8.2% ( -16% -    0%)
                  Fuzzy1      265.67      (4.7%)      246.12      (5.0%)   -7.4% ( -16% -    2%)
               LowPhrase      166.32      (1.3%)      157.61      (1.6%)   -5.2% (  -8% -   -2%)
                  Fuzzy2      184.41      (4.3%)      176.40      (3.5%)   -4.3% ( -11% -    3%)
             LowSpanNear      749.77      (2.1%)      726.14      (2.2%)   -3.2% (  -7% -    1%)
               MedPhrase       23.77      (2.0%)       23.14      (1.9%)   -2.6% (  -6% -    1%)
              HighPhrase       18.73      (3.0%)       18.24      (3.0%)   -2.6% (  -8% -    3%)
             MedSpanNear      113.11      (2.3%)      110.17      (2.0%)   -2.6% (  -6% -    1%)
         MedSloppyPhrase       10.28      (6.5%)       10.07      (6.9%)   -2.0% ( -14% -   12%)
         LowSloppyPhrase       12.68      (6.6%)       12.43      (7.1%)   -2.0% ( -14% -   12%)
        HighSloppyPhrase        9.47      (7.0%)        9.29      (7.5%)   -1.9% ( -15% -   13%)
                  IntNRQ       27.89      (7.0%)       27.58      (8.7%)   -1.1% ( -15% -   15%)
            HighSpanNear        9.05      (4.9%)        8.98      (4.7%)   -0.8% (  -9% -    9%)
                 Respell      273.80      (2.3%)      273.79      (2.2%)   -0.0% (  -4% -    4%)
       HighTermMonthSort       68.77      (7.1%)       69.60      (7.8%)    1.2% ( -12% -   17%)
                Wildcard       92.81      (5.8%)       94.67      (6.2%)    2.0% (  -9% -   14%)
   HighTermDayOfYearSort       61.99     (10.3%)       64.18     (10.9%)    3.5% ( -16% -   27%)
                 Prefix3       41.42      (8.3%)       42.96      (8.2%)    3.7% ( -11% -   22%)
                 LowTerm      694.99      (2.5%)     3126.69     (17.7%)  349.9% ( 321% -  379%)
                HighTerm       58.04      (2.7%)      490.60     (58.6%)  745.3% ( 666% -  828%)
                 MedTerm      120.80      (2.6%)     1053.44     (55.1%)  772.1% ( 695% -  852%)
{noformat}

With the patch, the {{.doc}} file is 5.2% larger and the index is 1.5% larger overall.
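
For anyone reading the table above, the "Pct diff" column can be reproduced from the two QPS columns as the relative change of the patch versus the baseline. The snippet below is only a sketch of that arithmetic; the helper name pct_diff is made up for illustration, and it does not attempt to reproduce the range shown in parentheses.

```python
def pct_diff(qps_baseline: float, qps_patch: float) -> float:
    """Relative QPS change of the patch versus the baseline, in percent."""
    return 100.0 * (qps_patch - qps_baseline) / qps_baseline


# (task, QPS baseline, QPS patch) copied from a few rows of the table.
rows = [
    ("AndHighLow", 1440.24, 794.43),
    ("LowTerm", 694.99, 3126.69),
    ("HighTerm", 58.04, 490.60),
    ("MedTerm", 120.80, 1053.44),
]

for task, base, patch in rows:
    # Also print the speedup factor, which is easier to read for large gains.
    print(f"{task:>12} {pct_diff(base, patch):7.1f}%  ({patch / base:.2f}x)")
```

Read this way, the term queries run several times faster with the patch, while the conjunction queries lose between roughly a fifth and a half of their throughput.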

> Allow codecs to index term impacts
> ----------------------------------
>
>                 Key: LUCENE-4198
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4198
>             Project: Lucene - Core
>          Issue Type: Sub-task
>          Components: core/index
>            Reporter: Robert Muir
>         Attachments: LUCENE-4198.patch, LUCENE-4198.patch, LUCENE-4198.patch, LUCENE-4198.patch, LUCENE-4198_flush.patch
>
>
> Subtask of LUCENE-4100.
> That's an example of something similar to impact indexing (his implementation currently stores a max for the entire term, but the problem is the same).
> We can imagine other similar algorithms too: I think the codec API should be able to support these.
> Currently it really doesn't: Stefan worked around the problem by providing a tool to 'rewrite' your index, to which he passes the IndexReader and Similarity. But it would be better if we fixed the codec API.
> One problem is that the Postings writer needs to have access to the Similarity. Another problem is that it needs access to the term and collection statistics up front, rather than after the fact.
> This might have some cost (hopefully minimal), so I'm thinking of experimenting in a branch with these changes to see if we can make it work well.
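
To illustrate why the description says the postings writer needs both the Similarity and the term and collection statistics, here is a minimal sketch of the "max for the entire term" idea. It is not the patch's code: BM25 stands in for whatever Similarity is configured, and the function names, the sample postings, and the statistics are all hypothetical.

```python
import math


def bm25(freq, doc_len, doc_freq, doc_count, avg_doc_len, k1=1.2, b=0.75):
    """BM25 score of one posting; needs term and collection statistics."""
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * freq * (k1 + 1) / (freq + length_norm)


def max_score_for_term(postings, doc_count, avg_doc_len):
    """Upper bound on the score this term can contribute to any document.

    postings is a list of (freq, doc_len) pairs. doc_count and avg_doc_len
    describe the whole collection, so they must be known before the bound
    can be computed.
    """
    doc_freq = len(postings)
    return max(
        bm25(freq, doc_len, doc_freq, doc_count, avg_doc_len)
        for freq, doc_len in postings
    )


def can_skip(term_max_score, min_competitive_score):
    """A term whose best possible score is not competitive can be skipped."""
    return term_max_score < min_competitive_score


# Hypothetical data: three postings in a collection of 1000 documents.
postings = [(1, 100), (3, 50), (2, 400)]
upper_bound = max_score_for_term(postings, doc_count=1000, avg_doc_len=120.0)
print(upper_bound, can_skip(upper_bound, min_competitive_score=100.0))
```

The bound depends on doc_freq, doc_count and avg_doc_len, none of which are final until every document has been seen. That is the ordering problem the description raises when it asks for the statistics up front rather than after the fact.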



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
