[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring


JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16203135#comment-16203135 ]

Adrien Grand commented on LUCENE-4100:
--------------------------------------

Thanks for looking!

bq. Can we avoid the ScoreMode.merge? This seems really, really confusing. In general I don't think we should support such merging in MultiCollector or anywhere else, we should simply throw exception if things are different.

We have been allowing it until now (by ORing the needsScores booleans), so I didn't want to break this behaviour. I can remove this method and keep the logic inside MultiCollector, essentially returning COMPLETE if any of the collectors' score modes differ. However, I think failing would be surprising to many users and would probably need to be an 8.0 change if we decide to do it?
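
For illustration only, here is a minimal sketch of what keeping that logic inside MultiCollector could look like. The `Mode` enum and the `combine` helper are hypothetical stand-ins (only the COMPLETE value is named above; the other two values are assumptions), not code from the patch.

```java
import java.util.Arrays;
import java.util.List;

final class ScoreModeMergeSketch {

    /** Hypothetical stand-in for the score mode enum; only COMPLETE is named in the discussion. */
    enum Mode { COMPLETE, COMPLETE_NO_SCORES, TOP_SCORES }

    /** Returns the shared mode, or COMPLETE as soon as two wrapped collectors disagree. */
    static Mode combine(List<Mode> modes) {
        if (modes.isEmpty()) {
            throw new IllegalArgumentException("at least one collector is required");
        }
        Mode first = modes.get(0);
        for (Mode mode : modes) {
            if (mode != first) {
                // Assumption: COMPLETE (all hits, with scores) satisfies every collector.
                return Mode.COMPLETE;
            }
        }
        return first;
    }

    public static void main(String[] args) {
        // Identical modes are passed through unchanged.
        System.out.println(combine(Arrays.asList(Mode.TOP_SCORES, Mode.TOP_SCORES)));
        // Differing modes fall back to COMPLETE instead of throwing.
        System.out.println(combine(Arrays.asList(Mode.TOP_SCORES, Mode.COMPLETE_NO_SCORES)));
    }
}
```

The alternative raised in the review would replace the fallback branch with an exception.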

bq. Perhaps instead of the enum two booleans would be easier for now.

This is what I wanted to do first, but I didn't like the fact that it would allow passing needsScores=false and needsTotalHits=false, which doesn't make sense. If you still prefer the two-booleans approach despite this, I'm happy to make the change.
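
To make the trade-off concrete, here is a rough sketch, assuming a hypothetical three-valued enum whose constants map onto the two booleans. The constant names and the `isRepresentable` helper are illustrative assumptions, not the patch's API.

```java
final class ScoreModeEnumSketch {

    /** Hypothetical enum: each constant fixes one valid (needsScores, needsTotalHits) pair. */
    enum Mode {
        COMPLETE(true, true),
        COMPLETE_NO_SCORES(false, true),
        TOP_SCORES(true, false);

        private final boolean needsScores;
        private final boolean needsTotalHits;

        Mode(boolean needsScores, boolean needsTotalHits) {
            this.needsScores = needsScores;
            this.needsTotalHits = needsTotalHits;
        }

        boolean needsScores() { return needsScores; }

        boolean needsTotalHits() { return needsTotalHits; }
    }

    /** True if some enum constant encodes the given pair of booleans. */
    static boolean isRepresentable(boolean needsScores, boolean needsTotalHits) {
        for (Mode mode : Mode.values()) {
            if (mode.needsScores() == needsScores && mode.needsTotalHits() == needsTotalHits) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Two booleans admit four combinations; the enum admits only the three meaningful ones.
        System.out.println(isRepresentable(true, false));  // true
        System.out.println(isRepresentable(false, false)); // false
    }
}
```

With two plain booleans the false/false case would have to be rejected at runtime; with an enum it cannot be expressed at all.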

bq. I don't understand why we should set the totalHitCount to -1, vs setting to a useful approximation, like google. The user said they didn't need the exact total hit count, so it should be no surprise, and it's a hell of a lot more useful than a negative number.

I agree with that statement, but how do we compute a good estimate? It sounds challenging, as the number of collected documents might be much lower than the actual number of hits, while the cost of the scorer can be highly overestimated, e.g. for phrase queries. Should I return the number of collected documents and add documentation that this is a lower bound of the total number of hits?
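
As a rough sketch of the lower-bound option, the count could be reported together with a flag saying whether it is exact. Everything here (`HitCountSketch`, `count`, the `collectLimit` parameter) is hypothetical and only models the idea; it is not taken from the patch.

```java
final class HitCountSketch {

    final long value;    // number of hits counted
    final boolean exact; // false means "value" is only a lower bound

    private HitCountSketch(long value, boolean exact) {
        this.value = value;
        this.exact = exact;
    }

    /**
     * Models a search over totalMatches hits. When the total is not needed,
     * collection stops after collectLimit documents and the count is reported
     * as a lower bound rather than as -1.
     */
    static HitCountSketch count(long totalMatches, long collectLimit, boolean needsTotalHits) {
        if (needsTotalHits) {
            return new HitCountSketch(totalMatches, true);
        }
        long collected = Math.min(totalMatches, collectLimit);
        return new HitCountSketch(collected, false);
    }

    @Override
    public String toString() {
        return exact ? Long.toString(value) : ">=" + value;
    }

    public static void main(String[] args) {
        System.out.println(count(1_000_000, 10, true));  // 1000000
        System.out.println(count(1_000_000, 10, false)); // >=10
    }
}
```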

> Maxscore - Efficient Scoring
> ----------------------------
>
>                 Key: LUCENE-4100
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4100
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs, core/query/scoring, core/search
>    Affects Versions: 4.0-ALPHA
>            Reporter: Stefan Pohl
>              Labels: api-change, gsoc2014, patch, performance
>             Fix For: 4.9, 6.0
>
>         Attachments: LUCENE-4100.patch, LUCENE-4100.patch, contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with example queries and lucenebench, the package of Mike McCandless, resulting in very significant speedups.
> This ticket is to get the discussion started on including the implementation in Lucene's codebase. Because the technique requires awareness of it from the Lucene user/developer, it seems best for it to become a contrib/module package so that it can be consciously chosen.


