Re: [jira] [Commented] (LUCENE-8929) Early Terminating CollectorManager

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view

Re: [jira] [Commented] (LUCENE-8929) Early Terminating CollectorManager

Erick Erickson
Brief comment (without reading all the comments, so FWIW) on segment merging and sort order. The default TMP will not guarantee anything; the “cost” of a merge takes into account a number of things. I _think_ the index will tend towards retaining the insertion order in an insert-only index, but by no means is that guaranteed (or even checked).

That said, if it’s an insert-only index, extending TMP or creating an “InsertOnlyMergePolicy” would be pretty easy. You’d want to borrow a few concepts from TMP when it comes to merging “like sized” segments rather than rewriting hugely dissimilar segments, and maybe make the segment max larger, but it should be pretty straight-forward. You’d also want to order the actual merge once the list of segments to merge was determined, which I’m not absolutely sure happens (no evidence one way or the other).

All this may be totally irrelevant, and for all I know there’s already one at the Lucene level, haven’t checked.


> On Mar 12, 2020, at 09:29, Michael Sokolov (Jira) <[hidden email]> wrote:
>    [ ]
> Michael Sokolov commented on LUCENE-8929:
> -----------------------------------------
> Thanks for the insightful comments, [~jim.ferenczi], you've given me a lot to think about! I had not really considered sorting segments: that makes a lot of sense when documents are at least roughly inserted in sort order. I would have thought merges would interfere with that opto, but I guess for the most part it works out? The performance improvements you saw are stunning. It would be great if we could get the segment sorting ideas merged into the Lucene code base, no? I wonder how we determine when they are applicable though. In Elasticsearch is it done based on some a-priori knowledge, or do you analyze the distribution and turn on the opto automatically? That would be compelling I think. On the other hand, the use case inspiring this does not tend to correlate index sort order and insertion order, so I don't think it would benefit as much from segment sorting (except due to chance, or in special cases), so I think these are really two separate optimizations and issues. We should be sure to structure the code in such a way that can accomodate them all and properly choose which one to apply. We don't have a formal query planner in Lucene, but I guess we are beginning to evolve one.
> I think the idea of splitting collectors is a good one, to avoid overmuch complexity in a single collector, but there is also a good deal of shared code across these. I can give that a try and see what it looks like.
> By the way, I did also run a test using luceneutil's "modification timestamp" field as the index sort and saw similar gains. I think that field is more tightly correlated with insertion order, and also has much higher cardinality, so it makes a good counterpoint: I'll post results here later once I can do a workup.
> I hear your concern about the non-determinism due to tie-breaking, but I * think* this is accounted for by including (global) docid in the comparison in MaxScoreTerminator.LeafState? I may be missing something though. It doesn't seem we have a good unit test checking for this tiebreak. I'll add to TestTopFieldCollector.testRandomMaxScoreTermination to make sure that case is covered.
> I'm not sure what to say about the `LeafFieldComparator` idea - it sounds powerful, but I am also a bit leery of these complex Comparators - they make other things more difficult since it becomes challenging to reason about the sort order "from the outside". I had to resort to some "instanceof" hackery to restrict consideration to cases where the comparator is numeric, and extracting the sort value from the comparator is pretty messy too. We pay a complexity cost here to handle some edge cases of more abstract comparators.  
>> Early Terminating CollectorManager
>> ----------------------------------
>>                Key: LUCENE-8929
>>                URL:
>>            Project: Lucene - Core
>>         Issue Type: Sub-task
>>           Reporter: Atri Sharma
>>           Priority: Major
>>         Time Spent: 7h 20m
>> Remaining Estimate: 0h
>> We should have an early terminating collector manager which accurately tracks hits across all of its collectors and determines when there are enough hits, allowing all the collectors to abort.
>> The options for the same are:
>> 1) Shared total count : Global "scoreboard" where all collectors update their current hit count. At the end of each document's collection, collector checks if N > threshold, and aborts if true
>> 2) State Reporting Collectors: Collectors report their total number of counts collected periodically using a callback mechanism, and get a proceed or abort decision.
>> 1) has the overhead of synchronization in the hot path, 2) can collect unnecessary hits before aborting.
>> I am planning to work on 2), unless objections
> --
> This message was sent by Atlassian Jira
> (v8.3.4#803005)
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]

To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]