[jira] [Commented] (LUCENE-4752) Merge segments to sort them

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13607282#comment-13607282 ]

Shai Erera commented on LUCENE-4752:
------------------------------------

What in the patch guarantees that every segment with more than maxBufferedDocs documents is sorted? Perhaps I missed it, but I looked for code that ensures every such segment gets picked up by SortingMP and didn't find it.

I don't think we should, in general, make assumptions based on the maxBufferedDocs setting: the default in IWC flushes by RAM consumption, and the setting seems only loosely related to whether a segment is sorted. For example, if a segment is sorted but has deletions such that numDocs < maxBufferedDocs, we would do a full collection even though we could early terminate as usual?

I think EarlyTerminatingCollector need not have getFullCollector. Rather, it should wrap any other Collector (not limited to top docs) and, if it detects a sorted segment in setNextReader (we still need to figure out how to detect that), early terminate after enough docs were seen; otherwise it keeps calling in.collect(). It's the app's responsibility to wrap its collector (which could be a ChainingCollector too) with this collector, and to make sure that the early termination logic fits its collectors. So I don't think we need EarlyTerminationTopDocsCollector, but rather a concrete EarlyTerminatingCollector. BTW, EarlyTerminationTopDocsCollector has an uninitialized and unused maxUnsortedSize?
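
To make the wrapping idea concrete, here is a rough sketch. It deliberately does not use the real Lucene classes: SegmentCollector, SegmentTerminatedException, the 'sorted' flag passed to setNextSegment and numDocsToCollect are all hypothetical stand-ins, since how to detect a sorted segment is still an open question.

```java
/** Hypothetical stand-in for Lucene's Collector; not the real API. */
interface SegmentCollector {
  /** Called once per segment. The 'sorted' flag is an assumption of this sketch. */
  void setNextSegment(boolean sorted);

  void collect(int doc);
}

/** Hypothetical signal that the current segment needs no further collection. */
class SegmentTerminatedException extends RuntimeException {}

/** Wraps any collector; stops a sorted segment once enough docs were seen. */
class EarlyTerminatingCollector implements SegmentCollector {
  private final SegmentCollector in;
  private final int numDocsToCollect;
  private boolean segmentSorted;
  private int numCollected;

  EarlyTerminatingCollector(SegmentCollector in, int numDocsToCollect) {
    if (numDocsToCollect <= 0) {
      throw new IllegalArgumentException("numDocsToCollect must be positive");
    }
    this.in = in;
    this.numDocsToCollect = numDocsToCollect;
  }

  @Override
  public void setNextSegment(boolean sorted) {
    in.setNextSegment(sorted);
    segmentSorted = sorted;
    numCollected = 0; // the count is per segment
  }

  @Override
  public void collect(int doc) {
    in.collect(doc); // always delegate to the wrapped collector first
    if (segmentSorted && ++numCollected >= numDocsToCollect) {
      // Sorted segment: later docs cannot be better, so stop here.
      throw new SegmentTerminatedException();
    }
    // Unsorted segment: fall through and keep collecting everything.
  }
}
```

The caller that drives collection would catch SegmentTerminatedException and move on to the next segment; the wrapped collector never needs to know about it.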

And hopefully we can stuff the early termination logic down to IndexSearcher eventually. There are other scenarios for early termination, such as time limit, and therefore I think it's ok if we have an EarlyTerminationException which IndexSearcher responds to.
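
As a rough illustration of the searcher reacting to such an exception, here is a sketch. SketchSearcher, DocCollector and BudgetCollector are hypothetical names and not Lucene's API, and a fixed document budget stands in for a time limit to keep the example deterministic.

```java
/** Hypothetical exception a collector throws to stop the whole search. */
class EarlyTerminationException extends RuntimeException {
  EarlyTerminationException(String reason) {
    super(reason);
  }
}

/** Hypothetical stand-in for a collector callback. */
interface DocCollector {
  void collect(int doc);
}

/** Hypothetical stand-in for IndexSearcher's per-segment search loop. */
class SketchSearcher {
  /** Returns true if the search ran to completion, false if it was cut short. */
  static boolean search(int[][] segments, DocCollector collector) {
    try {
      for (int[] segment : segments) {
        for (int doc : segment) {
          collector.collect(doc);
        }
      }
      return true;
    } catch (EarlyTerminationException e) {
      // The searcher handles the signal; hits collected so far remain usable.
      return false;
    }
  }
}

/** Example trigger: a document budget (a time limit would work the same way). */
class BudgetCollector implements DocCollector {
  private final int budget;
  int collected;

  BudgetCollector(int budget) {
    this.budget = budget;
  }

  @Override
  public void collect(int doc) {
    if (collected >= budget) {
      throw new EarlyTerminationException("budget exhausted");
    }
    collected++;
  }
}
```

With the handling inside the searcher, the application only has to check whether the search completed, rather than catching the exception itself.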

Adrien, to keep the patch small, could you perhaps commit the changes to LTC and TestDuelingCodec (as well as the SortingAtomicReader.wrap change) separately? These are good changes to commit anyway, and they only bloat the patch and mask the development of the actual issue. Is that possible?
               

> Merge segments to sort them
> ---------------------------
>
>                 Key: LUCENE-4752
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4752
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/index
>            Reporter: David Smiley
>            Assignee: Adrien Grand
>         Attachments: LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, natural_10M_ingestion.log, sorting_10M_ingestion.log
>
>
> It would be awesome if Lucene could write the documents out in a segment based on a configurable order.  This of course applies to merging segments too. The benefit is increased locality on disk of documents that are likely to be accessed together.  This often applies to documents near each other in time, but also spatially.
