[jira] [Commented] (LUCENE-8255) Can we make index sorting work for soft deletes

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

[jira] [Commented] (LUCENE-8255) Can we make index sorting work for soft deletes

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-8255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439948#comment-16439948 ]

Jim Ferenczi commented on LUCENE-8255:
--------------------------------------

{quote}
This also means that sorting such a segment on merge has a significant overhead. (I hope Jim Ferenczi can shed some light on it how much we would have to expect).
{quote}

Now that we apply index sorting at flush time we guarantee that merges can perform a "merge sort" of already sorted segments. Before https://issues.apache.org/jira/browse/LUCENE-7579 we had to re-sort segments but only those that were produced by a flush. If we accept updatable values in the index sort fields this means that the worst case for a merge would be to re-sort all segments. The only way to sort a segment currently is to use a SortingLeafReader which basically holds everything in memory (doc_values, postings, ...). This worked fine before LUCENE-7579 because flushed segments are supposed to be small but if we need to sort big segments as well we'd need a more specialized class that don't load everything in memory.

{quote}
We also need to add some special casing since we use "merge sorting" and can't go backwards in doc ID which would be violated if a segment received updates. (cc Adrien Grand)
{quote}

I think it's fine if we are able to produce a "sorted" view of the segments and then use a "merge sort" on top of these views like we used to do before LUCENE-7579. The merges of stored fields and fields should be equivalent to a merge with deleted documents but the merge of points fields would be problematic.
Currently for single-dimension points we produce the new bkd tree from a sorted stream so the producer doesn't need to resort all the values. This is one of the main reason for the speed up that you can see here:
https://home.apache.org/~mikemccand/lucenebench/sparseResults.html#index_throughput (annotation V).
This is a bit counter intuitive that we need to resort the points values since they are already sorted but the reason is that we need to ensure that documents with the same values are sorted by doc ids in the tree and this assumption is invalidated if we need to remap doc ids during the merge. I guess it would be possible to optimize this by only resorting the documents that share the same values so that's not a blocker but something we need to keep in mind.

Regarding the overhead if we use index sorting as it is (or as it were) it should be at least the same https://home.apache.org/~mikemccand/lucenebench/sparseResults.html#index_throughput before annotation V and I expect it to be higher since all segments in a merge would need resorting.


> Can we make index sorting work for soft deletes
> -----------------------------------------------
>
>                 Key: LUCENE-8255
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8255
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Simon Willnauer
>            Priority: Major
>
> I phrased this as a question since it's mainly a discussion. I spoke to [~rcmuir] on a couple of occasions about making index sorting work for soft deletes. The issue that prevents this is that soft deletes use updateable DV to mark docs as deleted. This basically means that a sorted segment is not guaranteed to be sorted if it has received any updates. This also means that sorting such a segment on merge has a significant overhead. (I hope [~jimczi] can shed some light on it how much we would have to expect). We also need to add some special casing since we use "merge sorting" and can't go backwards in doc ID which would be violated if a segment received updates. (cc [~jpountz])
> The main purpose of doing this is that "soft deleted" documents would either be at the end or in the beginning of the segment such that compression is better if these docs have larger retention policies. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]