force deletes - terms enum still has deleted terms?

Rob Audenaerde
Hi all,

We build an FST on the terms of our index by iterating over the terms of the
readers for our fields, like this:

    for (final LeafReaderContext ctx : leaves) {
        final LeafReader leafReader = ctx.reader();

        for (final String indexField : indexFields) {
            final Terms terms = leafReader.terms(indexField);
            // If the field does not exist in this reader, we get null, so check for that.
            if (terms != null) {
                final TermsEnum termsEnum = terms.iterator();

However, building the FST sometimes seems to pick up terms that come from
deleted documents. According to the javadocs, this is expected.
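
The behavior described above can be illustrated with a small standalone model. This is only a sketch, not Lucene's API: the class ToySegment and all of its methods are hypothetical names invented for illustration. It assumes the usual picture of a segment in which the term dictionary is written once and a delete only clears a "live" bit for the document, so terms from deleted documents stay visible until the segment is rewritten. In Lucene itself, the corresponding pieces would be LeafReader.getLiveDocs() and the postings of each term.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of a single index segment. Hypothetical names; this is not Lucene's API. */
class ToySegment {
    // term -> ids of the documents containing it; never edited when a document is deleted
    private final TreeMap<String, List<Integer>> postings = new TreeMap<>();
    // a set bit means the document is live; a delete only clears the bit
    private final BitSet liveDocs = new BitSet();

    void addDoc(int docId, String... terms) {
        liveDocs.set(docId);
        for (String term : terms) {
            postings.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
        }
    }

    void deleteDoc(int docId) {
        liveDocs.clear(docId);
    }

    /** Every term in the dictionary, including terms found only in deleted documents. */
    List<String> allTerms() {
        return new ArrayList<>(postings.keySet());
    }

    /** Only the terms that still occur in at least one live document. */
    List<String> liveTerms() {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> entry : postings.entrySet()) {
            for (int docId : entry.getValue()) {
                if (liveDocs.get(docId)) {
                    result.add(entry.getKey());
                    break;
                }
            }
        }
        return result;
    }

    /** Rewrites the segment with live documents only, roughly what a merge does (ids kept for simplicity). */
    ToySegment rewriteWithoutDeletes() {
        ToySegment merged = new ToySegment();
        for (Map.Entry<String, List<Integer>> entry : postings.entrySet()) {
            for (int docId : entry.getValue()) {
                if (liveDocs.get(docId)) {
                    merged.addDoc(docId, entry.getKey());
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        ToySegment segment = new ToySegment();
        segment.addDoc(0, "apple", "banana");
        segment.addDoc(1, "banana", "cherry");
        segment.deleteDoc(1);

        System.out.println(segment.allTerms());   // [apple, banana, cherry]
        System.out.println(segment.liveTerms());  // [apple, banana]
        System.out.println(segment.rewriteWithoutDeletes().allTerms()); // [apple, banana]
    }
}
```

In this model, "cherry" survives in the plain term listing after document 1 is deleted, and disappears only when the terms are filtered against the live bits or the segment is rewritten.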

So, now we switched the IndexWriter to a config with a TieredMergePolicy
with: setForceMergeDeletesPctAllowed(0).

When calling indexWriter.forceMergeDeletes(true), we expect that no deleted
documents will remain. However, the deleted terms still sometimes appear. We
use DirectoryReader.openIfChanged() to refresh the reader before iterating
the terms.

Are we forgetting something?

Thanks in advance.
Rob Audenaerde
Re: force deletes - terms enum still has deleted terms?

Erick Erickson
You might be hitting a rounding error. When this happens, how many
deleted documents are there in the remaining segments? 1?

The calculation for whether to merge the segment is:

    double pctDeletes = 100. * ((double) deleted_docs_in_segment /
        (double) doc_count_in_segment_including_deleted_docs);
    if (pctDeletes > forceMergeDeletesPctAllowed) { merge the segment }
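
To see how this check behaves with a threshold of 0, the calculation can be restated as a small standalone sketch. The class and method names here are made up for illustration, and this is a restatement of the formula as quoted in this message, not Lucene's actual source. It assumes plain double arithmetic and the strict "greater than" comparison shown above.

```java
/** Toy restatement of the percent-deleted check quoted above; not Lucene source code. */
class PctDeletesSketch {

    /** Percentage of a segment's documents (deleted ones included in the total) that are deleted. */
    static double pctDeletes(int deletedDocs, int docCountIncludingDeleted) {
        return 100.0 * ((double) deletedDocs / (double) docCountIncludingDeleted);
    }

    /** True when the segment would be selected under the strict "greater than" comparison. */
    static boolean wouldMerge(int deletedDocs, int docCountIncludingDeleted, double pctAllowed) {
        return pctDeletes(deletedDocs, docCountIncludingDeleted) > pctAllowed;
    }

    public static void main(String[] args) {
        // One deleted document in a one-million-document segment, threshold 0.
        System.out.println(wouldMerge(1, 1_000_000, 0.0));
        // No deletions: nothing to reclaim, so the segment is not selected.
        System.out.println(wouldMerge(0, 1_000_000, 0.0));
        // The same single deletion under a 10% threshold leaves the segment alone.
        System.out.println(wouldMerge(1, 1_000_000, 10.0));
    }
}
```

Under these assumptions, a single deleted document in a large segment still yields a value above 0, so it is worth comparing this against the deleted-document counts actually seen in the remaining segments.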

At any rate, calling forceMerge instead (which goes through the merge
policy's findForcedMerges) will purge all deleted docs no matter what.

NOTE: as of 7.5, the behavior has changed in that both of these
methods respect the maximum segment size by default. Prior to
7.5, either of them could produce a single segment from all the
segments that were merged (all of them in forceMerge, all with > n%
deleted docs in forceMergeDeletes). If you require a single segment as
the result, you can specify a maximum segment count of 1, i.e. forceMerge(1).

See LUCENE-7976 for all the gory details of this change if you're curious.

Best,
Erick
(Reply to Rob Audenaerde's message of Fri, Sep 28, 2018, 5:41 AM.)
