Very high number of deleted docs, part 2

Markus Jelsma-2
Hello,

We discussed this problem before [1]; we could not fix it, but it turned out my collection was rather small anyway. Thanks again.

Another collection, now on 7.1, also shows this problem and has default TMP settings. This time the size is different: each shard of this collection is over 40 GB, and each shard has about 50 % deleted documents. Each shard's largest segment is just under 20 GB with about 75 % deleted documents. After that come a few five/six GB segments with just under 50 % deleted documents.

What do I need to change to make Lucene believe that at least that twenty GB, three-month-old segment should be merged away? And what would the predicted indexing performance penalty be?

Regarding reindexing frequency, each document is reindexed at least once every 30 days, some more frequently. Updates are indexed every fifteen minutes or so.

Many thanks,
Markus

[1] http://lucene.472066.n3.nabble.com/Very-high-number-of-deleted-docs-td4357327.html

Re: Very high number of deleted docs, part 2

Shawn Heisey-2

Quick answer: Erick's statements in the previous thread can be summarized as this: on large indexes that do a lot of deletes or updates, once you do an optimize, you have to keep doing optimizes regularly, or you're going to have this problem.

TL;DR:

I think Erick covered most of this (possibly all of it) in the previous
thread.

If you've got a 20GB segment and TMP's settings are default, then that
means at some point in the past, you've done an optimize.  The default
TMP settings have a maximum segment size of 5GB, so if you never
optimize, then there will never be a segment larger than 5GB, and the
deleted document percentage would be less likely to get out of control. 
The optimize operation ignores the maximum segment size and reduces the
index to a single large segment with zero deleted docs.
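
To put the defaults in code form, here's a minimal Lucene-level sketch (the writer setup is illustrative; in Solr these knobs are normally set via mergePolicyFactory in solrconfig.xml, not through the API):

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.store.FSDirectory;

public class TmpDefaults {
    public static void main(String[] args) throws Exception {
        TieredMergePolicy tmp = new TieredMergePolicy();
        // Default ceiling for a merged segment: 5 GB. Natural merging
        // never builds a segment bigger than this.
        tmp.setMaxMergedSegmentMB(5 * 1024);

        IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
        iwc.setMergePolicy(tmp);

        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/path/to/index")), iwc)) {
            // forceMerge(1) is the Lucene call behind Solr's optimize: it
            // ignores maxMergedSegmentMB and rewrites the whole index into
            // a single segment with zero deleted docs.
            writer.forceMerge(1);
        }
    }
}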

TMP's behavior with really big segments is apparently completely as the
author intended, but this specific problem wasn't ever addressed.

If you do an optimize once and then don't ever do it again, any very
large segments are going to be vulnerable to this problem, and the only
way (currently) to fix it is to do another optimize.

See this issue for a more in-depth discussion and an attempt to figure
out how to avoid it:

https://issues.apache.org/jira/browse/LUCENE-7976

Thanks,
Shawn


RE: Very high number of deleted docs, part 2

Markus Jelsma-2
It could be that when this index was first reconstructed, it was optimized to one segment before it was packed and shipped.

How about optimizing it again with maxSegments set to ten? It should recover, right?
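
Concretely, something like this (a SolrJ sketch; the URL and collection name are placeholders, and the plain HTTP equivalent would be update?optimize=true&maxSegments=10):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class OptimizeToTen {
    public static void main(String[] args) throws Exception {
        try (SolrClient client =
                new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
            // optimize(collection, waitFlush, waitSearcher, maxSegments)
            // asks Lucene to force-merge down to at most ten segments.
            client.optimize("mycollection", true, true, 10);
        }
    }
}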


Re: Very high number of deleted docs, part 2

Erick Erickson
I'm not 100% sure that playing with maxSegments will work.

What will work is to re-index everything. You can re-index into the existing collection; no need to start with a new collection. Eventually you'll replace enough docs in the over-sized segments that they'll fall under the 2.5G live documents limit and be merged away. Not elegant, but it'd work.
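
In case it helps, a bare-bones SolrJ sketch of what I mean (URL, collection, and field names are made up; any indexing pipeline that re-adds documents under their existing uniqueKey does the same thing):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class Refeed {
    public static void main(String[] args) throws Exception {
        try (SolrClient client =
                new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-42");             // same uniqueKey as the old copy
            doc.addField("title", "re-fed document");
            // Re-adding under an existing uniqueKey marks the old copy as
            // deleted, so the over-sized segment's live-doc count shrinks
            // until it drops under the merge-eligibility ceiling.
            client.add("mycollection", doc);
            client.commit("mycollection");
        }
    }
}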

Best,
Erick


RE: Very high number of deleted docs, part 2

Markus Jelsma-2
Well, maxSegments with optimize or commit with expungeDeletes did not do the job in testing. But tell me more about this 2.5G live documents limit; I have no idea what it is.

Thanks,
Markus
 

Re: Very high number of deleted docs, part 2

Erick Erickson
There's some background here:
https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/

the 2.5 "live" document limit is really "50% of the max segment size",
hard-coded in TieredMergePolicy.
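
Back-of-the-envelope with the numbers from this thread (a sketch of the rule as described, not the literal TieredMergePolicy source):

public class MergeEligibility {
    public static void main(String[] args) {
        double maxMergedSegmentMB = 5 * 1024;                 // TMP default: 5 GB
        double eligibilityCeilingMB = maxMergedSegmentMB / 2; // 2.5 GB of live data

        double segmentMB = 20 * 1024;                         // the problem segment
        double deletedRatio = 0.75;                           // ~75% deleted
        double liveMB = segmentMB * (1 - deletedRatio);       // ~5 GB still live

        // ~5 GB live > 2.5 GB ceiling, so TMP still won't pick this
        // segment for a regular merge.
        System.out.println("eligible: " + (liveMB <= eligibilityCeilingMB));
    }
}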

bq: Well, maxSegments with optimize or commit with expungeDeletes did not
do the job in testing

Surprising. What actually happened? Do note that expungeDeletes does not promise to remove all deleted docs; it only merges segments with more than (some percentage) deleted documents.
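
At the Lucene level, expungeDeletes maps to IndexWriter.forceMergeDeletes(), and the percentage knob lives on the merge policy. A sketch (the writer setup is illustrative; Solr drives this via update?commit=true&expungeDeletes=true rather than code):

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.store.FSDirectory;

public class ExpungeSketch {
    public static void main(String[] args) throws Exception {
        TieredMergePolicy tmp = new TieredMergePolicy();
        // Only segments whose deleted-doc percentage exceeds this value
        // are candidates for forceMergeDeletes; 10.0 is the default.
        tmp.setForceMergeDeletesPctAllowed(10.0);

        IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
        iwc.setMergePolicy(tmp);

        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/path/to/index")), iwc)) {
            // This is what Solr's commit with expungeDeletes ends up calling.
            writer.forceMergeDeletes();
        }
    }
}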

Best,
Erick


RE: Very high number of deleted docs, part 2

Markus Jelsma-2
Yes, I made sure the large test segment had just over 10 % deleted documents. But all that expungeDeletes did was merge that segment with itself, making it just 10 % smaller. It makes sense though. Optimizing with maxSegments is also not a possibility; it will just merge the cheapest segments to fulfill the maxSegments requirement.

But, thinking of it, the production segment is over 75 % deleted. Using expungeDeletes on production should reduce that segment to about 5 GB of live data (20 GB × 25 % live), making it eligible for regular merging again, right?

Thanks,
Markus
