Help with huge index

Stuart Goldberg
I have a huge Lucene index. On disk it's about 24 GB.

I have a purging routine that is supposed to run and purge old docs.

There are about 650 million docs in there, and through testing I have
determined that about 1/3 of these need to be purged.

During the purge, every so often it's apparently doing some flushing and
applying deletes. This makes the process appear to hang. I know it's not
actually hanging but doing work, because I have enabled infoStream and I am
getting messages every so often (every 5 minutes).

Is there some trick (index config) I can employ to get this to work faster?

Stuart M Goldberg

Re: Help with huge index

Adrien Grand
What do you mean by purging? What methods do you call?

(In reply to Stuart Goldberg's message of Wed, Feb 28, 2018 at 19:34.)

Re: Help with huge index

Stuart Goldberg
I call deleteDocuments

(In reply to Adrien Grand's message of Feb 28, 2018, 8:16 PM.)

Re: Help with huge index

Adrien Grand
Thanks. Deleting lots of documents can indeed trigger a lot of work on the
Lucene side. First, Lucene likely needs to rewrite the live docs of all your
segments. This might then trigger significant merging activity, because
Lucene tries to keep the number of deleted docs reasonable so that most disk
space is not spent on deleted docs. I can't think of settings that would
make it more efficient.
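
To make the "live docs" point concrete, here is a minimal toy model in plain
Java. It is only a sketch under stated assumptions: the class `ToySegment` and
its methods are invented for illustration and are not Lucene APIs, and a real
segment stores far more than timestamps. The idea it shows is that a delete
only flips a bit, while the space is reclaimed later, when a merge copies the
surviving documents into a new segment.

```java
import java.util.BitSet;

/** Toy model of one index segment. Hypothetical; NOT Lucene's implementation. */
final class ToySegment {
    private final long[] timestamps; // one entry per stored document
    private final BitSet liveDocs;   // bit set means the document is still live

    ToySegment(long[] timestamps) {
        this.timestamps = timestamps.clone();
        this.liveDocs = new BitSet(timestamps.length);
        this.liveDocs.set(0, timestamps.length); // every document starts out live
    }

    /** Marks documents older than the cutoff as deleted; returns how many were marked. */
    int deleteOlderThan(long cutoff) {
        int marked = 0;
        for (int doc = liveDocs.nextSetBit(0); doc >= 0; doc = liveDocs.nextSetBit(doc + 1)) {
            if (timestamps[doc] < cutoff) {
                liveDocs.clear(doc); // only the bit changes; the stored data stays
                marked++;
            }
        }
        return marked;
    }

    int storedDocs() { return timestamps.length; }

    int liveDocCount() { return liveDocs.cardinality(); }

    /** Simulates a merge: copies only the live documents into a new segment. */
    ToySegment mergeAwayDeletes() {
        long[] kept = new long[liveDocCount()];
        int i = 0;
        for (int doc = liveDocs.nextSetBit(0); doc >= 0; doc = liveDocs.nextSetBit(doc + 1)) {
            kept[i++] = timestamps[doc];
        }
        return new ToySegment(kept);
    }

    public static void main(String[] args) {
        ToySegment segment = new ToySegment(new long[] {10, 20, 30, 40, 50, 60});
        int marked = segment.deleteOlderThan(30); // purge one third of the docs
        System.out.println("marked=" + marked
                + " stored=" + segment.storedDocs()
                + " live=" + segment.liveDocCount());
        System.out.println("stored after merge=" + segment.mergeAwayDeletes().storedDocs());
    }
}
```

In this model the purge itself is cheap, and the expensive part is the copy in
`mergeAwayDeletes`, which is the kind of work described above.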

If you call deleteDocuments because you are, e.g., deleting data after a
given age, it would help to have time-based indices so that you could remove
an entire index at once rather than large portions of an index.
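
The time-based layout could be sketched roughly as follows. This is an
illustration only, assuming one index directory per day with a made-up naming
scheme (`index-YYYY-MM-DD`); the class `TimePartitions`, the prefix, and the
retention parameter are all hypothetical and do not come from Lucene. The
sketch only decides which partitions have aged out; dropping them would then
be a matter of closing any reader or writer on them and deleting the
directory.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for a one-index-per-day layout. Not part of Lucene. */
final class TimePartitions {
    /** Assumed naming scheme: "index-" followed by an ISO date, e.g. index-2018-02-28. */
    static final String PREFIX = "index-";

    /** Name of the partition that documents from the given day would be written to. */
    static String partitionFor(LocalDate day) {
        return PREFIX + day; // LocalDate.toString() yields yyyy-MM-dd
    }

    /**
     * Returns the partitions whose day lies more than retentionDays before today.
     * Names without the prefix are ignored; a malformed date after the prefix
     * makes LocalDate.parse throw a DateTimeParseException.
     */
    static List<String> expired(List<String> partitionNames, LocalDate today, int retentionDays) {
        LocalDate oldestKept = today.minusDays(retentionDays);
        List<String> result = new ArrayList<>();
        for (String name : partitionNames) {
            if (!name.startsWith(PREFIX)) {
                continue;
            }
            LocalDate day = LocalDate.parse(name.substring(PREFIX.length()));
            if (day.isBefore(oldestKept)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> names = List.of("index-2018-01-01", "index-2018-02-01", "index-2018-02-28");
        // With 30 days of retention on 2018-03-01, only the January partition has aged out.
        System.out.println(expired(names, LocalDate.of(2018, 3, 1), 30));
    }
}
```

The trade-off is that searches then have to span several indices, so this
layout mostly pays off when purging by age is the dominant maintenance task.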

(In reply to Stuart Goldberg's message of Thu, Mar 1, 2018 at 01:20.)

Re: Help with huge index

Stuart Goldberg
Thanks so much. I actually found that my purging routine finished after
about 35 minutes, which is perfectly acceptable given that this routine is
supposed to run overnight.

(In reply to Adrien Grand's message of Feb 28, 2018, 8:34 PM.)

Re: Help with huge index

Michael Sokolov-4
I wonder if you might get better performance in a case like this if you
were OK with taking your index offline, disabling merges, performing the
deletions, and only then re-enabling merges. This could be done on a copy of
the index if updates can be turned off or held in a queue, so that queries
could still be served during the maintenance.

However, it's largely a theoretical question, since it seems everything
worked OK for you in the end.
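
As a back-of-the-envelope illustration of why deferring merges might help,
here is a toy cost model in plain Java. Everything in it is an assumption made
for the sketch: the class `MergeCostModel`, the single-segment setup, and the
rule "rewrite the segment as soon as more than a given percentage of its docs
are deleted" are invented and are much cruder than a real merge policy, so
actual savings may be far smaller or absent. It counts how many live documents
get copied when rewrites are interleaved with delete batches versus done once
at the end.

```java
/** Toy cost model for merging during vs. after a bulk delete. Hypothetical, not Lucene. */
final class MergeCostModel {

    /**
     * Docs copied when the segment is rewritten as soon as more than
     * maxDeletedPct percent of its stored docs are marked deleted.
     */
    static long interleavedCost(long storedDocs, long[] deleteBatches, int maxDeletedPct) {
        long stored = storedDocs;
        long deleted = 0;
        long copied = 0;
        for (long batch : deleteBatches) {
            deleted += batch;
            if (deleted * 100 > stored * maxDeletedPct) {
                long live = stored - deleted;
                copied += live; // a rewrite copies every surviving document
                stored = live;
                deleted = 0;
            }
        }
        return copied;
    }

    /** Docs copied when all batches are applied first and the segment is rewritten once. */
    static long deferredCost(long storedDocs, long[] deleteBatches) {
        long deleted = 0;
        for (long batch : deleteBatches) {
            deleted += batch;
        }
        return storedDocs - deleted;
    }

    public static void main(String[] args) {
        long[] batches = {100, 100, 100}; // purge one third of 900 docs in three batches
        System.out.println("interleaved=" + interleavedCost(900, batches, 10));
        System.out.println("deferred=" + deferredCost(900, batches));
    }
}
```

With these made-up numbers the interleaved variant copies 2100 documents and
the deferred one 600, because survivors are copied once instead of after every
batch.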

(In reply to Stuart Goldberg's message of Feb 28, 2018, 8:37 PM.)