SolrCloud delete by query performance


SolrCloud delete by query performance

Ryan Cutter
I have a collection with 1 billion documents and I want to delete 500 of
them.  The collection has a dozen shards and a couple replicas.  Using Solr
4.4.

Sent the delete query via HTTP:

http://hostname:8983/solr/my_collection/update?stream.body=
<delete><query>source:foo</query></delete>
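[Editor's note: for anyone copying this, the XML can also be sent as a POST body rather than a `stream.body` URL parameter, which avoids URL-encoding issues. A minimal sketch using Python's standard library, with the hostname and collection name taken as placeholders from the message above; it only builds the request and does not send it, since that needs a running Solr node:]

```python
import urllib.request

# Placeholders from the original message; adjust for your cluster.
solr_url = "http://hostname:8983/solr/my_collection/update"

# The same delete-by-query, sent as a POST body instead of stream.body.
body = "<delete><query>source:foo</query></delete>".encode("utf-8")

req = urllib.request.Request(
    solr_url,
    data=body,
    headers={"Content-Type": "text/xml"},
)

# urllib.request.urlopen(req) would actually send it.
print(req.get_method())  # POST, because a data payload is attached
```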

Took a couple minutes and several replicas got knocked into Recovery mode.
They eventually came back and the desired docs were deleted but the cluster
wasn't thrilled (high load, etc).

Is this expected behavior?  Is there a better way to delete documents that
I'm missing?

Thanks, Ryan

Re: SolrCloud delete by query performance

Shawn Heisey-2
On 5/20/2015 5:41 PM, Ryan Cutter wrote:

> I have a collection with 1 billion documents and I want to delete 500 of
> them.  The collection has a dozen shards and a couple replicas.  Using Solr
> 4.4.
>
> Sent the delete query via HTTP:
>
> http://hostname:8983/solr/my_collection/update?stream.body=
> <delete><query>source:foo</query></delete>
>
> Took a couple minutes and several replicas got knocked into Recovery mode.
> They eventually came back and the desired docs were deleted but the cluster
> wasn't thrilled (high load, etc).
>
> Is this expected behavior?  Is there a better way to delete documents that
> I'm missing?

That's the correct way to do the delete.  Before you'll see the change,
a commit must happen in one way or another.  Hopefully you already knew
that.
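[Editor's note: to make the commit point concrete, here is a hedged sketch of the two requests involved, again using the placeholder hostname and collection from the thread. The commit can be a separate follow-up request, or `commit=true` can be appended to the delete request itself:]

```python
from urllib.parse import urlencode

base = "http://hostname:8983/solr/my_collection/update"

# The delete itself (POSTed as an XML body, as in the original message)...
delete_body = "<delete><query>source:foo</query></delete>"

# ...is not visible to searchers until a commit happens.  One option is an
# explicit follow-up commit request; another is appending commit=true to
# the delete request itself.
commit_url = base + "?" + urlencode({"commit": "true"})
print(commit_url)
```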

I believe that your setup has some performance issues that are making it
very slow and knocking out your Solr nodes temporarily.

The most common root problems with SolrCloud and indexes going into
recovery are:  1) Your heap is enormous but your garbage collection is
not tuned.  2) You don't have enough RAM, separate from your Java heap,
for adequate index caching.  With a billion documents in your
collection, you might even be having problems with both.

Here's a wiki page that includes some info on both of these problems,
plus a few others:

http://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn


Re: SolrCloud delete by query performance

Ryan Cutter
GC is operating the way I think it should but I am lacking memory.  I am
just surprised because indexing is performing fine (documents going in) but
deletions are really bad (documents coming out).

Is it possible these deletes are hitting many segments, each of which I
assume must be re-built?  And if there isn't much slack memory lying
around to begin with, there's a bunch of contention/swap?

Thanks Shawn!

On Wed, May 20, 2015 at 4:50 PM, Shawn Heisey <[hidden email]> wrote:

> [quoted message trimmed; see Shawn's reply above]

Re: SolrCloud delete by query performance

Shawn Heisey-2
On 5/20/2015 5:57 PM, Ryan Cutter wrote:
> GC is operating the way I think it should but I am lacking memory.  I am
> just surprised because indexing is performing fine (documents going in) but
> deletions are really bad (documents coming out).
>
> Is it possible these deletes are hitting many segments, each of which I
> assume must be re-built?  And if there isn't much slack memory laying
> around to begin with, there's a bunch of contention/swap?

A deleteByQuery must first query the entire index to determine which IDs
to delete.  That's going to hit every segment.  In the case of
SolrCloud, it will also hit at least one replica of every single shard
in the collection.

If the data required to satisfy the query is not already sitting in the
OS disk cache, then the actual disk must be read.  When RAM is extremely
tight, any disk operation will erase relevant data out of the OS disk
cache, so the next time it is needed, it must be read off the disk
again.  Disks are SLOW.  What I am describing is not swap, but the
performance impact is similar to swapping.

The actual delete operation (once the IDs are known) doesn't touch any
segments ... it writes Lucene document identifiers to a .del file, and
that file is consulted on all queries.  Any deleted documents found in
the query results are removed.
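[Editor's note: the mechanism Shawn describes can be illustrated with a toy model. This is only an analogy, not Lucene's actual data structures: a delete flips a per-document mark (standing in for the .del bitset), and every search filters marked docs from its results; the segment data itself is never rewritten until a merge drops the deleted docs.]

```python
# Toy model of tombstone-style deletes (an analogy, not real Lucene).
class Segment:
    def __init__(self, docs):
        # docs: mapping of internal doc id -> stored fields
        self.docs = docs
        self.deleted = set()   # stands in for the .del bitset

    def delete_by_query(self, predicate):
        # Query the whole segment to find matching ids, then mark them.
        for doc_id, fields in self.docs.items():
            if predicate(fields):
                self.deleted.add(doc_id)

    def search(self, predicate):
        # Deleted docs still sit in the segment files; they are simply
        # filtered out of every result set.
        return [doc_id for doc_id, fields in self.docs.items()
                if predicate(fields) and doc_id not in self.deleted]

seg = Segment({0: {"source": "foo"}, 1: {"source": "bar"}, 2: {"source": "foo"}})
seg.delete_by_query(lambda f: f["source"] == "foo")
print(seg.search(lambda f: True))   # docs 0 and 2 are masked, not rewritten
```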

Thanks,
Shawn


Re: SolrCloud delete by query performance

Ryan Cutter
Shawn, thank you very much for that explanation.  It helps a lot.

Cheers, Ryan

On Wed, May 20, 2015 at 5:07 PM, Shawn Heisey <[hidden email]> wrote:

> [quoted message trimmed; see Shawn's reply above]