Solr Deletes


Solr Deletes

Dwane Hall
Hey Solr users,



I'd really appreciate some community advice if somebody can spare some time to assist me. My question relates to initially deleting a large amount of unwanted data from a SolrCloud collection, and then to the best patterns for managing delete operations on a regular basis. We have a situation where data in our index can be 're-mastered', and as a result orphan records are left dormant and unneeded in the index (think of a scenario similar to client resolution, where an entity can switch between golden records depending on the information available at the time). I'm considering removing these dormant records with a large initial bulk delete, and then running a delete process on a regular maintenance basis. The initial record backlog is ~50 million records in a ~1.2 billion document index (~4%), and the maintenance deletes are small in comparison at ~20,000/week.



So with this scenario in mind, I'm wondering what my best approach is for the initial bulk delete:

  1.  Do nothing with the initial backlog and remove the unwanted documents during the next large reindexing process?
  2.  Delete by query (DBQ) with a specific delete query using the document IDs?
  3.  Delete by ID (DBID)?

Are there any significant performance advantages to using DBID over a specific DBQ? Should I break the delete operations up into batches of, say, 1,000, 10,000, 100,000, or N DOC_IDs at a time if I take this approach?



The Solr Reference Guide mentions that DBQ ignores the commitWithin parameter, but that you can specify multiple documents to remove with an OR (||) clause in a DBQ, i.e.


Option 1 – Delete by id

{"delete":["<id1>","<id2>"]}



Option 2 – Delete by query (commitWithin ignored)

{"delete":{"query":"DOC_ID:(<id1> || <id2>)"}}



Shawn also provides a great explanation of the DBQ process in this user group post from 2015 (https://lucene.472066.n3.nabble.com/SolrCloud-delete-by-query-performance-td4206726.html)



I follow the Solr release notes fairly closely and also noticed this excellent addition and discussion from Hossman and committers in the Solr 8.5 release; it looks ideal for this scenario (https://issues.apache.org/jira/browse/SOLR-14241). Unfortunately, we're still on the 7.7.2 branch and unable to take advantage of the streaming deletes feature.



If I do implement a weekly delete maintenance regime, is there any advice the community can offer from experience? I'll definitely want to avoid times of heavy indexing, but how do deletes affect query performance? Will users notice decreased performance during delete operations, meaning they should be avoided during peak query windows as well?



As always, any advice is greatly appreciated.



Thanks,



Dwane



Environment

SolrCloud 7.7.2, 30 shards, 2 replicas

~3 qps during peak times

Re: Solr Deletes

Emir Arnautović
Hi Dwane,
DBQ does not play well with concurrent updates - it'll block updates on replicas, causing replicas to fall behind, trigger full replication, and potentially OOM. My advice is to go with cursors (or, even better, use some DB as the source of IDs) and DBID with some batching. You'll need some tests to see which batch size is best in your case.
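
A minimal sketch of the cursor-plus-DBID approach might look like this (the collection, query, and batch size are assumptions to tune, not values from this thread):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorDelete {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Arrays.asList("zk1:2181"), Optional.empty()).build()) {
            client.setDefaultCollection("myCollection");

            SolrQuery q = new SolrQuery("status:orphan"); // selects the dormant docs (placeholder query)
            q.setFields("id");
            q.setRows(1000);                              // batch size to tune
            q.setSort(SolrQuery.SortClause.asc("id"));    // cursors require a sort on the uniqueKey

            String cursor = CursorMarkParams.CURSOR_MARK_START;
            while (true) {
                q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
                QueryResponse rsp = client.query(q);

                List<String> ids = new ArrayList<>();
                for (SolrDocument doc : rsp.getResults()) {
                    ids.add((String) doc.getFieldValue("id"));
                }
                if (!ids.isEmpty()) {
                    client.deleteById(ids);               // batched DBID; commits left to autoCommit
                }

                String next = rsp.getNextCursorMark();
                if (cursor.equals(next)) break;           // cursor stops advancing when exhausted
                cursor = next;
            }
        }
    }
}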

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/





Re: Solr Deletes

Erick Erickson
In reply to this post by Dwane Hall
Dwane:

DBQ for very large deletes is “iffy”. The problem is this: Solr must lock out _all_ indexing for _all_ replicas while the DBQ runs, and that can take a long time. This is a consequence of distributed computing. Imagine a scenario where one of the documents affected by the DBQ is added by some other process. That add has to be processed in order relative to the DBQ, but the DBQ can take a long time to find and delete the docs. And this has other implications: if updates don’t complete in a timely manner, the leader can throw the replicas into recovery...

So best practice is to go ahead and use delete-by-id. Do note that this means you’re responsible for resolving the issue above, but in your case it sounds like you’re guaranteed that none of the docs being deleted will be modified during the operation, so you can ignore it.

What I’d do is use streaming to get my IDs (this is not using the link you provided; it’s essentially doing that patch yourself, but on the client) and use that to generate delete-by-id requests. This is just something like:

create search stream source
while (more tuples) {
   assemble delete-by-id request (perhaps one with multiple IDs)
   send to Solr
}
Don’t forget to send the last batch of deletes if you’re sending batches. I have ;)
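
Concretely, a sketch of that loop with the SolrJ streaming classes might look like the following (the node URL, collection, and query are placeholders, and the /export handler is assumed so the full result set streams back):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.stream.SolrStream;
import org.apache.solr.common.params.ModifiableSolrParams;

public class StreamingDelete {
    public static void main(String[] args) throws Exception {
        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("q", "status:orphan");  // selects the docs to delete (placeholder query)
        params.set("fl", "id");
        params.set("sort", "id asc");
        params.set("qt", "/export");       // /export streams every match, not just one page

        String base = "http://solr1:8983/solr/myCollection"; // placeholder node/collection
        SolrStream stream = new SolrStream(base, params);
        try (SolrClient client = new HttpSolrClient.Builder(base).build()) {
            stream.open();
            List<String> batch = new ArrayList<>();
            Tuple tuple;
            while (!(tuple = stream.read()).EOF) {
                batch.add(tuple.getString("id"));
                if (batch.size() >= 1000) {    // ~1,000 IDs per request
                    client.deleteById(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {            // the easy-to-forget final batch
                client.deleteById(batch);
            }
        } finally {
            stream.close();
        }
    }
}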

Joel Bernstein’s blog is the most authoritative source, see: https://joelsolr.blogspot.com/2015/04/the-streaming-api-solrjio-basics.html. I don’t know whether that example is up to date, but it’ll give you an idea of where to start. And Joel is pretty responsive about questions….
 
I'd package up maybe 1,000 IDs per request. I regularly package up that many updates, and deletes are relatively cheap. You’ll avoid the overhead of establishing a request for every ID. This may seem contrary to the points above about DBQ taking a long time, but we’re talking orders-of-magnitude differences between deleting 1,000 docs by ID and querying/deleting vastly larger numbers, plus this does not require all the indexes to be locked.

Your users likely won’t notice this running, so while it’s usually good practice to do maintenance during off hours, I wouldn’t stress about it.

And a question you didn’t ask for extra credit ;). The streaming expression will _not_ reflect any changes to the collection while it runs. The underlying index searcher is kept open for the duration and it only knows about segments that were closed when it started. But let’s assume your autocommit interval expires while this process is running and opens a new searcher. _Other_ requests from other clients _will_ see the changes. Again, I doubt you care since I’m assuming your orphan records are never seen by other clients anyway.

Best,
Erick



Re: Solr Deletes

Bram Van Dam
On 26/05/2020 14:07, Erick Erickson wrote:
> So best practice is to go ahead and use delete-by-id.


I've noticed that this can cause issues when using implicit routing, at
least on 7.x. Though I can't quite remember whether the issue was a
performance issue, or whether documents would sometimes not get deleted.

In either case, I worked around it by doing something like this:

import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.params.ShardParams;

UpdateRequest req = new UpdateRequest();
req.deleteById(id);                        // DBID for a single document
req.setCommitWithin(-1);                   // disable commitWithin; rely on autoCommit
req.setParam(ShardParams._ROUTE_, shard);  // route the delete to the shard holding the doc
req.process(solrClient, collection);       // send it (solrClient/collection supplied by the caller)

Maybe that'll help if you run into either of those issues.

 - Bram

Re: Solr Deletes

Dwane Hall
Thank you very much Erick, Emir, and Bram; this is extremely useful advice and I sincerely appreciate everyone’s input!


Before I received your responses, I ran a controlled DBQ test in our DR environment, and exactly what you said occurred. It was like reading a step-by-step playbook of events, with heavy blocking on the Solr nodes and lots of threads going into a TIMED_WAITING state. Several shards were pushed into recovery mode and things were starting to get ugly, fast!


I'd read snippets in blog posts and JIRA tickets about DBQ being a blocking operation, but I did not expect that such a specific DBQ (i.e. by IDs) would operate so differently from DBID (which I expected to block as well). Boy, was I wrong! They're used interchangeably in the Solr ref guide examples, so it’s very useful to understand the performance implications of each. Additionally, none of the information I found on delete operations mentioned query performance, so I was unsure of their impact in this dimension.


Erick, thanks again for your comprehensive response; your blogs and user group responses are always a pleasure to read, and I'm constantly picking up useful pieces of information that I use on a daily basis in managing our Solr/Fusion clusters. Additionally, I've been looking for an excuse to use streaming expressions, and I did not think to use them the way you suggested. I've watched quite a few of Joel's presentations on YouTube, and his blog is brilliant. Streaming expressions are expanding with every Solr release; they really are a very exciting part of Solr's evolution. Your final point on searcher state while streaming expressions are running, and its relationship with new searchers, is a very interesting additional piece of information I’ll add to the toolbox. Thank you.



At the moment we're fortunate to have all the IDs of the documents to remove in a DB, so I'll be able to construct batches of DBID requests relatively easily and store them in a backlog table for processing, without needing to traverse Solr with cursors, streaming, or other means to identify them. We follow a similar approach for updates, in batches of around ~1,000 docs/batch. Inspiration for that sweet spot came once again from reading one of Erick's Lucidworks blog posts and testing (https://lucidworks.com/post/really-batch-updates-solr-2/).
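
As a sketch, with the IDs already in a DB the batching loop itself is trivial; something like the following (fetchBacklogIds stands in for the backlog-table read, and the URL is a placeholder):

import java.util.Collections;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class BacklogDelete {
    // Stand-in for reading IDs from the backlog table (JDBC etc. omitted).
    static List<String> fetchBacklogIds() { return Collections.emptyList(); }

    public static void main(String[] args) throws Exception {
        List<String> ids = fetchBacklogIds();
        try (SolrClient client = new HttpSolrClient.Builder(
                "http://solr1:8983/solr/myCollection").build()) {
            int batchSize = 1000; // the ~1,000 docs/batch sweet spot mentioned above
            for (int i = 0; i < ids.size(); i += batchSize) {
                client.deleteById(ids.subList(i, Math.min(i + batchSize, ids.size())));
            }
        }
    }
}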



Again, thanks to the community and users for everyone’s contributions on this issue; it is very much appreciated.


Successful Solr-ing to all,


Dwane


Re: Solr Deletes

Cassandra Targett
I’m coming in a little late, but as of 8.5 there is a new streaming expression designed for DBQ situations that basically does what Erick was suggesting: it gets a list of IDs for a query, then does a delete by ID: https://lucene.apache.org/solr/guide/8_5/stream-decorator-reference.html#delete.

It won’t help if you’re not on 8.5, but going forward it will be a good option for large delete sets.
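
An invocation of that decorator looks roughly like the following (the collection name and query are placeholders; see the linked ref guide for the exact parameters):

delete(myCollection,
       batchSize=500,
       search(myCollection,
              q="status:orphan",
              fl="id",
              sort="id asc",
              qt="/export"))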