A couple of weeks ago, I ran into an unusual problem with Solr for which I could not find any previous discussion.
I have a 4-node Solr cluster with 2 collections, ‘A’ and ‘B’. Each collection has 1 shard and 3 replicas. Both collections are updated by a delta-import that pulls from a PostgreSQL database every 5 minutes. Collection ‘A’ is very small (~1.5k documents, ~7 MB) and no queries are run against it. Collection ‘B’ has ~90k documents (~500 MB) and carries a heavy query load during certain parts of the day. There is an automatic hard commit every 15 seconds. Both collections run a nightly full import during low query load without issue.
The problem began when a large delta on Collection ‘B’ caused nearly every document to be updated while the query load was high. Collection ‘B’ has 2 entity types, ‘1’ and ‘2’, in a ~1:3 ratio. The delta included both “adds” and “deletes”.
Looking at the logs, the data import process completed for entity ‘1’ but not for entity ‘2’. There were no errors, exceptions, or warnings in the log, and the telemetry did not show any of the cluster nodes running out of heap or disk space. A full import (or large delta) usually finishes well within 20 minutes, but this particular import ran for at least an hour.
More concerning, soon after the data import began to process entity ‘2’, all of the nodes in the cluster began continuously sending a high volume of /update add requests, each containing up to 200 document ids. This continued for at least 15 minutes and appears to have spiked CPU and GC on the cluster nodes and led to a high volume of query timeouts. Typically, an /update add message contains 1 (or rarely 2) documents.
The cluster was restarted in a rolling fashion (one node at a time), but this didn’t appear to resolve all of the issues. Only after all of the replicas were deleted and then re-added (through the Admin console) did the flood of /update requests subside.
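For reference, the delete and re-add step can also be expressed with the Collections API (DELETEREPLICA followed by ADDREPLICA) instead of the Admin console. The following is only a minimal sketch: the base URL, the shard, replica, and node names, and the helper functions are placeholders I made up for illustration, not what was actually run on the cluster.

```python
import urllib.parse

# Assumption: placeholder base URL; point this at any live node in the cluster.
SOLR = "http://localhost:8983/solr"


def collections_url(action, **params):
    """Build a Collections API URL for the given action (hypothetical helper)."""
    query = {"action": action, "wt": "json", **params}
    return f"{SOLR}/admin/collections?{urllib.parse.urlencode(query)}"


def replace_replica_urls(collection, shard, replica, node):
    """Return the DELETEREPLICA and ADDREPLICA URLs in the order to call them."""
    return [
        # Remove the existing replica (for example "core_node2") from the shard.
        collections_url("DELETEREPLICA", collection=collection,
                        shard=shard, replica=replica),
        # Add a fresh replica on the given node; it syncs from the leader.
        collections_url("ADDREPLICA", collection=collection,
                        shard=shard, node=node),
    ]


if __name__ == "__main__":
    # Example values are assumptions; real names come from the cluster state.
    for url in replace_replica_urls("B", "shard1", "core_node2", "host1:8983_solr"):
        print(url)
```

The sketch only builds the URLs so they can be inspected before anything is sent; each one would be issued as a plain HTTP GET, one replica at a time, so the shard keeps at least one active replica throughout.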
Has anyone ever observed this kind of behavior? Is there a known issue or a procedure to follow for getting a cluster out of this state?
I was able to reproduce the /update “adds” flood by starting a large delta, putting the cluster under heavy query load, and then forcing a second delta immediately after the first one finished. However, this is obviously not exactly the same event, because the large deltas ran to completion for both entity ‘1’ and entity ‘2’. In this case, forcing a commit seemed to reduce the volume of the large /update add messages, but didn’t completely eliminate them. Deleting and re-adding the replicas seemed to fix this issue as well.
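In case it helps anyone trying to reproduce this, the sequence can be sketched roughly as follows. This is a minimal sketch under several assumptions: the base URL, the `/dataimport` handler path, and the helper names are placeholders, and the heavy query load has to be generated separately (for example with a load-testing tool) while the first delta is running.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumptions: placeholder base URL and collection name; the DataImportHandler
# is assumed to be registered at /dataimport in solrconfig.xml.
SOLR = "http://localhost:8983/solr"
COLLECTION = "B"


def dih_url(command, **params):
    """Build a DataImportHandler URL for the given command (hypothetical helper)."""
    query = {"command": command, "wt": "json", **params}
    return f"{SOLR}/{COLLECTION}/dataimport?{urllib.parse.urlencode(query)}"


def get_json(url):
    """Issue a GET request and decode the JSON response."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def wait_until_idle(poll_seconds=5):
    """Poll the import status until the handler no longer reports 'busy'."""
    while get_json(dih_url("status")).get("status") == "busy":
        time.sleep(poll_seconds)


def reproduce():
    # 1. Start the first large delta (query load is applied externally).
    get_json(dih_url("delta-import", clean="false", commit="true"))
    wait_until_idle()
    # 2. Force a second delta immediately after the first one finishes.
    get_json(dih_url("delta-import", clean="false", commit="true"))
    wait_until_idle()
    # 3. Force an explicit hard commit, which reduced (but did not stop) the flood.
    get_json(f"{SOLR}/{COLLECTION}/update?commit=true&wt=json")
```

Calling `reproduce()` against a test cluster runs the two deltas back to back; nothing is sent until that function is invoked.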
Any insight into this would be very helpful. Thanks!