We have been using Solr 6.5.1 with SolrCloud backed by ZooKeeper for a multi-client, multi-node cluster for several months now, and have been having a few stability/recovery issues we’d like to confirm are fixed in Solr 7 or not. We run 3 large Solr nodes in the cluster (each with 64 GB of heap, hundreds of collections, and 9000+ cores). We have found that if anything (e.g. long garbage collection pauses) puts replicas of the same shard on more than one node into the down/recovering state at the same time, they usually fail to complete the recovery process and reach the active state again, even though all of the nodes in the cluster are up and running. The same thing happens if we shut down multiple nodes and start them back up at the same time. It looks like each replica thinks the other is supposed to be the leader, with the result that neither ever accepts the responsibility. Is this a common error in SolrCloud, and to the best of the list’s knowledge, has it been fixed in any Solr 7 release?
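As a point of reference for the stuck-leader state described above: the Collections API has a FORCELEADER action (added in Solr 5.4, so available in 6.5.1) intended as a last resort for exactly this situation, a shard where no replica will take leadership even though the nodes are up. A minimal sketch of building such a request follows; the host, collection, and shard names are placeholders, not values from this cluster.

```python
import urllib.parse

def force_leader_url(base_url, collection, shard):
    """Build a Collections API FORCELEADER request for a shard that is
    stuck with no replica accepting leadership.  FORCELEADER should only
    be used as a last resort, on a shard that currently has no leader."""
    params = urllib.parse.urlencode({
        "action": "FORCELEADER",
        "collection": collection,
        "shard": shard,
    })
    return base_url + "/admin/collections?" + params

# Placeholder names; substitute real host/collection/shard values.
url = force_leader_url("http://solr1:8983/solr", "clientA_docs", "shard1")
```

The resulting URL can be issued with curl or any HTTP client against any live node in the cluster.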
A second issue we have faced with the same cluster is that when the above situation arises, our recovery procedure is to stop all the impacted nodes (sometimes all 3 nodes in the cluster) and start them one at a time, waiting for all cores on each node to recover before starting the next one. During this process we have found two different problems. One is that recovering the nodes takes a long time, sometimes more than an hour for all the cores to reach the active state. Another, more pressing problem is that sometimes the order in which we start the servers back up seems to matter. We sometimes run into cases where we start a node and all but a few (< 10) cores recover; the remaining few won’t recover even after several hours and several restarts of the same node, and never leave the down state. To fix this we have to stop the node and try starting another node until we find one that fully recovers, then return to the originally problematic node and try again, which has always worked in the end, but only after a lot of pain. Our hope is to get to a day where we can start all the nodes in the cluster at the same time and have it “just work”, i.e. converge on a fully active state 100% of the time, instead of managing this one-node-at-a-time process. Our expectation is that if all the nodes in the cluster are up and healthy, then all of the cores in the cluster should eventually reach the active state, but that does not seem to be the case. Are there any fixes in Solr 7, or perhaps some configuration we can add to Solr 6, that resolve either of these issues?
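The "wait for all cores to recover before starting the next node" step above can be scripted against the Collections API CLUSTERSTATUS action rather than watched by hand. A sketch of the check, assuming the standard CLUSTERSTATUS JSON response shape (the collection and core names in the toy example are placeholders):

```python
def inactive_replicas(cluster_status):
    """Given the parsed JSON body of a CLUSTERSTATUS response
    (/admin/collections?action=CLUSTERSTATUS), return a list of
    (collection, shard, replica) tuples for every replica whose
    state is not 'active'.  An empty list means it is safe to
    start the next node."""
    stuck = []
    collections = cluster_status.get("cluster", {}).get("collections", {})
    for cname, coll in collections.items():
        for sname, shard in coll.get("shards", {}).items():
            for rname, replica in shard.get("replicas", {}).items():
                if replica.get("state") != "active":
                    stuck.append((cname, sname, rname))
    return stuck

# Toy example in the CLUSTERSTATUS response shape, with one down replica:
status = {"cluster": {"collections": {
    "clientA_docs": {"shards": {"shard1": {"replicas": {
        "core_node1": {"state": "active"},
        "core_node2": {"state": "down"},
    }}}}}}}
print(inactive_replicas(status))  # → [('clientA_docs', 'shard1', 'core_node2')]
```

Polling this in a loop after starting each node, and proceeding only once the list is empty, automates the manual one-node-at-a-time procedure described above.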
On 1/8/2019 12:12 PM, Johnston, Charlie wrote:
> We have been using Solr 6.5.1 leveraging SolrCloud backed by ZooKeeper for a multi-client, multi-node cluster for several months now and have been having a few stability/recovery issues we’d like to confirm if they are fixed in Solr 7 or not. We run 3 large Solr nodes in the cluster (each with 64 GBs of heap, 100’s of collections, and 9000+ cores).
Managing that many indexes is currently SolrCloud's Achilles heel.
Splitting the cluster across more Solr nodes will help to some degree,
but dealing with thousands of replicas in a single cluster is simply not
going to scale. In addition to splitting the indexes across more nodes,
you may also need to create multiple clusters so that each cluster is
managing a smaller number of shard replicas.
This is a known problem, and work is constantly underway to try to
improve the situation. I need to repeat the experiments that I did on
SOLR-7191 on a much newer version so that I can get a better idea of
whether the situation has improved in 7.x versions.
The discussion on SOLR-7191 is long and very dense, but it might be
worth reading. You have about twice as many cores as I was creating in
my experiments, which means that Solr will be processing more messages
for recovery operations: