CLUSTERSTATUS times out after 180s

Shai Erera

I'm using vanilla Solr 5.1.0 and see that the CLUSTERSTATUS API sometimes times out after 180s in unit tests. For instance, one test does the following:

- Starts an embedded Solr instance (equivalent to MiniSolrCloudCluster) and sets up an HttpSolrClient against it.
- Starts another Solr instance
- Creates a collection with one shard and one replica
- Each of the Solr instances' wrapping code polls the cluster status by issuing a CLUSTERSTATUS request against its local/sibling Solr instance. I.e. it doesn't use ZkStateReader, but exercises the Collections API.
- The test verifies that a second replica gets added to this collection, and times out if that doesn't happen within some period.
- The test eventually fails and in the logs I see that one of the Solr instances (that was supposed to add the missing replica) timed out while waiting for the CLUSTERSTATUS call to return. Since I don't expect it to fail, I set the test timeout to less than 180s.
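
To make the polling step concrete, here is a rough sketch of what it amounts to. This is not my actual wrapping code: it uses plain `HttpURLConnection` instead of HttpSolrClient, and the class name `ClusterStatusPoller`, the base URL, and all timeout values are assumptions made up for illustration. The only Solr-specific part is the Collections API URL with `action=CLUSTERSTATUS`.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.concurrent.Callable;

// Hypothetical helper for illustration; not part of Solr or SolrJ.
final class ClusterStatusPoller {

    // Builds a Collections API CLUSTERSTATUS URL.
    // baseUrl is assumed to look like http://localhost:8983/solr
    static String clusterStatusUrl(String baseUrl, String collection) {
        try {
            return baseUrl + "/admin/collections?action=CLUSTERSTATUS&wt=json&collection="
                    + URLEncoder.encode(collection, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    // Issues a GET with explicit connect/read timeouts and returns the response body.
    static String fetch(String url, int connectTimeoutMs, int readTimeoutMs) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(connectTimeoutMs);
        conn.setReadTimeout(readTimeoutMs);
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toString("UTF-8");
        } finally {
            conn.disconnect();
        }
    }

    // Re-evaluates check until it returns true or timeoutMs has elapsed.
    // Returns true on success, false if the deadline passed first.
    static boolean pollUntil(Callable<Boolean> check, long timeoutMs, long sleepMs) throws Exception {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (check.call()) {
                return true;
            }
            if (System.nanoTime() >= deadline) {
                return false;
            }
            Thread.sleep(sleepMs);
        }
    }
}
```

A test would call something like `pollUntil(() -> fetch(clusterStatusUrl(base, "c1"), 5000, 30000).contains("core_node2"), 120000, 500)`, where the string check is only a placeholder for parsing the JSON response properly.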

While reviewing the logs I realized it's not so easy to tell which of the Solr instances succeeds in obtaining the cluster status and which doesn't. I have my suspicions but I'm not sure, so I improved the test's logging. The failure is hard to reproduce, so for now I would like to ask the following:

1) I saw that users have complained about this in the past, and there was at least one JIRA issue that claimed to fix it. Is this still a known, or frequently hit, problem?

2) Would it also occur if I obtained the cluster status via CloudSolrClient.getZkStateReader().getClusterState()? If not, can someone explain the differences?

3) If I were to change the HttpSolrClient's underlying HttpClient to time out earlier on requests, would this request time out as well? I ask because I don't know what the default connection timeout is; I'm only guessing it's not >3 minutes, but I could be wrong. I am thinking of doing this because I feel like something is "stuck" in the Solr instance, and that if I were to retry the operation it would pass.

4) I am using the Collections API rather than CloudSolrClient because it seems like this is the recommended path for users. Is it so, or should I just use CloudSolrClient for these purposes?
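
To illustrate the retry idea from question 3, this is roughly what I have in mind: give each request a short timeout and retry only when it times out. It is a sketch under assumptions, not something I have running. The class `TimeoutRetry` and its method are hypothetical, and it catches `java.net.SocketTimeoutException`, which is what plain `java.net` connections throw; SolrJ may wrap timeouts in its own exception types, so the catch clause would need adjusting there.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

// Hypothetical helper for illustration; not part of Solr or SolrJ.
final class TimeoutRetry {

    // Invokes request up to maxAttempts times, retrying only on timeouts.
    // Any other exception propagates immediately; the last timeout is
    // rethrown once all attempts are used up.
    static <T> T callWithRetry(Callable<T> request, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        SocketTimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return request.call();
            } catch (SocketTimeoutException e) {
                last = e; // remember the timeout and try again
            }
        }
        throw last;
    }
}
```

If the request really is "stuck" only occasionally, a second attempt should succeed quickly; if every attempt times out, that would point at a problem in the Solr instance itself rather than a transient hiccup.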

If there are any best practices for configuring the tests (as well as an actual Solr instance) to avoid these issues, I'd appreciate it if you could share them.