[jira] [Created] (SOLR-10720) Aggressive removal of a collection breaks cluster state

JIRA jira@apache.org
Alexey Serba created SOLR-10720:

             Summary: Aggressive removal of a collection breaks cluster state
                 Key: SOLR-10720
                 URL: https://issues.apache.org/jira/browse/SOLR-10720
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
          Components: SolrCloud
    Affects Versions: 6.5.1
            Reporter: Alexey Serba

We are periodically seeing a tricky concurrency bug in SolrCloud that starts with a `Could not fully remove collection: my_collection` exception:

2017-05-17T14:47:50,153 - ERROR [OverseerThreadFactory-6-thread-5:SolrException@159] - {} - Collection: my_collection operation: delete failed:org.apache.solr.common.SolrException: Could not fully remove collection: my_collection
        at org.apache.solr.cloud.DeleteCollectionCmd.call(DeleteCollectionCmd.java:106)
        at org.apache.solr.cloud.OverseerCollectionMessageHandler.processMessage(OverseerCollectionMessageHandler.java:224)
        at org.apache.solr.cloud.OverseerTaskProcessor$Runner.run(OverseerTaskProcessor.java:463)

After that, all SolrCloud operations that involve reading cluster state fail with:

org.apache.solr.common.SolrException: Error loading config name for collection my_collection
    at org.apache.solr.common.cloud.ZkStateReader.readConfigName(ZkStateReader.java:198)
    at org.apache.solr.handler.admin.ClusterStatus.getClusterStatus(ClusterStatus.java:141)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /collections/my_collection

See full [stacktraces|https://gist.github.com/serba/9b7932f005f34f6cd9a511e226c6f0c6]

As a result, SolrCloud becomes completely broken. We are seeing this with 6.5.1, but I think we have seen it with older versions too.

From reading the code, it looks like a combination of two factors:
* The collection's znode is forcefully removed in the finally block of [DeleteCollectionCmd|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/6.5.1/solr/core/src/java/org/apache/solr/cloud/DeleteCollectionCmd.java#L115], which was introduced in SOLR-5135. Note that this leaves the cached cluster state out of sync with the state in Zk, i.e. {{zkStateReader.getClusterState()}} still contains the collection (see the code [here|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/6.5.1/solr/core/src/java/org/apache/solr/cloud/DeleteCollectionCmd.java#L98]) whereas the {{/collections/<collection_id>}} znode in Zk has already been removed.
* The cluster state read does not just return the cached version; it also reads each collection's config name from the {{/collections/<collection_id>}} znode, which was forcefully removed. The code that reads the config name for every collection directly from Zk was introduced in SOLR-7636. Aren't there performance implications of reading N znodes (1 per collection) on every {{getClusterStatus}} call?
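To make the interaction of the two factors concrete, here is a toy model in plain Java. It is only a sketch and not Solr code: the class {{StaleClusterStateDemo}}, its in-memory maps, and its {{NoNodeException}} are hypothetical stand-ins for ZooKeeper, the cached cluster state, and {{KeeperException$NoNodeException}}. It assumes the cache is never refreshed after the failed delete, which is what the stack traces above suggest.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of the stale-cache problem; every name here is hypothetical. */
final class StaleClusterStateDemo {
    /** Stand-in for ZooKeeper: znode path to znode data (the config name). */
    final Map<String, String> zk = new HashMap<>();
    /** Stand-in for the cached cluster state: names of known collections. */
    final Set<String> cachedCollections = new LinkedHashSet<>();

    /** Stand-in for KeeperException.NoNodeException. */
    static final class NoNodeException extends RuntimeException {
        NoNodeException(String path) {
            super("KeeperErrorCode = NoNode for " + path);
        }
    }

    void create(String collection, String configName) {
        zk.put("/collections/" + collection, configName);
        cachedCollections.add(collection);
    }

    /** Delete that fails its final check but still removes the znode in finally. */
    void deleteCollection(String collection) {
        try {
            // The cache is never updated in this model, so this check always fails.
            if (cachedCollections.contains(collection)) {
                throw new IllegalStateException(
                        "Could not fully remove collection: " + collection);
            }
        } finally {
            // Forced cleanup: the znode disappears while the cache keeps the entry.
            zk.remove("/collections/" + collection);
        }
    }

    /** Reads the config name straight from the znode, like the second factor. */
    String readConfigName(String collection) {
        String path = "/collections/" + collection;
        String data = zk.get(path);
        if (data == null) {
            throw new NoNodeException(path);
        }
        return data;
    }

    /** Iterates the cached collections and reads each config name from "Zk". */
    Map<String, String> getClusterStatus() {
        Map<String, String> status = new LinkedHashMap<>();
        for (String collection : cachedCollections) {
            status.put(collection, readConfigName(collection));
        }
        return status;
    }
}
```

In this model, once {{deleteCollection}} has thrown, every later {{getClusterStatus}} call fails with the stand-in {{NoNodeException}}, because the cache still lists a collection whose znode no longer exists.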

I'm not sure what the proper fix should be:
* Should we just catch {{KeeperException$NoNodeException}} in {{getClusterStatus}} and treat such a collection as removed? That looks like the easiest / least invasive fix.
* Should we stop reading the config name from the collection znode and get it from the cache somehow?
* Should we not try to delete the collection's data from Zk if the delete operation failed?
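As an illustration of the first option only, the sketch below shows the "catch and skip" idea in plain Java. It is not a patch against Solr: {{TolerantClusterStatusSketch}}, {{configNames}}, and the local {{NoNodeException}} are hypothetical names, and the config reader is passed in as a function so the example stays self-contained.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Sketch of option 1: skip collections whose znode has disappeared. */
final class TolerantClusterStatusSketch {
    /** Hypothetical stand-in for KeeperException.NoNodeException. */
    static final class NoNodeException extends RuntimeException {
        NoNodeException(String path) {
            super("KeeperErrorCode = NoNode for " + path);
        }
    }

    /**
     * Builds a collection-to-config-name map from the cached collection names.
     * A collection whose config read throws NoNodeException is treated as
     * removed and left out of the result instead of failing the whole call.
     */
    static Map<String, String> configNames(List<String> cachedCollections,
                                           Function<String, String> readConfigName) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String collection : cachedCollections) {
            try {
                result.put(collection, readConfigName.apply(collection));
            } catch (NoNodeException e) {
                // The znode is gone: assume the collection was deleted and skip it.
            }
        }
        return result;
    }
}
```

This keeps the rest of the cluster status readable, but it does not address the stale cache itself, so the second and third options remain open questions.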
