[jira] [Commented] (SOLR-12187) Replica should watch clusterstate and unload itself if its entry is removed


JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-12187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439929#comment-16439929 ]

Shalin Shekhar Mangar commented on SOLR-12187:
----------------------------------------------

Hi Dat, the changes look good to me. +1 to commit.

One improvement we can make is to make the collection state watcher notifications more robust. Currently there is no exception handling in the ZkStateReader.Notification thread; perhaps we should add some now that we rely so heavily on those notifications.
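
To illustrate the kind of hardening meant here, below is a minimal sketch in plain Java. It is not Solr's actual ZkStateReader code: the class SafeNotifier and its methods are hypothetical names invented for this example. The idea is only that each watcher callback is invoked inside its own try/catch, so one misbehaving watcher cannot abort the notification run or silently starve the remaining watchers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Hypothetical stand-in for a notification task; NOT Solr's ZkStateReader.
final class SafeNotifier<S> {
    private final List<Consumer<S>> watchers = new CopyOnWriteArrayList<>();
    private final List<Throwable> failures = new CopyOnWriteArrayList<>();

    void addWatcher(Consumer<S> watcher) {
        watchers.add(watcher);
    }

    // Calls every watcher with the new state. A RuntimeException thrown by one
    // watcher is recorded and does not prevent the others from being notified.
    // Returns the number of watchers that completed without throwing.
    int notifyWatchers(S state) {
        int delivered = 0;
        for (Consumer<S> watcher : watchers) {
            try {
                watcher.accept(state);
                delivered++;
            } catch (RuntimeException e) {
                // A real implementation would log this; here we only keep it.
                failures.add(e);
            }
        }
        return delivered;
    }

    // Snapshot of the exceptions recorded so far.
    List<Throwable> failures() {
        return new ArrayList<>(failures);
    }
}
```

With this shape, a watcher that throws is counted as a failure while the watchers registered after it still see the state change. Whether failed watchers should be retried, removed, or only logged is a separate design decision that this sketch does not address.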

> Replica should watch clusterstate and unload itself if its entry is removed
> ---------------------------------------------------------------------------
>
>                 Key: SOLR-12187
>                 URL: https://issues.apache.org/jira/browse/SOLR-12187
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public)
>            Reporter: Cao Manh Dat
>            Assignee: Cao Manh Dat
>            Priority: Major
>         Attachments: SOLR-12187.patch, SOLR-12187.patch, SOLR-12187.patch, SOLR-12187.patch, SOLR-12187.patch
>
>
> With the introduction of the autoscaling framework, we have seen an increase in the number of issues caused by race conditions between deleting a replica and other operations.
> Case 1: DeleteReplicaCmd fails to send an UNLOAD request to a replica and therefore forcefully removes its entry from clusterstate, but the replica still functions normally and is able to become a leader -> SOLR-12176
> Case 2:
>  * DeleteReplicaCmd enqueues a DELETECOREOP (without sending a request to the replica, because the node is not live)
>  * The node starts and the replica gets loaded
>  * The DELETECOREOP has not been processed yet, hence the replica is still present in clusterstate --> it passes checkStateInZk
>  * The DELETECOREOP is executed and DeleteReplicaCmd finishes
>  ** result 1: the replica starts recovering, finishes, and publishes itself as ACTIVE --> state of the replica is ACTIVE
>  ** result 2: the replica throws an exception (probably an NPE) --> state of the replica is DOWN, and it does not join leader election
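
The behaviour proposed in the issue title (a replica watches clusterstate and unloads itself once its entry disappears) can be sketched roughly as follows. This is a simplified model and not the attached patch: the class ReplicaSelfUnloader, the representation of the collection state as a plain set of replica names, and the unload callback are all assumptions made for illustration. The callback returns true once the watcher has done its job, on the assumption that the caller then stops delivering notifications to it.

```java
import java.util.Set;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical model of a replica-side watcher; NOT the code from the patch.
final class ReplicaSelfUnloader {
    private final String replicaName;
    private final Runnable unloadAction;
    private final AtomicBoolean unloaded = new AtomicBoolean(false);

    ReplicaSelfUnloader(String replicaName, Runnable unloadAction) {
        this.replicaName = replicaName;
        this.unloadAction = unloadAction;
    }

    // Called whenever the (simplified) collection state changes.
    // Returns false to keep watching, true once the replica's entry is gone.
    boolean onStateChanged(Set<String> replicasInClusterState) {
        if (replicasInClusterState.contains(replicaName)) {
            return false; // entry still present: nothing to do yet
        }
        // Entry removed: unload exactly once, even if notified repeatedly.
        if (unloaded.compareAndSet(false, true)) {
            unloadAction.run();
        }
        return true;
    }

    boolean isUnloaded() {
        return unloaded.get();
    }
}
```

In terms of Case 2 above, the replica would pass the initial check while its entry is still present, and the watcher would then fire when the DELETECOREOP is finally executed, so the replica unloads instead of recovering or ending up DOWN.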



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
