[jira] [Comment Edited] (SOLR-11730) Test NodeLost / NodeAdded dynamics


JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16282027#comment-16282027 ]

Andrzej Bialecki  edited comment on SOLR-11730 at 12/7/17 6:44 PM:
-------------------------------------------------------------------

Simulations indicate that even with significant flakiness (outages lasting up to {{waitFor + cooldown}}) the framework may not take any action if other events are happening at the same time: even if a {{nodeLost}} trigger creates an event, that event may still be discarded due to the cooldown period. By the time the cooldown period has passed, the flaky node may be back up again, so the event would not be generated again.


was (Author: ab):
Simulations indicate that even with significant flakiness the framework may not take any actions if there are other events happening too, because even if a node lost trigger creates an event that event may be discarded due to the cooldown period. And after the cooldown period has passed the flaky node may be back up again, so the event would not be generated again.
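
The timing argument in the edited comment can be checked with a little arithmetic. The sketch below is a deliberately simplified model, not Solr code: the class {{CooldownSketch}}, the method {{nodeLostActedOn}} and its parameters are hypothetical names, and it assumes that a {{nodeLost}} event can only be raised after {{waitFor}} of continuous outage, that events raised during the cooldown are dropped, and that an event is only raised while the node is still down. Trigger polling intervals are ignored.

```java
/** Hypothetical, simplified timing model; not part of Solr. All times are in milliseconds. */
class CooldownSketch {

    /**
     * Returns true if a nodeLost event for a single outage could be acted upon.
     *
     * @param downAt      time the node went down
     * @param upAt        time the node came back up
     * @param waitFor     assumed waitFor of the nodeLost trigger
     * @param cooldownEnd time at which the cooldown caused by other events ends (0 if none)
     */
    static boolean nodeLostActedOn(long downAt, long upAt, long waitFor, long cooldownEnd) {
        // The trigger can only raise the event after waitFor of continuous outage.
        long earliestFire = downAt + waitFor;
        // Assumption: events raised while the cooldown is active are discarded.
        long earliestAccepted = Math.max(earliestFire, cooldownEnd);
        // The node must still be down at the first moment an event would be accepted.
        return earliestAccepted < upAt;
    }
}
```

Under these assumptions, with {{waitFor}} = 10000 and a cooldown that ends at 14000, an outage from 0 to 14000 (shorter than {{waitFor + cooldown}} for a 5000 ms cooldown) is never acted upon, while the same outage with no competing events is.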

> Test NodeLost / NodeAdded dynamics
> ----------------------------------
>
>                 Key: SOLR-11730
>                 URL: https://issues.apache.org/jira/browse/SOLR-11730
>             Project: Solr
>          Issue Type: Sub-task
>      Security Level: Public(Default Security Level. Issues are Public)
>          Components: AutoScaling
>            Reporter: Andrzej Bialecki
>
> Let's consider a "flaky node" scenario.
> A node is going up and down at short intervals (e.g. due to a flaky network cable). If the frequency of these events coincides with the {{waitFor}} interval in the {{nodeLost}} trigger configuration, the node may never be reported to the autoscaling framework as lost. Similarly, it may never be reported as added back if it is lost again within the {{waitFor}} period of the {{nodeAdded}} trigger.
> Other scenarios are possible here too, depending on timing:
> * node being constantly reported as lost
> * node being constantly reported as added
> One possible solution for the autoscaling triggers is for the framework to keep a short-term ({{waitFor * 2}} long?) memory of the state of each node that the trigger is tracking, in order to eliminate flaky nodes (i.e. those that transitioned between states more than once within the period).
> A situation like this is detrimental to SolrCloud behavior regardless of autoscaling actions, so it should probably be addressed at the node level, e.g. by shutting down a Solr node once the number of disconnects in a time window reaches a certain threshold.
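
The "short-term memory" idea from the description could be sketched roughly as follows. This is only an illustration under stated assumptions, not an existing Solr API: the class {{FlakyNodeTracker}}, its methods and the {{maxTransitions}} parameter are hypothetical names, and the window length ({{waitFor * 2}}) is the value floated in the issue rather than a tested setting.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not part of Solr: remembers recent state transitions per node. */
class FlakyNodeTracker {
    private final long windowMs;       // length of the memory, e.g. waitFor * 2
    private final int maxTransitions;  // more transitions than this inside the window means "flaky"
    private final Map<String, Deque<Long>> transitions = new HashMap<>();

    FlakyNodeTracker(long windowMs, int maxTransitions) {
        this.windowMs = windowMs;
        this.maxTransitions = maxTransitions;
    }

    /** Records that a node changed state (live to lost, or lost to live) at the given time. */
    void recordTransition(String node, long nowMs) {
        Deque<Long> times = transitions.computeIfAbsent(node, k -> new ArrayDeque<>());
        times.addLast(nowMs);
        prune(times, nowMs);
    }

    /** Returns true if the node changed state more than maxTransitions times within the window. */
    boolean isFlaky(String node, long nowMs) {
        Deque<Long> times = transitions.get(node);
        if (times == null) {
            return false;
        }
        prune(times, nowMs);
        return times.size() > maxTransitions;
    }

    /** Drops transitions that are older than the window. */
    private void prune(Deque<Long> times, long nowMs) {
        while (!times.isEmpty() && nowMs - times.peekFirst() > windowMs) {
            times.removeFirst();
        }
    }
}
```

With {{maxTransitions}} set to 1 this matches the "more than once within the period" wording above. The same per-node counter, with a higher threshold, could in principle drive the node-level shutdown mentioned in the last paragraph, although the description does not say how that would be implemented.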



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
