[jira] [Commented] (SOLR-13464) Sporadic Auth + Cloud test failures, probably due to lag in nodes reloading security config


[jira] [Commented] (SOLR-13464) Sporadic Auth + Cloud test failures, probably due to lag in nodes reloading security config

Igor Motov (Jira)

    [ https://issues.apache.org/jira/browse/SOLR-13464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837697#comment-16837697 ]

Hoss Man commented on SOLR-13464:
---------------------------------

In theory it would be possible for a test client (or any real production client) to poll {{/admin/auth...}} on all/any nodes in a cluster to verify that they are using the updated security settings, because the behavior of SecurityConfHandlerZk on GET is to read the _cached_ security props from the ZkStateReader, which in theory is only updated once it's been force refreshed by the zk watcher ... but this still has two problems:
 # any client doing this would have to be stateful and know what the most recent setting(s) change was, so it could assert those specific settings have been updated. There's no way for a "dumb" client to simply ask "is your current config up to date w/zk?". Even if the client directly polled ZK to see what the current version is in the authoritative {{/security.json}} for the cluster, the "version" info isn't included in the {{GET /admin/auth...}} responses, so it would have to do a "deep comparison" of the entire JSON response.
 # even if a client knows what data to expect from a {{GET /admin/auth...}} request when polling all/any nodes in the cluster (either from first hand knowledge because it was the client that did the last POST, or second hand knowledge from querying ZK directly), and even if the expected data is returned by every node, that doesn't mean it's in *USE* yet – there is inherent lag between when the security conf data is "refreshed" in the ZkStateReader (on each node) and when the plugin Object instances are actually initialized and become active on each node.
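To make problem #1 concrete: today a client is reduced to something like the (untested, purely illustrative) polling loop below. It has to carry the expected config around with it, fetch {{GET /admin/authentication}} from every node, and deep-compare the JSON; the {{"authentication"}} response key, the use of Jackson for parsing, and the omission of any credential handling are all just assumptions of the sketch. And note that even when this loop passes it still doesn't address problem #2, because matching config data doesn't prove the plugin instances have been re-initialized.

{code:java}
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.Map;
import com.fasterxml.jackson.databind.ObjectMapper;

/**
 * Hypothetical stateful client: it must already hold the exact config it expects
 * (expectedAuthcConfig) and deep-compare it against what each node returns.
 */
public class WaitForSecurityProps {
  private static final ObjectMapper MAPPER = new ObjectMapper();

  public static void waitForAuthcConfig(List<String> nodeBaseUrls,   // e.g. "http://127.0.0.1:8983/solr"
                                        Map<String, Object> expectedAuthcConfig,
                                        long timeoutMs) throws Exception {
    HttpClient http = HttpClient.newHttpClient();
    long deadline = System.currentTimeMillis() + timeoutMs;
    for (String baseUrl : nodeBaseUrls) {
      while (true) {
        HttpRequest req = HttpRequest.newBuilder(
            URI.create(baseUrl + "/admin/authentication")).GET().build();
        HttpResponse<String> rsp = http.send(req, HttpResponse.BodyHandlers.ofString());
        // parse the whole response and deep-compare just the "authentication" section
        Map<?, ?> body = MAPPER.readValue(rsp.body(), Map.class);
        if (expectedAuthcConfig.equals(body.get("authentication"))) {
          break;                       // this node's cached props now match ... but see problem #2
        }
        if (System.currentTimeMillis() > deadline) {
          throw new AssertionError("node " + baseUrl + " never returned the expected config");
        }
        Thread.sleep(100);             // poll interval
      }
    }
  }
}
{code}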

----
Here's a strawman proposal for a possible solution to this problem, both for use in tests and for end users who might want to verify when updated settings are really enabled...
 # refactor CoreContainer so that methods like {{public AuthorizationPlugin getAuthorizationPlugin()}} are deprecated/syntactic sugar for new {{public SecurityPluginHolder<AuthorizationPlugin> getAuthorizationPlugin()}} methods so that callers can read the znode version used to init the plugin
 # refactor {{SecurityConfHandler.getPlugin(String)}} to be deprecated/syntactic sugar for a new version that returns {{SecurityPluginHolder<?>}}
 # update {{SecurityConfHandlerZk.getConf}} so that it:
 ** uses {{getSecurityConfig(true)}} to ensure it reads the most current settings from ZK (instead of the cached copy used by the current code).
 ** adds the {{SecurityConfig.getVersion()}} number in the response (in addition to the config data) ... perhaps as {{key + ".conf.version"}}
 ** when {{getPlugin(key)}} is non null, include the {{SecurityPluginHolder.getVersion()}} in the response ... perhaps as {{key + ".enabled.version"}} (a rough sketch of these changes follows below)
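Glossing over what the existing {{getConf}} signature and error handling actually look like, the body of the {{SecurityConfHandlerZk.getConf}} change might end up roughly like this; only the names called out above ({{getSecurityConfig(true)}}, {{SecurityConfig.getVersion()}}, {{getPlugin(key)}}, {{SecurityPluginHolder.getVersion()}}) are meant literally, everything else is assumed:

{code:java}
// Inside the existing SecurityConfHandlerZk class -- sketch of the proposed getConf body.
// Only getSecurityConfig(true), SecurityConfig.getVersion(), getPlugin(key) and
// SecurityPluginHolder.getVersion() are meant literally; other details are assumptions.
@Override
protected void getConf(SolrQueryResponse rsp, String key) {
  // 1) read the *fresh* config from ZK instead of the ZkStateReader's cached copy
  SecurityConfig securityConfig = getSecurityConfig(true);
  Object conf = securityConfig.getData().get(key);          // key = "authentication" or "authorization"
  if (conf == null) {
    rsp.add("errorMessages", "No " + key + " configured");  // error shape is hand-waved here
    return;
  }
  rsp.add(key, conf);
  // 2) the znode version the config data was read at
  rsp.add(key + ".conf.version", securityConfig.getVersion());
  // 3) the znode version the currently *active* plugin instance was initialized from
  SecurityPluginHolder<?> holder = getPlugin(key);
  rsp.add(key + ".enabled", holder != null);
  if (holder != null) {
    rsp.add(key + ".enabled.version", holder.getVersion());
  }
}
{code}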

With those changes in place, a dumb client can easily poll any/all node(s) for {{/admin/auth_foo}} until the {{auth_foo.conf.version}} and {{auth_foo.enabled.version}} are identical, to know when the most recent {{auth_foo}} settings in ZK's security.json are actually in use.
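The per-node check then becomes trivially stateless; something like the little helper below (the {{".conf.version"}} / {{".enabled.version"}} keys are of course the proposed, not-yet-existing ones), applied to each node's parsed {{GET /admin/auth...}} response inside the same kind of polling loop as the earlier sketch:

{code:java}
import java.util.Map;
import java.util.Objects;

/**
 * Per-node check a "dumb" client (or test harness) could apply to the parsed
 * GET /admin/auth... response, with no knowledge of what the last change was.
 * The ".conf.version" / ".enabled.version" keys are the proposed ones, not existing API.
 */
public class SecurityVersionCheck {

  /** @param key "authentication" or "authorization" */
  public static boolean isLatestConfigInUse(Map<?, ?> adminAuthResponse, String key) {
    Object confVersion = adminAuthResponse.get(key + ".conf.version");       // what's in ZK
    Object enabledVersion = adminAuthResponse.get(key + ".enabled.version"); // what's actually active
    return confVersion != null && Objects.equals(confVersion, enabledVersion);
  }
}
{code}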

(We could potentially take things even a step further, and add something like a {{verify.cluster.version=true|false}} option to SecurityConfHandlerZk, which would federate {{GET /admin/auth...}} to every (live?) node in the cluster, and include a map of nodeName => enabled.version for every node ... maybe?)

Thoughts?

> Sporadic Auth + Cloud test failures, probably due to lag in nodes reloading security config
> -------------------------------------------------------------------------------------------
>
>                 Key: SOLR-13464
>                 URL: https://issues.apache.org/jira/browse/SOLR-13464
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public)
>            Reporter: Hoss Man
>            Priority: Major
>
> I've been investigating some sporadic and hard-to-reproduce test failures related to authentication in cloud mode, and I *think* (but have not directly verified) that the common cause is that after using one of the {{/admin/auth...}} handlers to update some setting, there is an inherent and unpredictable delay (due to ZK watches) until every node in the cluster has had a chance to (re)load the new configuration and initialize the various security plugins with the new settings.
> Which means, if a test client does a POST to some node to add/change/remove some authn/authz settings, and then immediately hits the exact same node (or any other node) to test that the effects of those settings exist, there is no guarantee that they will have taken effect yet.





Re: [jira] [Commented] (SOLR-13464) Sporadic Auth + Cloud test failures, probably due to lag in nodes reloading security config

Jan Høydahl / Cominvent
Hoss, I see several of these failures popping up, probably related to the timing of the config reload across nodes. Should we, as a phase 1, introduce a simple sleep to harden those tests, and follow up later with APIs that support waiting until the config propagates?

Jan Høydahl



Re: [jira] [Commented] (SOLR-13464) Sporadic Auth + Cloud test failures, probably due to lag in nodes reloading security config

Chris Hostetter-3

: Hoss, I see several of these failures popping up, probably related to
: timing of the config reload across nodes. Should we as a phase 1
: introduce a simple sleep to harden those tests and follow up later with
: APIs that support waiting until config propagates?

Well, I personally refuse to add any sleep calls to any tests -- but
that's my personal opinion.  You and others may have your own opinions and
take different actions than I would take :)

https://twitter.com/_hossman/status/974743183044128768


-Hoss
http://www.lucidworks.com/

