Issues with new replicas in solrcloud cluster being created as NRT by autoscaling and not respecting cluster policy


Matt Goward [Contractor]
I am working on an autoscaling Kubernetes cluster for SolrCloud running Solr 7.5.  I have most of this up and working, but ran into a few issues when I got to the point of testing.  The core issue is that when Solr replaces a replica, it creates the new replica as NRT rather than TLOG, and it does not respect the cluster policy when selecting node locations for that replica.

######### SETUP #######
I am creating the collection with all TLOG replicas for now.  We have 6 nodes, 2 in each of 3 availability zones.

My relevant autoscaling/SolrCloud settings are:
curl -X POST http://localhost:32080/v2/cluster/autoscaling -d '{ "set-cluster-preferences" : [  {"minimize" : "cores"},{"maximize" : "freedisk", "precision" : 10},{"minimize" : "sysLoadAvg"}]}'

curl -X POST http://localhost:32080/v2/cluster/autoscaling -d '{ "set-trigger" : {"name":"node_added_trigger","event":"nodeAdded","waitFor":"1m", "enabled" : true,"actions" : [{"name" : "compute_plan","class": "solr.ComputePlanAction"},{"name" : "execute_plan","class": "solr.ExecutePlanAction"}]}}'

curl -X POST http://localhost:32080/v2/cluster/autoscaling -d '{ "set-cluster-policy" : [{"shard":"#ANY", "replica": "<3","sysprop.K8SNODE":"*"},{"shard":"#ANY", "replica": "<3","sysprop.EC2AZ":"*"}]}'

curl 'http://localhost:32080/solr/admin/collections?action=CREATE&name=.system&numShards=1&autoAddReplicas=true&tlogReplicas=3&nrtReplicas=0'


The K8SNODE and EC2AZ are passed in via -D args at start time.
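For reference, the sysprops end up in SOLR_OPTS at container start, roughly like this (a sketch; the literal values shown here are placeholders, as the real ones are injected by our Kubernetes deployment):

```shell
# Illustrative sketch: pass the placement sysprops to Solr at startup.
# "worker-node-1" and "us-east-1a" are placeholder values, not our real
# node/zone names; Kubernetes injects the actual values per pod.
SOLR_OPTS="$SOLR_OPTS -DK8SNODE=worker-node-1 -DEC2AZ=us-east-1a"
export SOLR_OPTS
```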

The collection in question is created as:
curl 'http://localhost:32080/solr/admin/collections?action=CREATE&name=testing.v1&collection.configName=testing.v1&tlogReplicas=6&numShards=2&autoAddReplicas=true&nrtReplicas=0'


##### Induce failure and cause issue in question #####

This all creates as expected, though it does ignore the policy.  We then delete a pair of nodes and let Solr notice and recreate the replicas that existed on those nodes.  When we bring up a pair of new nodes, Solr notices and moves the new replicas onto them.  All exactly as it should be, except for the TYPE. Screenshot of the node tree:
https://monosnap.com/file/jGhVbfB1Aa5HpghcGMy6sh8Oe99Hjx
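In case the screenshot link dies: the same information can be pulled from a CLUSTERSTATUS dump and filtered.  A quick sketch against a canned snapshot (the JSON below is a fabricated example, not real output from our cluster):

```shell
# Sketch: count replica types in a CLUSTERSTATUS-style JSON dump.
# Normally the dump would come from:
#   curl 'http://localhost:32080/solr/admin/collections?action=CLUSTERSTATUS'
# The snapshot below is fabricated for illustration.
cat > /tmp/clusterstatus.json <<'EOF'
{"cluster":{"collections":{"testing.v1":{"shards":{"shard1":{"replicas":{
"core_node1":{"type":"TLOG"},"core_node2":{"type":"NRT"}}}}}}}}
EOF
# Tally the replica types that appear in the dump.
grep -o '"type":"[A-Z]*"' /tmp/clusterstatus.json | sort | uniq -c
```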


The log entries are the same for the .system and testing.v1 collections:


createReplica() {
  "operation":"ADDREPLICA",
  "collection":".system",
  "shard":"shard1",
  "core":".system_shard1_replica_n1",
  "state":"down",
  "base_url":"http://172.28.149.41:32080/solr",
  "type":"NRT",
  "waitForFinalState":"false"}

2018-11-05 15:05:46.170 INFO  (OverseerStateUpdate-245134345588899840-172.28.151.122:32080_solr-n_0000000000) [   ] o.a.s.c.o.SliceMutator createReplica() {
  "operation":"ADDREPLICA",
  "collection":".system",
  "shard":"shard1",
  "core":".system_shard1_replica_n3",
  "state":"down",
  "base_url":"http://172.28.154.245:32080/solr",
  "type":"NRT",
  "waitForFinalState":"false"}

2018-11-05 15:05:46.194 INFO  (OverseerStateUpdate-245134345588899840-172.28.151.122:32080_solr-n_0000000000) [   ] o.a.s.c.o.SliceMutator createReplica() {
  "operation":"ADDREPLICA",
  "collection":".system",
  "shard":"shard1",
  "core":".system_shard1_replica_n5",
  "state":"down",
  "base_url":"http://172.28.156.38:32080/solr",
  "type":"NRT",
  "waitForFinalState":"false"}



I am struggling to find details in the docs that call out how to tell the cluster what the ratio of TLOG and/or PULL replicas should be as it moves things around.  Either way, if it is replacing a TLOG replica it should replace it with a TLOG, right?
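For now, the only workaround I see is to add the replacement replicas by hand, since ADDREPLICA does accept an explicit type parameter in 7.x.  A sketch, reusing the collection and shard names from the setup above:

```shell
# Manually add a TLOG replica instead of letting autoscaling pick NRT.
# ADDREPLICA accepts type=nrt|tlog|pull in Solr 7.x; letting Solr pick
# the target node, since the policy is being ignored anyway.
curl 'http://localhost:32080/solr/admin/collections?action=ADDREPLICA&collection=testing.v1&shard=shard1&type=TLOG'
```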

######### Cluster Policy Issue ############
The next issue is the cluster policy.  The goal is to make sure a given replica is not duplicated on a physical Kubernetes node (K8SNODE) and that we keep 2 copies of each shard in each availability zone.  As near as I can tell, these rules are simply being ignored.  If I create rules within a collection, it works as expected:

curl 'http://localhost:32080/solr/admin/collections?action=CREATE&name=testing.v1&collection.configName=testing.v1&maxShardsPerNode=32&numShards=2&replicationFactor=6&autoAddReplicas=true&rule=shard:*,replica:<2,sysprop.K8SNODE:*&rule=shard:*,replica:>1,sysprop.EC2AZ:*'

But then I cannot use TLOG replicas, because when I try to create the collection with those rules in place it complains:

{
  "responseHeader":{
    "status":400,
    "QTime":134},
  "Operation create caused exception:":"org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: TLOG or PULL replica types not supported with placement rules or cluster policies",
  "exception":{
    "msg":"TLOG or PULL replica types not supported with placement rules or cluster policies",
    "rspCode":400},
  "error":{
    "metadata":[
      "error-class","org.apache.solr.common.SolrException",
      "root-error-class","org.apache.solr.common.SolrException"],
    "msg":"TLOG or PULL replica types not supported with placement rules or cluster policies",
    "code":400}}



Also of note, the docs do not call out in any meaningful way whether you can or cannot use TLOG or PULL replicas with placement rules or cluster policies.  The fact that the docs DO call out that mixing NRT with TLOG replicas is not supported would seem to conflict with this concept.
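As a stopgap, initial placement can be pinned by hand at create time with createNodeSet instead of rules, which as far as I can tell does work with TLOG replicas since no placement rules are involved.  A sketch, reusing node addresses that appear in the logs above:

```shell
# Sketch: pin initial replica placement explicitly via createNodeSet
# rather than placement rules, so TLOG replicas are still allowed.
# The node list reuses addresses from the logs above for illustration.
curl 'http://localhost:32080/solr/admin/collections?action=CREATE&name=testing.v1&collection.configName=testing.v1&numShards=2&tlogReplicas=3&nrtReplicas=0&createNodeSet=172.28.149.41:32080_solr,172.28.154.245:32080_solr,172.28.156.38:32080_solr'
```

This obviously loses the automatic AZ balancing, so it is only a manual stand-in until the policy issue is sorted out.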

Please let me know what additional information would be helpful in getting to the root cause on this.

Thank you,
Matthew











