Cascading failures with replicas


Cascading failures with replicas

Walter Underwood
I’m running a 4x4 cluster (4 shards, replication factor 4) on 16 hosts. I shut down Solr on one host because it got into some kind of bad, can’t-recover state where it was causing timeouts across the whole cluster (bug #1).

I ran a load benchmark near the capacity of the cluster. It had run fine in test; this was the prod cluster.

SolrCloud added a replica to replace the down node. The node with two cores got double the traffic and started slowly flapping in and out of service. The 95th percentile response time spiked from 3 seconds to 100 seconds. At some point, another replica was created, putting two replicas from the same shard on the same instance. Naturally, that was overloaded, and I killed the benchmark out of charity.

Bug #2 is creating a new replica when one host is down. This should be an option that defaults to “false”, because it causes the cascade.

Bug #3 is sending equal traffic to each core without considering the host. Each host should get equal traffic, not each core.

Bug #4 is putting two replicas from the same shard on one instance. That is just asking for trouble.

When it works, this cluster is awesome.

wunder
Walter Underwood
[hidden email]
http://observer.wunderwood.org/  (my blog)



Re: Cascading failures with replicas

Erick Erickson
Bug #2: Solr shouldn't be adding replicas by itself unless you
specified autoAddReplicas=true when you created the collection. It
defaults to "false", so I'm not sure what's going on here.
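
For reference, this is roughly how the flag gets set at creation time
(a hedged sketch against the Collections API; host, collection name,
and config name are placeholders, not your actual setup):

    # Hedged sketch: pass autoAddReplicas explicitly when creating the
    # collection via the Collections API. Host, collection name, and
    # config name are placeholders.
    import urllib.parse
    import urllib.request

    params = urllib.parse.urlencode({
        "action": "CREATE",
        "name": "mycollection",             # placeholder
        "numShards": 4,
        "replicationFactor": 4,
        "maxShardsPerNode": 1,
        "collection.configName": "myconf",  # placeholder
        "autoAddReplicas": "false",         # explicit; false is the default
        "wt": "json",
    })
    url = "http://localhost:8983/solr/admin/collections?" + params
    print(urllib.request.urlopen(url).read().decode("utf-8"))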

Bug #3: The internal load balancers are round-robin across cores, so
this is expected. Not optimal, I'll grant, but expected.

Bug #4: What shard placement rules are you using? There is a set of
rules for replica placement, and one of the criteria (IIRC) is
exactly to try to distribute a shard's replicas across different
hosts, although there was some glitchiness about whether two JVMs on
the same _host_ were considered "the same host" or not.
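
If you want to pin placement down explicitly, the rule-based syntax is
what I had in mind; the documented rule for "for a given shard, keep
fewer than 2 replicas on any node" looks like this (a rough sketch,
appended to the CREATE call from the earlier sketch):

    # Hedged sketch: a rule-based placement parameter meaning "for a
    # given shard, keep fewer than 2 replicas on any node". It would be
    # appended to the CREATE request shown earlier.
    import urllib.parse
    extra = urllib.parse.urlencode({"rule": "shard:*,replica:<2,node:*"})
    # e.g. create_url = url + "&" + extra   (url from the earlier sketch)
    print(extra)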

Bug #1 has been more or less a pain for quite a while; work is ongoing there.

FWIW,
Erick


Re: Cascading failures with replicas

Walter Underwood
Thanks. This is a very CPU-heavy workload, with ngram fields and very long queries. 16.7 million docs.

The whole cascading failure thing in search engines is hard. The first time I hit this was at Infoseek, over twenty years ago.

> On Mar 18, 2017, at 12:46 PM, Erick Erickson <[hidden email]> wrote:
>
> Bug #2: Solr shouldn't be adding replicas by itself unless you
> specified autoAddReplicas=true when you created the collection. It
> defaults to "false", so I'm not sure what's going on here.

    "autoAddReplicas":"false",

in both collections. I thought that only worked with HDFS anyway.
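
(I double-checked it with CLUSTERSTATUS, roughly like this; host and
collection name below are placeholders for our real ones:)

    # Rough sketch of the check; host and collection name are placeholders.
    import json
    import urllib.request

    url = ("http://localhost:8983/solr/admin/collections"
           "?action=CLUSTERSTATUS&collection=mycollection&wt=json")
    status = json.loads(urllib.request.urlopen(url).read().decode("utf-8"))
    coll = status["cluster"]["collections"]["mycollection"]
    print(coll.get("autoAddReplicas"), coll.get("maxShardsPerNode"))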

> Bug #3: The internal load balancers are round-robin across cores, so
> this is expected. Not optimal, I'll grant, but expected.

Right. Still a bug. Should be round-robin on instances, not cores.
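
Toy numbers for why per-core round-robin hurts here (nothing
Solr-specific, just the arithmetic of our layout after one host went
down and the extra replica landed on a surviving host):

    # Toy illustration: 16 cores on 15 live hosts, so one host ends up
    # holding 2 cores. Per-core round-robin gives that host double the
    # traffic of every other host.
    cores_per_host = {"host%02d" % i: 1 for i in range(1, 15)}  # 14 hosts, 1 core each
    cores_per_host["host15"] = 2   # the host that picked up the extra core
    total_cores = sum(cores_per_host.values())                  # 16

    for host, cores in sorted(cores_per_host.items()):
        print("%s gets %.2f%% of requests" % (host, 100.0 * cores / total_cores))
    # host15 gets 12.50%, every other host gets 6.25%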

> Bug #4: What shard placement rules are you using? There is a set of
> rules for replica placement, and one of the criteria (IIRC) is
> exactly to try to distribute a shard's replicas across different
> hosts, although there was some glitchiness about whether two JVMs on
> the same _host_ were considered "the same host" or not.

Separate Amazon EC2 instances, one JVM per instance, and no placement rules other than the default:

    "maxShardsPerNode":"1",

> Bug #1 has been more or less a pain for quite a while; work is ongoing there.

Glad to share our logs.

wunder



Re: Cascading failures with replicas

Erick Erickson
Hmmm, I'm totally mystified about how Solr is "creating a new replica
when one host is down". Are you saying this is happening
automagically? You're right that the autoAddReplicas bit is HDFS-only,
so having replicas just show up is completely weird. In days past,
when a replica was discovered on disk at Solr startup, it would
reconstruct itself _even if the collection had been deleted_. Those
replicas would reappear in clusterstate.json, _not_ the individual
state.json files under collections in ZK. Is this at all possible?
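
One way to check would be Solr's ZooKeeper admin endpoint, something
like this (a hedged sketch; host and collection name are placeholders
for the real ones):

    # Rough sketch: dump the legacy clusterstate.json and the
    # per-collection state.json through Solr's ZooKeeper admin endpoint
    # and see where the extra replicas show up. Host and collection
    # name are placeholders.
    import urllib.parse
    import urllib.request

    base = "http://localhost:8983/solr/admin/zookeeper?detail=true&wt=json&path="
    for path in ("/clusterstate.json", "/collections/mycollection/state.json"):
        url = base + urllib.parse.quote(path)
        print(path)
        print(urllib.request.urlopen(url).read().decode("utf-8")[:500])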

What version of Solr are you using anyway? The reconstruction I
mentioned above is 4x IIRC.

Best,
Erick


Re: Cascading failures with replicas

Walter Underwood
Solr 6.3.0. No idea how it is happening, but I got two replicas from the same shard on the same host after one host went down.

wunder
Walter Underwood
[hidden email]
http://observer.wunderwood.org/  (my blog)


