Adding replica to a shard with only down replicas


tedsolr
Solr 5.5.4. I have a collection with a single shard and two replicas, each
on a different node. Both replicas are reporting down, and no shard leader
exists. Is it safe to attempt an ADDREPLICA command? Since there's no
leader, I don't know if that will work. This is the cluster state for the
collection:

"SCHN":{
        "shards":{"shard1":{
            "range":"80000000-7fffffff",
            "state":"active",
            "replicas":{
              "core_node6":{
                "core":"SCHN_shard1_replica5",
                "base_url":"http://----:8983/solr",
                "node_name":"----:8983_solr",
                "state":"down"},
              "core_node5":{
                "core":"SCHN_shard1_replica2",
                "base_url":"http://----8983/solr",
                "node_name":"----:8983_solr",
                "state":"down"}}}},
        "replicationFactor":"2",
        "router":{"name":"compositeId"},
        "maxShardsPerNode":"1",
        "autoAddReplicas":"false",
        "znodeVersion":1127,
        "configName":"default"},

The logs show repeated errors for: ERROR
org.apache.solr.common.SolrException Error while trying to recover.
core=SCHN_shard1_replica5:org.apache.solr.common.SolrException: No
registered leader was found after waiting for 4000ms , collection: SCHN
slice: shard1

I've already tried bringing the nodes down and then back up.



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Adding replica to a shard with only down replicas

Erick Erickson
Adding a new replica won’t do you much good. Since there’s
no leader, it won’t (well, shouldn’t) sync the index.

Did you try the collections API FORCELEADER? It was put in as
a last resort for this kind of situation.
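
For reference, here is a minimal sketch of what that request looks like. It
only builds the Collections API URL and does not send anything. The base URL
is a placeholder (an assumption, not taken from this thread), and the helper
name force_leader_url is made up for illustration. The collection and shard
names come from the cluster state above.

```python
from urllib.parse import urlencode


def force_leader_url(base_url, collection, shard):
    """Build a Collections API FORCELEADER request URL (does not send it).

    base_url is the Solr root, e.g. "http://localhost:8983/solr"
    (a placeholder host, not one from this thread).
    """
    params = {
        "action": "FORCELEADER",
        "collection": collection,
        "shard": shard,
        "wt": "json",
    }
    return f"{base_url.rstrip('/')}/admin/collections?{urlencode(params)}"


# Collection and shard names as they appear in the posted cluster state.
print(force_leader_url("http://localhost:8983/solr", "SCHN", "shard1"))
```

The printed URL can then be issued with curl or a browser against any live
node in the cluster.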

Best,
Erick

> On Feb 13, 2020, at 3:22 PM, tedsolr <[hidden email]> wrote:
>
> [...]

Re: Adding replica to a shard with only down replicas

tedsolr
Yes, I did, Erick, and that didn't do it. What about manually manipulating
the ZooKeeper data? Rather than telling the customer they need to rebuild
from scratch, I'd prefer to attempt some last-minute heroics.




Re: Adding replica to a shard with only down replicas

lstusr 5u93n4
We've seen this type of deadlock pretty often. Our recourse is to restart
Solr on only one of the nodes. This seems to force the leader election to
take place, and it soon starts rebuilding.

Let me know if you try that and it works... Wouldn't mind another
validation point that this happens to others...

Good luck!

On Fri, 14 Feb 2020 at 09:20, tedsolr <[hidden email]> wrote:

> [...]

Re: Adding replica to a shard with only down replicas

lstusr 5u93n4
Actually, I should clarify: we stop Solr on one of the nodes, wait for the
other node to become the leader, and then start Solr back up on the one
that was stopped.
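
To make "wait for the other node to become the leader" a bit more concrete,
here is a rough sketch of checking a collection's cluster state for a leader.
It assumes the state has already been fetched and parsed into a dict shaped
like the JSON posted at the top of the thread, and that an elected leader
shows up as a replica carrying "leader":"true". The helper name find_leader
and the sample data below are illustrative only.

```python
def find_leader(collection_state, shard):
    """Return the core_node name of the shard's active leader, or None.

    collection_state is the parsed cluster state of one collection,
    i.e. a dict with a "shards" key as in the JSON posted above.
    """
    replicas = collection_state["shards"][shard]["replicas"]
    for name, replica in replicas.items():
        # Require "active" too, so a stale leader flag on a down replica
        # is not mistaken for a working leader.
        if replica.get("leader") == "true" and replica.get("state") == "active":
            return name
    return None


# The state from the original post: both replicas down, no leader.
stuck = {
    "shards": {"shard1": {"replicas": {
        "core_node6": {"core": "SCHN_shard1_replica5", "state": "down"},
        "core_node5": {"core": "SCHN_shard1_replica2", "state": "down"},
    }}}
}

# A hypothetical state after one node has come back and won the election.
recovered = {
    "shards": {"shard1": {"replicas": {
        "core_node6": {"core": "SCHN_shard1_replica5", "state": "active",
                       "leader": "true"},
        "core_node5": {"core": "SCHN_shard1_replica2", "state": "down"},
    }}}
}

print(find_leader(stuck, "shard1"))
print(find_leader(recovered, "shard1"))
```

Once this returns a replica name for the node left running, it should be
reasonable to start the stopped node again.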

On Fri, 14 Feb 2020 at 09:41, lstusr 5u93n4 <[hidden email]> wrote:

> [...]

Re: Adding replica to a shard with only down replicas

Erick Erickson
Yes, you can manually manipulate the data in ZooKeeper, but as you
say, that’s a “heroic” option. Still, even if it ends up totally messed up,
you’re no worse off. You can use bin/solr zk… to copy individual znodes up
and down, or use any of the various ZooKeeper tools that do the same, if you
have them.

It’s also possible to shut the whole cluster down and bring up one and
only one node. NOTE: there’s something like a 3-minute wait before
the leader can be elected, so you can’t be impatient.

It should also be possible to create a parallel collection, leader only. By
parallel I mean the same number of shards, leader only. Then shut it down,
copy the corresponding data directory over from the sick collection,
and start the new collection back up. Assuming it comes back, either
use collection aliasing to point to it or reverse the process. Take extreme
care to copy from the same shard range…. In fact, it might be easiest to
copy the index by using the replication API to issue a fetchindex that pulls
the index from the sick node to the new one. That’s a low-level command that
bypasses SolrCloud. All it needs is an HTTP connection between the source and
target machines.
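
As a rough illustration of the fetchindex route, the sketch below builds the
request URL without sending it. The request goes to the core that should
receive the index, and masterUrl names the core to pull from. The host names
and the new collection's core name are placeholders I made up, and the helper
name fetchindex_url is hypothetical. Only SCHN_shard1_replica5 comes from
this thread.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def fetchindex_url(target_core_url, source_core_url):
    """Build a replication-handler fetchindex URL (does not send it).

    target_core_url: core that should receive the index (the new one).
    source_core_url: core to copy the index from (the sick one).
    """
    params = {
        "command": "fetchindex",
        "masterUrl": f"{source_core_url.rstrip('/')}/replication",
    }
    return f"{target_core_url.rstrip('/')}/replication?{urlencode(params)}"


# Placeholder hosts and a placeholder core name for the parallel collection.
url = fetchindex_url(
    "http://new-node:8983/solr/SCHN2_shard1_replica1",
    "http://sick-node:8983/solr/SCHN_shard1_replica5",
)
print(url)

# Decode the query string again to show what the target core is asked to do.
print(parse_qs(urlsplit(url).query))
```

The masterUrl value is percent-encoded in the printed URL, which is what you
want when pasting it into curl.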

Best,
Erick


> On Feb 14, 2020, at 9:49 AM, lstusr 5u93n4 <[hidden email]> wrote:
>
> [...]

Re: Adding replica to a shard with only down replicas

tedsolr
Overnight the replicas with a state of "down" changed to "recovery_failed".
Nothing I did. So I brought down both nodes, then started one and waited 5
min. A leader was elected, and then I started the other node. So luckily no
heroics were needed.

I'll remember your advice about creating a parallel collection and copying
the data directory.




Re: Adding replica to a shard with only down replicas

Erick Erickson
Glad it worked out. I like to avoid heroics whenever possible ;)…

It can take quite some time for Solr to give up for good, and
waiting 10-15 minutes for something to change seems like an eternity.

What’s happening here is that the node attempts to recover but fails for some
reason. So it backs off and tries again. And again… before throwing
in the towel.
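
The general pattern looks something like the sketch below. To be clear, this
is not Solr's recovery code, just a generic bounded retry with a growing
delay. The function name, attempt count and delays are all made up for
illustration.

```python
import time


def recover_with_backoff(attempt, max_attempts=5, base_delay=1.0,
                         sleep=time.sleep):
    """Call attempt() until it returns True or the attempts run out.

    Returns the number of the attempt that succeeded, or None after
    giving up. The delay doubles after each failed attempt.
    """
    delay = base_delay
    for n in range(1, max_attempts + 1):
        if attempt():
            return n
        if n < max_attempts:
            sleep(delay)  # back off before trying again
            delay *= 2
    return None  # throw in the towel


# Simulate two failed recoveries followed by a success, recording the
# delays in a list instead of actually sleeping.
outcomes = iter([False, False, True])
delays = []
print(recover_with_backoff(lambda: next(outcomes), sleep=delays.append))
print(delays)
```

With long enough delays and enough attempts, it is easy to see how the whole
cycle can run for many minutes before the state finally changes.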

The parallel collection is also kind of a last-ditch thing to try, but at least
it keeps the old collection around so you can try heroics if the parallel
collection doesn’t work ;).

Best,
Erick

> On Feb 14, 2020, at 11:53 AM, tedsolr <[hidden email]> wrote:
>
> [...]