Solr 7.3 cluster issue

Solr 7.3 cluster issue

oandgtc
Happy holidays folks. We have a production deployment using Solr 7.3 in a three-node cluster. We have a number of collections set up, each with three shards and a replication factor of 2. The system has been fine, but we experienced issues with disk space on one of the nodes.

Node 0 starts but does not show any cores or replicas, and solr.log is full of entries like "o.a.s.c.ZkController org.apache.solr.common.SolrException: Replica core_node7 is not present in cluster state: null"

Node 1 and Node 2 are OK; all data from all collections is accessible.

Can I recreate node 0 as though it had failed completely? Is it OK to remove the references to the missing replicas and recreate them? Could you give me some guidance on the safest way to reintroduce node 0, given our situation?

Many thanks

Dave

Re: Solr 7.3 cluster issue

Jan Høydahl / Cominvent
I wonder what clusterstate actually says. I can think of three things that could possibly heal the cluster:

A rolling restart of all nodes may make Solr heal itself, but the risk is that some shards may not have a replica, and if you get stuck in recovery during the restart you will have downtime.

Another way could be to use the admin UI to remove all replicas from the defunct node, then reboot or reinstall that node, add back the missing replicas, and let Solr replicate the shards to the new node.

A third, more defensive way is to add a fourth node, add replicas to it to make all collections redundant, then remove the replicas from the defunct node and finally decommission it.
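As a rough illustration of the order of operations in the third option, here is a minimal Python sketch that only builds the Collections API URLs (ADDREPLICA first, DELETEREPLICA second) and sends nothing to Solr. The base URL, collection, shard, and node names are hypothetical placeholders. Whether a DELETEREPLICA is needed at all depends on what the cluster state actually lists for the defunct node.

```python
from urllib.parse import urlencode


def collections_api_url(base_url, action, **params):
    """Build a Collections API URL. Nothing is sent to Solr here."""
    query = urlencode({"action": action, **params})
    return f"{base_url}/admin/collections?{query}"


def plan_replica_move(base_url, collection, shard, new_node, old_replica):
    """Return the two calls in the defensive order: add first, delete second."""
    add = collections_api_url(
        base_url, "ADDREPLICA",
        collection=collection, shard=shard, node=new_node,
    )
    delete = collections_api_url(
        base_url, "DELETEREPLICA",
        collection=collection, shard=shard, replica=old_replica,
    )
    return [add, delete]


# Hypothetical values, for illustration only.
for url in plan_replica_move(
    "http://node1:8983/solr", "mycoll", "shard1",
    "node3:8983_solr", "core_node7",
):
    print(url)
```

Reviewing the generated URLs before issuing them, and waiting for each new replica to become active before deleting the old one, keeps every shard redundant throughout.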

Jan Høydahl

> On 28 Dec 2019, at 02:17, David Barnett <[hidden email]> wrote:
>
> Happy holidays folks. We have a production deployment using Solr 7.3 in a three-node cluster. We have a number of collections set up, each with three shards and a replication factor of 2. The system has been fine, but we experienced issues with disk space on one of the nodes.
>
> Node 0 starts but does not show any cores or replicas, and solr.log is full of entries like "o.a.s.c.ZkController org.apache.solr.common.SolrException: Replica core_node7 is not present in cluster state: null"
>
> Node 1 and Node 2 are OK; all data from all collections is accessible.
>
> Can I recreate node 0 as though it had failed completely? Is it OK to remove the references to the missing replicas and recreate them? Could you give me some guidance on the safest way to reintroduce node 0, given our situation?
>
> Many thanks
>
> Dave

Re: Solr 7.3 cluster issue

Erick Erickson
+1 to Jan’s comments, especially the idea of adding a 4th node and doing your ADDREPLICAs to that before doing the DELETEREPLICAs for the replicas on the sick node. I’ve used this to bring clusters back to health. This assumes you have at least one active leader for all shards.

That ZK error is weird; what’s the full stack trace?
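One way to check the "active leader for every shard" assumption is to inspect the JSON returned by the Collections API CLUSTERSTATUS action. The sketch below works on an already-parsed response; the sample data is hypothetical and only mimics the general shape of that response (collections, then shards, then replicas with "state" and "leader" fields).

```python
def shards_without_active_leader(cluster_status):
    """List (collection, shard) pairs that have no replica marked as an active leader."""
    missing = []
    collections = cluster_status.get("cluster", {}).get("collections", {})
    for coll_name, coll in collections.items():
        for shard_name, shard in coll.get("shards", {}).items():
            has_leader = any(
                str(replica.get("leader", "false")).lower() == "true"
                and replica.get("state") == "active"
                for replica in shard.get("replicas", {}).values()
            )
            if not has_leader:
                missing.append((coll_name, shard_name))
    return missing


# Hypothetical sample: shard1 has an active leader, shard2 does not.
sample_status = {
    "cluster": {
        "collections": {
            "mycoll": {
                "shards": {
                    "shard1": {"replicas": {
                        "core_node1": {"state": "active", "leader": "true"},
                    }},
                    "shard2": {"replicas": {
                        "core_node4": {"state": "down"},
                    }},
                }
            }
        }
    }
}
print(shards_without_active_leader(sample_status))
```

An empty result would suggest it is reasonable to go ahead with the ADDREPLICA calls; any shard listed needs attention first.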

Best,
Erick


> On Dec 28, 2019, at 9:10 AM, Jan Høydahl <[hidden email]> wrote:
>
> I wonder what clusterstate actually says. I can think of three things that could possibly heal the cluster:
>
> A rolling restart of all nodes may make Solr heal itself, but the risk is that some shards may not have a replica, and if you get stuck in recovery during the restart you will have downtime.
>
> Another way could be to use the admin UI to remove all replicas from the defunct node, then reboot or reinstall that node, add back the missing replicas, and let Solr replicate the shards to the new node.
>
> A third, more defensive way is to add a fourth node, add replicas to it to make all collections redundant, then remove the replicas from the defunct node and finally decommission it.
>
> Jan Høydahl


Re: Solr 7.3 cluster issue

oandgtc
In reply to this post by Jan Høydahl / Cominvent
Hi Jan et al,

clusterstate shows all cores and replicas on nodes 1 and 2, but node 0 is empty. On all three nodes, live_nodes shows the correct three node addresses.
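For anyone wanting to confirm the same picture from a script, here is a small sketch that counts replicas per live node from a parsed CLUSTERSTATUS response. The node names and sample data are hypothetical; the field names ("live_nodes", "node_name") follow the general shape of that response.

```python
from collections import Counter


def replicas_per_live_node(cluster_status):
    """Count replicas per node, starting every live node at zero."""
    cluster = cluster_status.get("cluster", {})
    counts = Counter({node: 0 for node in cluster.get("live_nodes", [])})
    for coll in cluster.get("collections", {}).values():
        for shard in coll.get("shards", {}).values():
            for replica in shard.get("replicas", {}).values():
                counts[replica.get("node_name", "unknown")] += 1
    return dict(counts)


# Hypothetical sample: three live nodes, replicas only on node1 and node2.
sample_cluster = {
    "cluster": {
        "live_nodes": ["node0:8983_solr", "node1:8983_solr", "node2:8983_solr"],
        "collections": {
            "mycoll": {"shards": {"shard1": {"replicas": {
                "core_node1": {"node_name": "node1:8983_solr", "state": "active"},
                "core_node2": {"node_name": "node2:8983_solr", "state": "active"},
            }}}}
        },
    }
}
counts = replicas_per_live_node(sample_cluster)
empty_nodes = [node for node, n in counts.items() if n == 0]
print(counts, empty_nodes)
```

A live node with a count of zero matches what we see on node 0: it is registered in live_nodes but hosts no replicas in the cluster state.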

Thanks for the advice, we will use a 4th node.