Inconsistent leader between ZK and Solr and a lot of downtime

Daniel Carrasco
Hello,

I'm investigating an 8-node Solr 7.2.1 cluster because we have a lot of
problems. For example, when a node fails to import from a DB (maybe it
freezes), the entire cluster goes down. In other cases the leader won't
change even when it is down (all nodes detect that it is down, but no
leader election is triggered). Every few days we have to recover the
cluster because it becomes unstable and goes down.

The latest problem is three collections that have had nodes in the
"recovery" state for many hours. The log shows an error saying that
"leader node is not the leader", so I'm trying to change the leader.
I shut down the "leader" (the other nodes detected it as down), waited
about 20 minutes, and then tried REBALANCELEADERS and FORCELEADER, but I'm
unable to change the leader on the cluster. That's why I started looking
at ZooKeeper. What I found there is that the leaders recorded in ZooKeeper
differ from the ones shown in the Solr admin cluster info, so maybe that's
why the nodes are unable to connect to the real leader and cannot finish
the recovery.
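
The comparison described above can be sketched in Python. This is only an
illustration under stated assumptions: it expects the parsed JSON of a
Collections API CLUSTERSTATUS response, plus the raw JSON stored in each
shard's leader znode (assumed here to live at
/collections/<collection>/leaders/<shard>/leader and to carry a "core"
field, as in Solr 7.x; verify both on your own ensemble). The function names
are made up for this example, and fetching the data is left out.

```python
import json


def solr_leaders(cluster_status):
    """Map (collection, shard) -> core name of the replica flagged as leader.

    `cluster_status` is the parsed JSON body of a CLUSTERSTATUS response.
    """
    leaders = {}
    collections = cluster_status.get("cluster", {}).get("collections", {})
    for coll_name, coll in collections.items():
        for shard_name, shard in coll.get("shards", {}).items():
            for replica in shard.get("replicas", {}).values():
                # CLUSTERSTATUS marks the leader replica with leader="true".
                if replica.get("leader") == "true":
                    leaders[(coll_name, shard_name)] = replica.get("core")
    return leaders


def leader_mismatches(cluster_status, zk_leader_json):
    """Return shards whose leader differs between the two sources.

    `zk_leader_json` maps (collection, shard) to the raw JSON string read
    from that shard's leader znode (assumed layout, see the text above).
    Each result is (collection, shard, core_per_solr, core_per_zk).
    """
    from_solr = solr_leaders(cluster_status)
    mismatches = []
    for key in sorted(set(from_solr) | set(zk_leader_json)):
        raw = zk_leader_json.get(key)
        zk_core = json.loads(raw).get("core") if raw else None
        solr_core = from_solr.get(key)
        if zk_core != solr_core:
            mismatches.append((key[0], key[1], solr_core, zk_core))
    return mismatches
```

An empty result means both sources agree on every shard. A shard that has a
leader in only one of the two sources is reported with None on the other
side.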

All traffic between the cluster nodes and ZK is open to avoid problems (the
VPC is private), so it is not a connection problem.

Is there any way to sync the leader info between Solr and ZK? I'd also like
to know whether there is a way to force a leader change (FORCELEADER doesn't
work when Solr refuses to change the leader because it says that a leader
already exists).
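
For reference, a FORCELEADER request of the kind mentioned above can be
built as in the following Python sketch. It only constructs the Collections
API URL; the helper names, host, collection, and shard are placeholders for
illustration, and sending the request is left to whatever HTTP client you
use.

```python
from urllib.parse import urlencode


def collections_api_url(base_url, action, **params):
    """Build a Collections API URL such as .../admin/collections?action=..."""
    query = urlencode({"action": action, **params, "wt": "json"})
    return f"{base_url.rstrip('/')}/admin/collections?{query}"


def force_leader_url(base_url, collection, shard):
    """URL for a FORCELEADER request on one shard of one collection."""
    return collections_api_url(
        base_url, "FORCELEADER", collection=collection, shard=shard
    )


# Example with placeholder values:
# force_leader_url("http://solr1:8983/solr", "mycollection", "shard1")
```

FORCELEADER is meant for a shard that has lost its leader, which fits the
behaviour described above: the request is rejected while Solr still
believes a leader exists.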

Thanks!
--
_________________________________________

      Daniel Carrasco Marín
      Ingeniería para la Innovación i2TIC, S.L.
      Tlf:  +34 911 12 32 84 Ext: 223
      www.i2tic.com
_________________________________________

Re: Inconsistent leader between ZK and Solr and a lot of downtime

Ben Knüttgen
Daniel Carrasco wrote

> Hello,
>
> I'm investigating an 8-node Solr 7.2.1 cluster because we have a lot of
> problems. For example, when a node fails to import from a DB (maybe it
> freezes), the entire cluster goes down. In other cases the leader won't
> change even when it is down (all nodes detect that it is down, but no
> leader election is triggered). Every few days we have to recover the
> cluster because it becomes unstable and goes down.
>
> The latest problem is three collections that have had nodes in the
> "recovery" state for many hours. The log shows an error saying that
> "leader node is not the leader", so I'm trying to change the leader.

Make sure that the clocks on your servers are in sync. Otherwise inter-node
authentication tokens could time out, which could lead to the problems you
described. You should find hints about the cause of the communication
problem in your Solr logs.
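
A quick way to check for clock drift is to read the current time on every
node at roughly the same moment and compare the values. The Python sketch
below assumes you have already collected those readings as epoch seconds
(for example by running a date command on each host); the function names
and the tolerance are placeholders for illustration, not Solr settings.

```python
def max_clock_skew(node_times):
    """Return (skew_seconds, earliest_node, latest_node).

    `node_times` maps a node name to the epoch seconds read on that node.
    With fewer than two nodes there is nothing to compare.
    """
    if len(node_times) < 2:
        return 0.0, None, None
    earliest = min(node_times, key=node_times.get)
    latest = max(node_times, key=node_times.get)
    return node_times[latest] - node_times[earliest], earliest, latest


def nodes_out_of_sync(node_times, tolerance_seconds):
    """True if the largest difference between two clocks exceeds the tolerance."""
    skew, _, _ = max_clock_skew(node_times)
    return skew > tolerance_seconds
```

The readings are never taken at exactly the same instant, so allow for the
collection delay when choosing the tolerance.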


