leader election stuck after hosts restarts

Pierre Salagnac
Hello,
We had a leader election get stuck for one shard.

We have collections with 2 shards, and each shard has 5 replicas. We have many
collections, but the issue happened for a single shard. Once all host
restarts had completed, this shard was stuck with one replica in the
"recovering" state and all the others in the "down" state.

Here is the state of the shard as returned by the CLUSTERSTATUS command:
      "replicas":{
        "core_node3":{
          "core":"...._shard1_replica_n1",
          "base_url":"https://host1:8983/solr",
          "node_name":"host1:8983_solr",
          "state":"recovering",
          "type":"NRT",
          "force_set_state":"false"},
        "core_node9":{
          "core":"...._shard1_replica_n6",
          "base_url":"https://host2:8983/solr",
          "node_name":"host2:8983_solr",
          "state":"down",
          "type":"NRT",
          "force_set_state":"false"},
        "core_node26":{
          "core":"...._shard1_replica_n25",
          "base_url":"https://host3:8983/solr",
          "node_name":"host3:8983_solr",
          "state":"down",
          "type":"NRT",
          "force_set_state":"false"},
        "core_node28":{
          "core":"...._shard1_replica_n27",
          "base_url":"https://host4:8983/solr",
          "node_name":"host4:8983_solr",
          "state":"down",
          "type":"NRT",
          "force_set_state":"false"},
        "core_node34":{
          "core":"...._shard1_replica_n33",
          "base_url":"https://host5:8983/solr",
          "node_name":"host5:8983_solr",
          "state":"down",
          "type":"NRT",
          "force_set_state":"false"}}}
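
To spot this situation across many collections, the CLUSTERSTATUS output can be scanned for shards where no replica is both active and flagged as leader. The following is a minimal sketch, assuming the usual JSON layout of the response (cluster -> collections -> shards -> replicas) and that the leader replica carries "leader":"true". The function name and the "mycoll" collection name are placeholders, not taken from our cluster.

```python
from typing import Dict, List


def shards_without_leader(cluster_status: Dict) -> List[str]:
    """Return "collection/shard" for every shard with no active leader replica."""
    stuck = []
    collections = cluster_status.get("cluster", {}).get("collections", {})
    for coll_name, coll in collections.items():
        for shard_name, shard in coll.get("shards", {}).items():
            replicas = shard.get("replicas", {}).values()
            has_leader = any(
                r.get("leader") == "true" and r.get("state") == "active"
                for r in replicas
            )
            if not has_leader:
                stuck.append(f"{coll_name}/{shard_name}")
    return stuck


# Trimmed-down response mirroring the states above ("mycoll" is a placeholder).
sample = {
    "cluster": {"collections": {"mycoll": {"shards": {
        "shard1": {"replicas": {
            "core_node3": {"state": "recovering", "type": "NRT"},
            "core_node9": {"state": "down", "type": "NRT"},
        }},
        "shard2": {"replicas": {
            "core_node5": {"state": "active", "type": "NRT", "leader": "true"},
        }},
    }}}}
}

print(shards_without_leader(sample))  # ['mycoll/shard1']
```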

The workaround was to shut down server host1, which hosted the replica stuck
in the "recovering" state. This unblocked leader election, and the 4 other
replicas went active.

Here is the first error I found in the logs related to this shard. It happened
while shutting down server host3, which hosted the leader at that time:
(updateExecutor-5-thread-33908-processing-x:..._shard1_replica_n25 r:core_node26 null n:... s:shard1) [c:... s:shard1 r:core_node26 x:..._shard1_replica_n25] o.a.s.c.s.i.ConcurrentUpdateHttp2SolrClient Error consuming and closing http response stream. => java.nio.channels.AsynchronousCloseException
    at org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:316)
java.nio.channels.AsynchronousCloseException: null
    at org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:316)
    at java.io.InputStream.read(InputStream.java:205) ~[?:?]
    at org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:287)
    at org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.sendUpdateStream(ConcurrentUpdateHttp2SolrClient.java:283)
    at org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.run(ConcurrentUpdateHttp2SolrClient.java:176)
    at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:181)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
    at java.lang.Thread.run(Thread.java:834) [?:?]

My understanding is that, following this error, each server restart left the
replica on that server in the "down" state, but I'm not sure how to
confirm that.
We then entered a loop in which the term keeps increasing because of failed
replication.
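
For context on the term loop: as far as I understand, Solr stores a per-replica term for each shard in ZooKeeper (under /collections/<collection>/terms/<shard>) as a JSON map, and a replica may only become leader if it holds the highest term and has no "<name>_recovering" marker. The sketch below applies that rule to a terms payload. The rule as coded is my reading of Solr's behavior rather than a copy of its implementation, and the term values in the usage note are made up for illustration, not read from our cluster.

```python
import json
from typing import Dict, List

RECOVERING_SUFFIX = "_recovering"


def leader_candidates(terms_json: str) -> List[str]:
    """Return replicas that hold the highest term and are not marked recovering."""
    terms: Dict[str, int] = json.loads(terms_json)
    # Entries ending in "_recovering" are markers, not replicas.
    replica_terms = {
        name: term for name, term in terms.items()
        if not name.endswith(RECOVERING_SUFFIX)
    }
    if not replica_terms:
        return []
    highest = max(replica_terms.values())
    return sorted(
        name for name, term in replica_terms.items()
        if term == highest and name + RECOVERING_SUFFIX not in terms
    )
```

With hypothetical terms such as {"core_node3": 5, "core_node3_recovering": 5, "core_node9": 4, "core_node26": 4}, the function returns an empty list: the only replica with the highest term is still marked as recovering, so under this rule nobody is eligible to lead.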

Is this a known issue? I found no similar ticket in Jira.
Could you please help us get a better understanding of the issue?
Thanks

Re: leader election stuck after hosts restarts

matthew sporleder
When this has happened to me before, I have had pretty good luck
restarting the overseer leader, which can be found in ZooKeeper under
/overseer_elect/leader

If that doesn't work I've had to do more intrusive and manual recovery
methods, which suck.
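
Besides reading the znode directly, the overseer leader can also be looked up through the Collections API with action=OVERSEERSTATUS, whose JSON response includes a "leader" field holding the node name. Here is a small sketch of that lookup. The helper names are mine, and the HTTP call itself is left out so the snippet stays self-contained: fetch the URL with whatever client and authentication your cluster requires.

```python
from urllib.parse import urlencode


def overseer_status_url(base_url: str) -> str:
    """Build the Collections API URL for OVERSEERSTATUS on the given Solr base URL."""
    query = urlencode({"action": "OVERSEERSTATUS", "wt": "json"})
    return f"{base_url.rstrip('/')}/admin/collections?{query}"


def overseer_leader(response: dict) -> str:
    """Extract the overseer leader node name from a parsed OVERSEERSTATUS response."""
    return response["leader"]


print(overseer_status_url("https://host1:8983/solr"))
```

The node name returned (for example host2:8983_solr) tells you which Solr instance to restart.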

Re: leader election stuck after hosts restarts

Phill Campbell
In reply to this post by Pierre Salagnac
Which version of Apache Solr?

Re: leader election stuck after hosts restarts

Pierre Salagnac
Sorry, I missed this detail.
We are running Solr 8.2.
Thanks

Re: leader election stuck after hosts restarts

Alessandro Benedetti
I faced these problems a while ago, and at the time I wrote a blog post
that I hope can help:
https://sease.io/2018/05/solrcloud-leader-election-failing.html



-----
---------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io

Re: leader election stuck after hosts restarts

Pierre Salagnac
Thanks Alessandro.

We found this Jira ticket that may be the root cause of this issue:
https://issues.apache.org/jira/browse/SOLR-14356
I'm not sure whether it is the reason the leader election initially
failed, but it prevents Solr from exiting this error loop.
