Solr 7 Nodes Stuck in "Gone" State


Solr 7 Nodes Stuck in "Gone" State

Marko Babic
Hi everyone, I ran into a situation in which a portion of the nodes in a Solr cluster were stuck in the GONE state (no ephemeral znode under /live_nodes) despite the Solr processes being alive and connected to the ZooKeeper tier. For context, this is a cluster running Solr 7.7.1 and ZooKeeper 3.4.13 (monitored by Exhibitor 1.7.1).

The events that led to this situation, in the order that they occurred:


  *   We have three ZooKeeper nodes: A, B, and C. A is the leader of the ensemble.
  *   ZooKeeper A becomes partitioned from ZooKeeper B and C and the Solr tier.
  *   Some Solr nodes log “zkclient has disconnected” warnings and ZooKeeper A expires some Solr client sessions due to timeouts. The partition between ZooKeeper A and the Solr tier ends, and Solr nodes that were connected to ZooKeeper A attempt to renew their sessions but are told their sessions have expired. [1]
     *   Note that I’m simplifying: some nodes that were connected to ZooKeeper A were able to move their sessions to ZooKeeper B/C before their session expired. [2]
  *   ZooKeeper A realizes it is not synced with ZooKeeper B and C and closes connections with Solr nodes and, apparently, remains partitioned from B/C.
  *   ZooKeeper B and C eventually elect ZooKeeper B as the leader and start accepting write requests as they form a quorum.
  *   Solr nodes previously connected to ZooKeeper whose sessions had expired now connect to ZooKeeper B and C, successfully publish their state as DOWN, and then attempt to write to /live_nodes to signal that they’re reconnected to ZooKeeper.
  *   The writes of the ephemeral znodes to /live_nodes fail with NodeExists exceptions [3] (see the sketch after this list). The failed writes are logged on ZooKeeper B. [4]
     *   It looks like a failure mode of “leader becomes partitioned and ephemeral znode deletions are not processed by followers” is documented on ZOOKEEPER-2348<https://jira.apache.org/jira/browse/ZOOKEEPER-2348>.
  *   ZooKeeper A eventually rejoins the ensemble, and the /live_nodes entries belonging to sessions that expired after the initial partition are removed when the session expirations are reprocessed on the new leader (ZooKeeper B). [5]
  *   The Solr nodes whose attempts at writing to /live_nodes failed never try again and remain in the GONE state for 6+ hours.
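
To make the failure mode concrete, here is a minimal sketch using the plain ZooKeeper client (not Solr’s actual code; Solr’s path is ZkController.createEphemeralLiveNode, shown in the trace at [3]). The hostnames, node name, and timeout are placeholders:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Minimal sketch, not Solr code: re-register a live-node entry after a session
// expiration and show why the create can fail with NodeExists. Placeholder values.
public class LiveNodeSketch {
    public static void main(String[] args) throws Exception {
        String connect = "zk-a:2181,zk-b:2181,zk-c:2181";   // hypothetical ensemble
        String path = "/live_nodes/10.0.0.1:8983_solr";     // hypothetical node entry
        ZooKeeper zk = new ZooKeeper(connect, 15000, event -> { });
        try {
            // A freshly reconnected session tries to register itself as live, once.
            zk.create(path, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println("registered as live");
        } catch (KeeperException.NodeExistsException e) {
            // The entry already exists. If its ephemeral owner is not our session id,
            // it belongs to an expired session whose deletion was never applied on the
            // new leader, which is the situation the GONE nodes appear to have hit.
            Stat stat = zk.exists(path, false);
            if (stat != null && stat.getEphemeralOwner() != zk.getSessionId()) {
                System.out.println("stale entry owned by session 0x"
                        + Long.toHexString(stat.getEphemeralOwner()));
            }
        } finally {
            zk.close();
        }
    }
}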


Restarting the GONE Solr nodes made them reconnect to ZooKeeper and everything recovered just fine after that.
My questions are:

  *   Is the sequence of events I’ve laid out plausible? Is there anything that stands out as being “wrong” that I should look into further?
  *   If not, is there anything I can/should be doing to prevent something like this from happening again, or at least to reduce the probability of it occurring? Or is this just a failure scenario that’s not being handled at the moment and merits a ticket?

I wasn’t able to find any mailing list thread or ticket that discussed a similar issue, though I’m also not confident my search was exhaustive. ☺ Apologies if this is documented somewhere.

Thanks!

Marko

[1] Log of ZooKeeper A expiring a Solr node session and not allowing the node to renew it:

2019-04-13 06:58:23.997 [SessionTracker] INFO  org.apache.zookeeper.server.ZooKeeperServer  – Expiring session 0x7000014eb60267d, timeout of 15000ms exceeded

2019-04-13 06:58:24.015 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] INFO  org.apache.zookeeper.server.ZooKeeperServer  – Client attempting to renew session 0x7000014eb60267d at /XXX.XXX.XXX.XXX:54792
2019-04-13 06:58:24.015 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] INFO  org.apache.zookeeper.server.ZooKeeperServer  – Invalid session 0x7000014eb60267d for client /XXX.XXX.XXX.XXX:54792, probably expired

2019-04-13 06:58:24.017 [ProcessThread(sid:9 cport:-1):] INFO  org.apache.zookeeper.server.PrepRequestProcessor  – Processed session termination for sessionid: 0x7000014eb60267d

[2] Solr node moving its session from ZooKeeper A to ZooKeeper B/C:

Solr:

2019-04-13 06:58:15.134000 [main-SendThread(zookeeper03.localdomain:2181)] WARN  org.apache.zookeeper.ClientCnxn [] – Client session timed out, have not heard from server in 10005ms for sessionid 0x900000676eb264b

ZooKeeper A:

2019-04-13 06:58:24.011 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] WARN  org.apache.zookeeper.server.NIOServerCnxn  – Unable to read additional data from client sessionid 0x900000676eb264b, likely client has closed socket
2019-04-13 06:58:24.011 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] INFO  org.apache.zookeeper.server.NIOServerCnxn  – Closed socket connection for client /XXX.XXX.XXX.XXX:55750 which had sessionid 0x900000676eb264b

2019-04-13 06:58:24.018 [ProcessThread(sid:9 cport:-1):] INFO  org.apache.zookeeper.server.PrepRequestProcessor  – Got user-level KeeperException when processing sessionid:0x900000676eb264b type:ping cxid:0xfffffffffffffffe zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error Path:null Error:KeeperErrorCode = Session moved

ZooKeeper B:

2019-04-13 06:58:16.068 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] INFO  org.apache.zookeeper.server.ZooKeeperServer  – Client attempting to renew session 0x900000676eb264b at /XXX.XXX.XXX.XXX:32836
2019-04-13 06:58:16.068 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] INFO  org.apache.zookeeper.server.quorum.Learner  – Revalidating client: 0x900000676eb264b


[3] Example log lines + stack trace of /live_nodes write failure:

2019-04-13 06:58:24.022000 [main-SendThread(zookeeper03.localdomain:2181)] WARN  org.apache.zookeeper.ClientCnxn [] – Unable to reconnect to ZooKeeper service, session ≈ has expired
2019-04-13 06:59:01.082000 [zkConnectionManagerCallback-6-thread-1] INFO  org.apache.solr.cloud.ZkController [] – ZooKeeper session re-connected ... refreshing core states after session expiration.
2019-04-13 06:59:01.101000 [zkConnectionManagerCallback-6-thread-1] INFO  org.apache.solr.cloud.ZkController [] – Publish node=XXX.XXX.XXX.XXX:8983_solr as DOWN
2019-04-13 06:59:01.105000 [zkConnectionManagerCallback-6-thread-1] INFO  org.apache.solr.cloud.ZkController [] – Register node as live in ZooKeeper:/live_nodes/XXX.XXX.XXX.XXX:8983_solr
2019-04-13 06:59:01.108000 [zkConnectionManagerCallback-6-thread-1] ERROR org.apache.solr.cloud.ZkController [] – :org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:122)
        at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1015)
        at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:919)
        at org.apache.solr.common.cloud.SolrZkClient.lambda$multi$9(SolrZkClient.java:604)
        at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:71)
        at org.apache.solr.common.cloud.SolrZkClient.multi(SolrZkClient.java:604)
        at org.apache.solr.cloud.ZkController.createEphemeralLiveNode(ZkController.java:1089)
        at org.apache.solr.cloud.ZkController.access$500(ZkController.java:149)
        at org.apache.solr.cloud.ZkController$1.command(ZkController.java:376)
        at org.apache.solr.common.cloud.ConnectionManager$1.update(ConnectionManager.java:177)
        at org.apache.solr.common.cloud.DefaultConnectionStrategy.reconnect(DefaultConnectionStrategy.java:58)
        at org.apache.solr.common.cloud.ConnectionManager.process(ConnectionManager.java:151)
        at org.apache.solr.common.cloud.SolrZkClient$3.lambda$process$0(SolrZkClient.java:285)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

[4] Example log line of the /live_nodes write failure in the ZooKeeper logs:

2019-04-13 06:59:01.405 [ProcessThread(sid:8 cport:-1):] INFO  org.apache.zookeeper.server.PrepRequestProcessor  – Got user-level KeeperException when processing sessionid:0x7006dfdb0d00004 type:multi cxid:0x1a zxid:0x2700000035 txntype:-1 reqpath:n/a aborting remaining multi ops. Error Path:/live_nodes/10.17.129.123:8983_solr Error:KeeperErrorCode = NodeExists for /live_nodes/XXX.XXX.XXX.XXX:8983_solr

[5] ZooKeeper B (new leader after ensemble partitions) processing old session expirations. Note that sessionId 0x7000014eb60267d is the same session that ZooKeeper A expired in [1].

2019-04-13 06:59:16.071 [SessionTracker] INFO  org.apache.zookeeper.server.ZooKeeperServer  – Expiring session 0x7000014eb60267d, timeout of 15000ms exceeded

2019-04-13 06:59:16.074 [ProcessThread(sid:8 cport:-1):] INFO  org.apache.zookeeper.server.PrepRequestProcessor  – Processed session termination for sessionid: 0x7000014eb60267d


Re: Solr 7 Nodes Stuck in "Gone" State

Marko Babic
Apologies for bumping my own post but I'm just wondering if it'd be more appropriate for me to cut a ticket at this point rather than ask the mailing list.

FWIW we've had no recurrences of the issue.


Re: Solr 7 Nodes Stuck in "Gone" State

Shawn Heisey-2
On 4/26/2019 9:23 AM, Marko Babic wrote:
> Apologies for bumping my own post but I'm just wondering if it'd be more appropriate for me to cut a ticket at this point rather than ask the mailing list.

No, the mailing list is the correct location.  If the discussion
determines that there is a bug to be fixed, then it would be appropriate
to open an issue in Jira.  For this project, Jira is not used as a
support portal.

Have you informed your Solr server about all three ZK servers in the
zkhost string, or is it only connecting to server A (the one that was
partitioned)?

I'm having a difficult time mentally following the partition description
you sent.  Almost certainly a failing on my part, not yours.  When I
have a little more free time, I will examine it closer.

From what I've been able to decipher so far, the zookeeper bug that you
linked sounds probable.  The bug only mentions 3.5.1 as an affected
version.  Solr 7.7.1 includes ZK 3.4.13.  I will comment on the bug to
find out whether it affects 3.4 versions as well.

Thanks,
Shawn

Re: Solr 7 Nodes Stuck in "Gone" State

Marko Babic
Thanks Shawn.

Yes, all Solr nodes know about all three ZK servers (i.e., the zk host string is of the form zk_a_ip:2181,zk_b_ip:2181,zk_c_ip:2181).
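
As a rough illustration (placeholder addresses, not something from our actual setup), listing /live_nodes directly against a connect string of that form with the plain ZooKeeper client looks like this; per my original post, a node with no entry under /live_nodes is what shows up as GONE:

import java.util.List;
import org.apache.zookeeper.ZooKeeper;

// Sketch only: connect with the full three-server string and list the registered
// live nodes. Placeholder addresses; any reachable ensemble member can answer.
public class ListLiveNodes {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk_a_ip:2181,zk_b_ip:2181,zk_c_ip:2181", 15000, event -> { });
        List<String> live = zk.getChildren("/live_nodes", false);
        live.forEach(System.out::println);
        zk.close();
    }
}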

Sorry for the dense description of things: I erred on the side of oversharing because I didn't want to leave out something useful, but I know it makes for an investment to read, so I really appreciate that you took the time. I'm obviously happy to clarify whatever I can.

I'll make sure to follow the discussion on ZOOKEEPER-2348.

Thanks again,

Marko


Re: Solr 7 Nodes Stuck in "Gone" State

Shawn Heisey-2
On 4/29/2019 10:55 AM, Marko Babic wrote:
> Thanks Shawn.
>
> Yes, all Solr nodes know about all three ZK servers (i.e., the zk host string is of the form zk_a_ip:2181,zk_b_ip:2181,zk_c_ip:2181).
>
> Sorry for the dense description of things: I erred on the side of oversharing because I didn't want to leave out something useful but I know it makes for an investment to read so I really appreciate that you took the time. I'm obviously happy to clarify whatever I can.

Trying to trace everything is making my head hurt. :)

Part of the problem is that I do not really know all that much about
ZK's internal operation.

I do know that ZK clients maintain continuous connections (as long as
they are able) to all of the servers in the zkhost string.  I'm guessing
that if one of the servers it can reach has been elected leader on the
ensemble, it will be preferred to all others for that client to talk to.

My reading says that ephemeral nodes should be deleted whenever the
client-server connection is lost for any reason.  If I read your writeup
correctly, somehow the network partition is interfering with this
process... the ephemeral node probably is deleted on A (the leader when
the partition begins) but it is not deleted on the new leader.  This
does sound like ZOOKEEPER-2348.

We probably need to take a look at how Solr handles its /live_nodes
entries.  I have not looked at this code, and have no idea how it works,
but here is what I can think of:

Perhaps each Solr node should update its ephemeral node on a timed
interval, say every 5 seconds.  Longer if the update operation creates a
lot of I/O.  If the node exists exception is encountered when trying to
create the node, the node should check the last updated timestamp, and
once it reaches an age of 30 or 60 seconds (definitely configurable),
the Solr node should assume that it's safe to delete and recreate.  The
log for this ought to be at WARN or ERROR (probably WARN) so they are
visible in the admin UI.  If some of the other devs who live in the
SolrCloud code could offer a review of this idea, I would appreciate it.
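
A rough sketch of that idea, purely to make it concrete (the class and method names are made up, the intervals are the placeholder values mentioned above, and this is not a proposed patch):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Hypothetical sketch, not Solr code: periodically re-assert the node's /live_nodes
// entry; if the entry exists but belongs to another (stale) session and has not been
// touched for a grace period, delete and recreate it.
public class LiveNodeMaintainer {
    private static final long GRACE_MS = 60_000;   // "30 or 60 seconds (definitely configurable)"
    private final ZooKeeper zk;
    private final String path;                     // e.g. /live_nodes/host:8983_solr

    LiveNodeMaintainer(ZooKeeper zk, String path) { this.zk = zk; this.path = path; }

    void start() {
        ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor();
        exec.scheduleAtFixedRate(this::assertLive, 0, 5, TimeUnit.SECONDS);   // "every 5 seconds"
    }

    private void assertLive() {
        try {
            Stat stat = zk.exists(path, false);
            if (stat == null) {
                zk.create(path, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            } else if (stat.getEphemeralOwner() == zk.getSessionId()) {
                zk.setData(path, new byte[0], -1);   // touch our own entry: refreshes mtime
            } else if (System.currentTimeMillis() - stat.getMtime() > GRACE_MS) {
                // Stale entry from some other session: would be logged at WARN for real.
                zk.delete(path, stat.getVersion());
                zk.create(path, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (KeeperException e) {
            // Would be logged and retried on the next tick.
        }
    }
}

One caveat with a time-based grace period is that it relies on the znode mtime actually being refreshed by the live owner, which is why the periodic touch is part of the idea; comparing the server-side mtime against a local clock also assumes reasonably synchronized clocks.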

In theory, two different Solr instances should never be trying to create
the same ephemeral znode.  In environments where servers are
automatically provisioned and started, I suppose it could happen.

I've updated the ZK issue with info from this thread.  I hope they can
comment on that.

Thanks,
Shawn