solr4.7: leader core is not elected to another active core after solr OS shutdown, known issue?


solr4.7: leader core is not elected to another active core after solr OS shutdown, known issue?

Jeff Wu
Our environment still runs Solr 4.7. We recently noticed in a test that when
we stopped one Solr server (solr02, via an OS shutdown), all of solr02's
cores were shown as "down", but a few of them remained leaders. After that,
the other servers kept sending requests to the down server, and we saw a lot
of TCP-waiting threads in their thread pools, since solr02 was already down.

"shard53":{
        "range":"26660000-2998ffff",
        "state":"active",
        "replicas":{
          "core_node102":{
            "state":"down",
            "base_url":"https://solr02.myhost/solr",
            "core":"collection2_shard53_replica1",
            "node_name":"https://solr02.myhost_solr",
            "leader":"true"},
          "core_node104":{
            "state":"active",
            "base_url":"https://solr04.myhost/solr",
            "core":"collection2_shard53_replica2",
            "node_name":"https://solr04.myhost/solr_solr"}}},

Is this a known bug in 4.7 that was fixed later? Is there a JIRA we can
study? If the Solr service is stopped gracefully, we see leader election
happen and the leader switch to another active core. But if we just shut
down the Solr host's OS directly, we can reproduce in our environment that
some "down" cores remain "leader" in the ZK clusterstate.json.
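For reference, the anomaly can be spotted by scanning clusterstate.json for replicas that are both "down" and "leader". A minimal sketch (the shard data is inlined here from the snippet above; in practice you would read the znode contents from ZK):

```python
import json  # in practice: json.loads() on the /clusterstate.json znode data

# Fragment of clusterstate.json, copied from the snippet above.
clusterstate = {
    "shard53": {
        "range": "26660000-2998ffff",
        "state": "active",
        "replicas": {
            "core_node102": {
                "state": "down",
                "base_url": "https://solr02.myhost/solr",
                "core": "collection2_shard53_replica1",
                "node_name": "https://solr02.myhost_solr",
                "leader": "true",
            },
            "core_node104": {
                "state": "active",
                "base_url": "https://solr04.myhost/solr",
                "core": "collection2_shard53_replica2",
                "node_name": "https://solr04.myhost/solr_solr",
            },
        },
    },
}

def down_leaders(shards):
    """Return (shard, replica) pairs that are marked down yet still leader."""
    return [
        (shard_name, replica_name)
        for shard_name, shard in shards.items()
        for replica_name, replica in shard["replicas"].items()
        if replica.get("leader") == "true" and replica.get("state") == "down"
    ]

print(down_leaders(clusterstate))  # -> [('shard53', 'core_node102')]
```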

Re: solr4.7: leader core is not elected to another active core after solr OS shutdown, known issue?

Shalin Shekhar Mangar
Hi Jeff,

Leader election relies on ephemeral nodes in ZooKeeper to detect
when the leader or other nodes have gone down abruptly. These ephemeral
nodes are automatically deleted by ZooKeeper after the ZK session
timeout, which is 30 seconds by default. So if you kill a node, it
can take up to 30 seconds for the cluster to detect it and start a new
leader election. This isn't necessary during a graceful shutdown,
because on shutdown the node gives up its leader position so that a
new one can be elected. You could tune the ZK session timeout to a
lower value, but that makes the cluster more sensitive to GC pauses,
which can also trigger new leader elections.
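As an illustration of the tuning mentioned above: in Solr 4.x the ZK session timeout can be set via `zkClientTimeout` in solr.xml. The value below is illustrative only; weigh any reduction against your typical GC pause lengths.

```
<!-- solr.xml (SolrCloud section); the 15000 ms value is illustrative -->
<solr>
  <solrcloud>
    <!-- ZK session timeout in ms: lower means faster failure detection,
         but more spurious leader elections under long GC pauses -->
    <int name="zkClientTimeout">15000</int>
  </solrcloud>
</solr>
```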

--
Regards,
Shalin Shekhar Mangar.

Re: solr4.7: leader core is not elected to another active core after solr OS shutdown, known issue?

Jeff Wu
Hi Shalin, thank you for the response.

We waited longer than the ZK session timeout, and it still did not kick
off a leader election for these cores that remained "down" leaders.
That's the question I'm actually asking.

Our test scenario:

1. Each Solr server has 64 cores; all are active and all are leader cores.
2. Shut down the Linux OS on one server.
3. Monitor clusterstate.json in ZK for longer than the ZK session timeout.
4. We saw leader elections happen for some cores, but some down cores
   still remained leaders.


Re: solr4.7: leader core is not elected to another active core after solr OS shutdown, known issue?

Shai Erera
I don't think the process Shalin describes applies to clusterstate.json.
That JSON object reflects the state Solr "knows" about, i.e. the last
known state. When Solr is shut down properly, I believe those attributes
are cleared from clusterstate.json, and the leaders give up their lease.

However, when Solr is killed, it takes ZK the 30-second (or so) session
timeout to delete the ephemeral node and release the leader lease. ZK is
unaware of Solr's clusterstate.json and cannot update the 'leader'
property to false; it simply releases the lease so that other cores may
claim it.

Perhaps that explains the confusion?

Shai
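In other words, the "leader":"true" flag in clusterstate.json can be stale, and a client should cross-check it against the ephemeral children of /live_nodes. A hypothetical sketch of that check (node names follow the snippet quoted in this thread; fetching /live_nodes from ZK is assumed):

```python
def effective_leader(shard, live_nodes):
    """Return the replica name that can actually act as leader, or None.

    clusterstate.json may still flag a dead node as leader; trust the
    flag only if the node also appears among ZK's ephemeral /live_nodes.
    """
    for name, replica in shard["replicas"].items():
        if replica.get("leader") == "true" and replica["node_name"] in live_nodes:
            return name
    return None  # stale leader flag: an election is still needed

# Shard state as quoted earlier in the thread:
shard53 = {
    "replicas": {
        "core_node102": {"state": "down", "leader": "true",
                         "node_name": "https://solr02.myhost_solr"},
        "core_node104": {"state": "active",
                         "node_name": "https://solr04.myhost/solr_solr"},
    },
}

# solr02 is gone, so its ephemeral /live_nodes entry has expired:
live = {"https://solr04.myhost/solr_solr"}
print(effective_leader(shard53, live))  # -> None (the leader flag is stale)
```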


Re: solr4.7: leader core is not elected to another active core after solr OS shutdown, known issue?

Jeff Wu
Hi Shai, still the same question: other peer cores that are active did not
claim leadership even after a long time. However, some of the peer cores
did claim leadership earlier, while the server was stopping. That's an
inconsistent result.

--
Jeff Wu
---------------------------
CSDL Beijing, China

Re: solr4.7: leader core is not elected to another active core after solr OS shutdown, known issue?

Gili Nachum-2
Happens to us too, on Solr 4.7.2.