Question about "No registered leader" error

interma
Hi all,
I got the following error during an indexing operation:

"2019-09-18 02:35:44.427244 ... No registered leader was found after waiting for 4000ms , collection: foo slice: shard2"

Apart from this, there are no other errors in the Solr log.

Collection foo has 2 shards, so I checked their JVM GC logs:

  *   2019-09-18T02:34:08.252+0000: 150961.017: Total time for which application threads were stopped: 10.4617864 seconds, Stopping threads took: 0.0005226 seconds

  *   2019-09-18T02:34:30.194+0000: 151014.108: Total time for which application threads were stopped: 44.4809415 seconds, Stopping threads took: 0.0005976 seconds

There were long GC pauses shortly before the error.
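
For anyone who wants to scan a whole GC log for stops like these, here is a minimal sketch. It assumes the log lines have the same shape as the two excerpts above; the function name `long_pauses` and the threshold parameter are illustrative and not part of Solr or the JVM. The 4.0 second threshold in the usage lines is simply the 4000ms wait reported in the error message.

```python
import re

# Matches lines of the form shown in the excerpts above, e.g.
# "2019-09-18T02:34:30.194+0000: 151014.108: Total time for which
#  application threads were stopped: 44.4809415 seconds, ..."
PAUSE_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T[\d:.]+[+-]\d{4}): [\d.]+: "
    r"Total time for which application threads were stopped: "
    r"(?P<secs>[\d.]+) seconds"
)


def long_pauses(lines, threshold_secs):
    """Return (timestamp, seconds) for every stop longer than threshold_secs."""
    found = []
    for line in lines:
        m = PAUSE_RE.search(line)
        if m and float(m.group("secs")) > threshold_secs:
            found.append((m.group("ts"), float(m.group("secs"))))
    return found


# The two lines quoted in this post, used as sample input.
sample = [
    "2019-09-18T02:34:08.252+0000: 150961.017: Total time for which "
    "application threads were stopped: 10.4617864 seconds, "
    "Stopping threads took: 0.0005226 seconds",
    "2019-09-18T02:34:30.194+0000: 151014.108: Total time for which "
    "application threads were stopped: 44.4809415 seconds, "
    "Stopping threads took: 0.0005976 seconds",
]

for ts, secs in long_pauses(sample, threshold_secs=4.0):
    print(f"{ts}  stopped for {secs:.1f}s")
```

To run it against a real file, pass the open file object instead of `sample`; the path to the GC log depends on your installation.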

My questions:

  *   Could this error be caused by a long GC pause? My Solr zkClientTimeout is 60000.
  *   If so, how can I prevent it? My thoughts: use the G1 collector (as in https://cwiki.apache.org/confluence/display/SOLR/ShawnHeisey#ShawnHeisey-GCTuningforSolr) or increase zkClientTimeout further. What do you think?


Thanks.


Re: Question about "No registered leader" error

Shawn Heisey-2
On 9/17/2019 9:35 PM, Hongxu Ma wrote:
> My questions:
>
>    *   Is this error possible caused by "long gc pause"? my solr zkClientTimeout=60000

It's possible.  I can't say for sure that this is the issue, but it
might be.

>    *   If so, how can I prevent this error happen? My thoughts: using G1 collector (as https://cwiki.apache.org/confluence/display/SOLR/ShawnHeisey#ShawnHeisey-GCTuningforSolr) or enlarge zkClientTimeout again, what's your idea?

If your ZK server ticktime setting is the typical value of 2000, that
means that the largest value you can use for the ZK timeout (which
Solr's zkClientTimeout value ultimately gets used to set) is 40 seconds
-- 20 times the ticktime is the biggest value ZK will allow.

So if your ZK server ticktime is 2000 milliseconds, you're not actually
getting 60 seconds, and I don't know what happens when you try ... I
would expect ZK to either just use its max value or ignore the setting
entirely, and I do not know which it is.  That's something we should ask
the ZK mailing list and/or do testing on.
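
The arithmetic above can be sketched as follows. This is a sketch under an assumption, not a statement of what ZK does in every version: it assumes the server clamps the requested session timeout into the range 2 x tickTime to 20 x tickTime (the documented defaults for minSessionTimeout and maxSessionTimeout) rather than rejecting it, which is worth verifying against your ZK version. The function name and parameters are made up for illustration.

```python
def effective_session_timeout(requested_ms, tick_time_ms,
                              min_ticks=2, max_ticks=20):
    """Clamp a requested ZK session timeout to [min_ticks, max_ticks] * tickTime.

    Assumption: the ZK server negotiates the timeout by clamping, using the
    default multipliers of 2 and 20. Both multipliers can be overridden on the
    server via minSessionTimeout / maxSessionTimeout.
    """
    lower = min_ticks * tick_time_ms
    upper = max_ticks * tick_time_ms
    return max(lower, min(requested_ms, upper))


# zkClientTimeout=60000 with the typical tickTime of 2000 ms
print(effective_session_timeout(60000, 2000))

# The same request with a larger tickTime of 5000 ms
print(effective_session_timeout(60000, 5000))
```

Under that assumption, a 60000 ms request only survives intact when 20 x tickTime is at least 60000 ms, that is, when tickTime is 3000 ms or more.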

Dealing with the "no registered leader" problem will probably
involve restarting at least one of the Solr server JVMs in the cloud,
and if that doesn't work, restarting all of them.

What version of Solr do you have, and what is your max heap?  The CMS
garbage collection that Solr 5.0 and later incorporate by default is
pretty good.  My G1 settings might do slightly better, but the
improvement won't be dramatic unless your existing commandline has
absolutely no gc tuning at all.

Thanks,
Shawn

Re: Question about "No registered leader" error

Shawn Heisey-2
On 9/18/2019 6:11 AM, Shawn Heisey wrote:
> On 9/17/2019 9:35 PM, Hongxu Ma wrote:
>> My questions:
>>
>>    *   Is this error possible caused by "long gc pause"? my solr
>> zkClientTimeout=60000
>
> It's possible.  I can't say for sure that this is the issue, but it
> might be.

A followup.  I was thinking about the interactions here.  It looks like
Solr only waits four seconds for the leader election, and both of the
pauses you mentioned are longer than that.

Four seconds is probably too short a time to wait, and I do not think
that timeout is configurable anywhere.

> What version of Solr do you have, and what is your max heap?  The CMS
> garbage collection that Solr 5.0 and later incorporate by default is
> pretty good.  My G1 settings might do slightly better, but the
> improvement won't be dramatic unless your existing commandline has
> absolutely no gc tuning at all.

That question will be important.  If you already have our CMS GC tuning,
switching to G1 probably is not going to solve this.  Lowering the max
heap might be the only viable solution in that case, and depending on
what you're dealing with, it will either be impossible or it will
require more servers.

Thanks,
Shawn

Re: Question about "No registered leader" error

Erick Erickson
Check whether the oom killer script was called. If so, there will be
log files obviously relating to that. I've seen nodes mysteriously
disappear as a result of this with no message in the regular solr
logs. If that's the case, you need to increase your heap.
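
A quick way to run that check is sketched below. It assumes the stock bin/oom_solr.sh script, which to my knowledge writes files named solr_oom_killer-<port>-<timestamp>.log into the Solr logs directory. Both the filename pattern and the directory in the usage lines are assumptions to adjust for your install, and `find_oom_killer_logs` is a made-up helper name.

```python
import glob
import os


def find_oom_killer_logs(logs_dir, pattern="solr_oom_killer-*.log"):
    """Return the sorted paths in logs_dir that match the OOM-killer log pattern.

    The default pattern is an assumption based on the stock oom_solr.sh script.
    """
    return sorted(glob.glob(os.path.join(logs_dir, pattern)))


# Hypothetical logs directory; replace with your own.
for path in find_oom_killer_logs("/var/solr/logs"):
    print(path)
```

An empty result only means no file matched the assumed pattern in that directory, so it is worth double-checking the directory before ruling out an OOM kill.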

Erick


Re: Question about "No registered leader" error

interma
In reply to this post by Shawn Heisey-2
@Shawn @Erick Thanks for your kind help!

There is no OOM log, and I confirmed that no OOM occurred.

My ZK tickTime is set to 5000, so the maximum session timeout is 5000 * 20 = 100 s, which is above my 60 s setting. I also checked the Solr code: the 4000 ms leader wait time is a constant and is not configurable. (Why isn't it a configurable parameter?)

My Solr version is 7.3.1, with Xmx = 30000 MB (according to the Solr UI, peak memory is 22 GB).
I already use CMS GC tuning (the parameters differ slightly from those on your wiki page).

I will try the following advice:

  *   lower the heap size
  *   switch to G1 (with the same parameters as on the wiki)
  *   try restarting one Solr node when this error happens

Thanks again.
