Production Issue: SOLR node goes to non responsive , restart not helping at peak hours

Doss
Hi,

We are running a 3-node SolrCloud (Solr 7.0.1) setup with a 1-node ZooKeeper
ensemble. Each system has 16 CPUs, 90 GB RAM (14 GB heap), and 130 cores
(3 NRT replicas), with index sizes ranging from 700 MB to 20 GB.

autoCommit - once every 10 minutes
softCommit - once every 30 seconds

At peak time, if one shard goes into recovery mode, many other shards also go
into recovery within a few minutes, which creates a huge load (200+ load
average), and Solr becomes unresponsive. To fix this we restart the node, but
then the leader tries to correct the index by initiating replication, which
causes load again, and the node becomes unresponsive again.

As soon as a node starts, the replication process is initiated for all 130
cores. Is there any way we can control it, e.g. run it for one core after the other?

Thanks,
Doss.
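
A minimal sketch (not from the thread) of one way to watch how many replicas are recovering on each node before and after a restart. It only observes; it does not throttle or serialize recovery. It assumes the Collections API `CLUSTERSTATUS` response has its usual JSON layout (`cluster.collections.<name>.shards.<shard>.replicas.<replica>`, with `state` and `node_name` fields); the base URL and the helper names are hypothetical.

```python
import json
import urllib.request
from collections import Counter


def fetch_cluster_status(base_url):
    """Fetch CLUSTERSTATUS from the Collections API and return it as a dict."""
    url = base_url.rstrip("/") + "/admin/collections?action=CLUSTERSTATUS&wt=json"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def replica_states_by_node(status):
    """Count replica states (active, recovering, down, ...) per node."""
    counts = {}
    collections = status.get("cluster", {}).get("collections", {})
    for coll in collections.values():
        for shard in coll.get("shards", {}).values():
            for replica in shard.get("replicas", {}).values():
                node = replica.get("node_name", "unknown")
                state = replica.get("state", "unknown")
                counts.setdefault(node, Counter())[state] += 1
    return counts
```

For example, `replica_states_by_node(fetch_cluster_status("http://localhost:8983/solr"))` returns a mapping from node name to a counter of replica states; the host and port here are placeholders.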

Re: Production Issue: SOLR node goes to non responsive , restart not helping at peak hours

Jörn Franke
A 1-node ZooKeeper ensemble does not sound very healthy.

> On 05.09.2019 at 13:07, Doss <[hidden email]> wrote:
>
> Hi,
>
> We are using 3 node SOLR (7.0.1) cloud setup 1 node zookeeper ensemble.
> Each system has 16CPUs, 90GB RAM (14GB HEAP), 130 cores (3 replicas NRT)
> with index size ranging from 700MB to 20GB.
>
> autoCommit - 10 minutes once
> softCommit - 30 Sec Once
>
> At peak time if a shard goes to recovery mode many other shards also going
> to recovery mode in few minutes, which creates huge load (200+ load
> average) and SOLR becomes non responsive. To fix this we are restarting the
> node, again leader tries to correct the index by initiating replication,
> which causes load again, and the node goes to non responsive state.
>
> As soon as a node starts the replication process initiated for all 130
> cores, is there any we control it, like one after the other?
>
> Thanks,
> Doss.

Re: Production Issue: SOLR node goes to non responsive , restart not helping at peak hours

Erick Erickson
In reply to this post by Doss
If I'm reading this correctly, you have a huge amount of index in not much
memory. You only have 14 GB of heap allocated across 130 replicas, at least one
of which has a 20 GB index. You don't need as much memory as your aggregate
index size, but this system feels severely under-provisioned. I suspect
that's the root of your instability.

Best,
Erick

On Thu, Sep 5, 2019, 07:08 Doss <[hidden email]> wrote:

> Hi,
>
> We are using 3 node SOLR (7.0.1) cloud setup 1 node zookeeper ensemble.
> Each system has 16CPUs, 90GB RAM (14GB HEAP), 130 cores (3 replicas NRT)
> with index size ranging from 700MB to 20GB.
>
> autoCommit - 10 minutes once
> softCommit - 30 Sec Once
>
> At peak time if a shard goes to recovery mode many other shards also going
> to recovery mode in few minutes, which creates huge load (200+ load
> average) and SOLR becomes non responsive. To fix this we are restarting the
> node, again leader tries to correct the index by initiating replication,
> which causes load again, and the node goes to non responsive state.
>
> As soon as a node starts the replication process initiated for all 130
> cores, is there any we control it, like one after the other?
>
> Thanks,
> Doss.
>

Re: Production Issue: SOLR node goes to non responsive , restart not helping at peak hours

Doss
@Jörn We are adding a few more ZooKeeper nodes soon. Thanks.

@Erick, sorry, I couldn't understand this clearly. We have 90 GB RAM per node,
of which 14 GB is assigned to the heap. Do you mean we have to allocate
more heap, or do we need to add more physical RAM?

This system ran for 8 to 9 months without any major issues; only recently
have we been facing so many such incidents.

On Thu, Sep 5, 2019 at 5:20 PM Erick Erickson <[hidden email]>
wrote:

> If I'm reading this correctly, you have a huge amount of index in not much
> memory. You only have 14g allocated across 130 replicas, at least one of
> which has a 20g index. You don't need as much memory as your aggregate
> index size, but this system feels severely under provisioned. I suspect
> that's the root of your instability
>
> Best,
> Erick

Re: Production Issue: SOLR node goes to non responsive , restart not helping at peak hours

Erick Erickson
You say you have three nodes, 130 replicas and a replication factor of 3, so
you have 130 cores/node. At least one of those cores has a 20G index, right?

What is the sum of all the indexes on a single physical machine?

I think your system is under-provisioned and that you’ve been riding at the edge
of instability for quite some time, and have now added enough docs that
you finally reached a tipping point. But that’s largely speculation.

So adding more heap may help. But Real Soon Now you need to think about adding
more hardware and moving some of your replicas to that new hardware.

Again, this is speculation. But when systems are running with an _aggregate_
index size that is many multiples of the available memory (total physical memory),
it’s a red flag. I’m guessing a bit since I don’t know the aggregate for all replicas…

Best,
Erick

> On Sep 5, 2019, at 8:08 AM, Doss <[hidden email]> wrote:
>
> @Jorn We are adding few more zookeeper nodes soon. Thanks.
>
> @ Erick, sorry I couldn't understand it clearly, we have 90GB RAM per node,
> out of which 14 GB assigned for HEAP, you mean to say we have to allocate
> more HEAP? or we need add more Physical RAM?
>
> This system ran for 8 to 9 months without any major issues, in recent times
> only we are facing too many such incidents.


Re: Production Issue: SOLR node goes to non responsive , restart not helping at peak hours

Doss
Thanks, Erick, for the explanation. The sum of all our index sizes is about
138 GB; only 2 indexes are > 19 GB. Time to scale up :-). Adding new hardware
will take at least a couple of days. Until then, is there any option to
control the replication process?

Thanks,
Doss.
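
A rough worked version of the sizing argument above, as a sketch. It assumes the 138 GB figure is the index total held on one node, and that everything outside the 14 GB heap is available to the operating system's page cache. That overstates the cache, since the OS and other processes also need memory. The function name is illustrative.

```python
def page_cache_headroom_gb(physical_ram_gb, heap_gb, index_gb):
    """Return (RAM left for the OS page cache, fraction of the index it can hold)."""
    cache_gb = physical_ram_gb - heap_gb
    # Cap at 1.0: once the whole index fits, extra cache adds nothing here.
    return cache_gb, min(1.0, cache_gb / index_gb)


# Figures from the thread: 90 GB RAM, 14 GB heap, about 138 GB of index.
cache_gb, fraction = page_cache_headroom_gb(90, 14, 138)
print(f"{cache_gb} GB for page cache, about {fraction:.0%} of the index")
# prints "76 GB for page cache, about 55% of the index"
```

Under these assumptions, at most a little over half of the index can be cached at once, so recovery traffic for many cores competes with query traffic for the same memory and disk I/O.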

On Thu, Sep 5, 2019 at 6:12 PM Erick Erickson <[hidden email]>
wrote:

> You say you have three nodes, 130 replicas and a replication factor of 3, so
> you have 130 cores/node. At least one of those cores has a 20G index, right?
>
> What is the sum of all the indexes on a single physical machine?
>
> I think your system is under-provisioned and that you’ve been riding at the edge
> of instability for quite some time and have added enough more docs that
> you finally reached a tipping point. But that’s largely speculation.
>
> So adding more heap may help. But Real Soon Now you need to think about adding
> more hardware and moving some of your replicas to that new hardware.
>
> Again, this is speculation. But when systems are running with an _aggregate_
> index size that is many multiples of the available memory (total physical memory)
> it’s a red flag. I’m guessing a bit since I don’t know the aggregate for all replicas…
>
> Best,
> Erick

Re: Production Issue: SOLR node goes to non responsive , restart not helping at peak hours

Jack Schlederer
I'd defer to the committers if they have any further advice, but you might
have to suspend the autoAddReplicas trigger through the autoscaling API (
https://solr.stage.ecommerce.sandbox.directsupply-sandbox.cloud:8985/solr/ )
if you set up your collections with autoAddReplicas enabled. Then, the
system will not try to re-create missing replicas.

Just another note on your setup: it seems to me that using only 3 nodes
for 168 GB worth of indices isn't making the most of SolrCloud, which
provides the capability to shard indices across a large number of
nodes. As a data point to consider when sizing your cluster, my org
runs only about 50 GB of indices, but we spread them over 35
nodes with 8 GB of heap apiece, and each collection has 2+ shards.

--Jack

On Thu, Sep 5, 2019 at 8:47 AM Doss <[hidden email]> wrote:

> Thanks Eric for the explanation. Sum of all our index size is about 138 GB,
> only 2 indexes are > 19 GB, time to scale up :-). Adding new hardware will
> require at least couple of days, till that time is there any option to
> control the replication method?
>
> Thanks,
> Doss.

Re: Production Issue: SOLR node goes to non responsive , restart not helping at peak hours

Jack Schlederer
My mistake on the link, which should be this:
https://lucene.apache.org/solr/guide/7_1/solrcloud-autoscaling-auto-add-replicas.html#implementation-using-autoaddreplicas-trigger

--Jack
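
For reference, a sketch of the suspend call described on the linked guide page, written as a Python request builder. The `suspend-trigger` command and the `.auto_add_replicas` trigger name are taken from that guide, which documents Solr 7.1, so it may not apply to 7.0.1. The base URL and the function name are assumptions.

```python
import json
import urllib.request


def build_suspend_trigger_request(base_url, trigger_name=".auto_add_replicas"):
    """Build (but do not send) a POST that suspends an autoscaling trigger."""
    payload = {"suspend-trigger": {"name": trigger_name}}
    return urllib.request.Request(
        base_url.rstrip("/") + "/admin/autoscaling",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it would look like `urllib.request.urlopen(build_suspend_trigger_request("http://localhost:8983/solr"))`, with the host and port replaced by those of a live node. This only matters for collections created with autoAddReplicas enabled.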

On Thu, Sep 5, 2019 at 11:02 AM Jack Schlederer <[hidden email]>
wrote:

> I'd defer to the committers if they have any further advice, but you might
> have to suspend the autoAddReplicas trigger through the autoscaling API (
> https://solr.stage.ecommerce.sandbox.directsupply-sandbox.cloud:8985/solr/ )
> if you set up your collections with autoAddReplicas enabled. Then, the
> system will not try to re-create missing replicas.
>
> Just another note on your setup-- It seems to me like using only 3 nodes
> for 168 GB worth of indices isn't making the most of SolrCloud, which
> provides the capabilities for sharding indices across a high number of
> nodes. Just a data point for you to consider when considering your cluster
> sizing, my org is running only about 50GB of indices, but we run it over 35
> nodes with 8GB of heap apiece, each collection with 2+ shards.
>
> --Jack

Re: Production Issue: SOLR node goes to non responsive , restart not helping at peak hours

Doss
Dear Jack,

Thanks for your input. None of our cores were created with autoAddReplicas.
The problem we are facing is that, upon reboot, the leader tries to sync data
with the other nodes in the cluster.

Thanks,
Doss.

On Thu, Sep 5, 2019 at 9:46 PM Jack Schlederer <[hidden email]>
wrote:

> My mistake on the link, which should be this:
>
> https://lucene.apache.org/solr/guide/7_1/solrcloud-autoscaling-auto-add-replicas.html#implementation-using-autoaddreplicas-trigger
>
> --Jack

Re: Production Issue: SOLR node goes to non responsive , restart not helping at peak hours

Doss
In reply to this post by Jörn Franke
Jörn, we have added additional ZooKeeper nodes; it is now a 3-node quorum.

Do all nodes in a quorum send heartbeat requests to all cores and shards?

If ZooKeeper node 1 is unable to communicate with a shard and declares that
shard dead, can this state be changed by ZooKeeper node 2 if it gets
a successful response from that particular shard?
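
A small sketch of the majority arithmetic behind the move from one ZooKeeper node to three. It relies only on ZooKeeper's majority-quorum rule; the function names are illustrative. Note that, in general, ZooKeeper servers do not send heartbeats to Solr cores: each Solr node keeps a session with the ensemble, and the ensemble agrees on state as a whole rather than each server holding its own view of a shard.

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many servers can fail while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


for n in (1, 3, 5):
    print(f"{n} node(s): quorum {quorum_size(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
```

A single-node ensemble tolerates no failures, while three nodes tolerate one, which is why odd ensemble sizes of three or more are the usual recommendation.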

On Thu, Sep 5, 2019 at 4:53 PM Jörn Franke <[hidden email]> wrote:

> 1 Node zookeeper ensemble does not sound very healthy