SolrCloud one server with high load

Gael Jourdan-Weil
Hello,

I come again to the community for some ideas regarding a performance issue we are having.

We have a SolrCloud cluster of 3 servers.
Each server hosts 1 replica of 2 collections.
There is no sharding, every server hosts the whole collection.

Requests are evenly distributed by a Varnish system.

During some peaks of requests, we see one server of the cluster having very high load while the two others are totally fine.
The server experiencing this high load is always the same until we reboot it and the behavior moves to another server.
The server experiencing the issue is not necessarily the leader.
All servers receive the same number of requests per seconds.

Load data:
- Server1: 5% CPU when low QPS, 90% CPU when high QPS (this one having issues)
- Server2: 5% CPU when low QPS, 25% CPU when high QPS
- Server3: 5% CPU when low QPS, 20% CPU when high QPS

What could explain this behavior in SolrCloud mechanisms?

Thank you for reading,

Gaël Jourdan-Weil

Re: SolrCloud one server with high load

kamaci
Hi Gaël,

Do all three servers have the same specifications? Also, is your Varnish
load balancing configured as round-robin?

Kind Regards,
Furkan KAMACI


RE: SolrCloud one server with high load

Gael Jourdan-Weil
Hello Furkan,

Yes, the 3 servers have exactly the same configuration.

Varnish load balancing is indeed round-robin.
We monitor the number of requests per second, and we do see that the 3 servers receive the same number of requests.

Kind Regards,
Gaël


Re: SolrCloud one server with high load

Erick Erickson
What version of Solr? There are some anecdotal reports of abnormal CPU loads on very recent Solr versions.

Is the server with the high load the “Overseer”? In the admin UI>>SolrCloud>>tree you can see which node is the Overseer. This is really a shot in the dark, as unless you are doing a lot of collection maintenance operations, the Overseer shouldn’t be doing much really.

There is _one_ Overseer per cluster and it’s in charge of coordinating changes to ZooKeeper.

If there’s a correlation there, it’d be great to know. It’s possible to move the Overseer to a different node, one that’s running Solr but not necessarily hosting any replicas. This isn’t a permanent solution, but would help isolate the issue.

First, let’s see if the hot node is always the Overseer...
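
As a complement to looking in the admin UI tree, the check can be scripted. The following is a minimal sketch, assuming the Collections API `OVERSEERSTATUS` action is available and reports the current Overseer in a top-level `leader` field; the base URL and the node names such as `server1:8983_solr` are placeholders, not values from this thread.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def overseer_status_url(base_url: str) -> str:
    """Build the Collections API URL for the OVERSEERSTATUS action."""
    query = urlencode({"action": "OVERSEERSTATUS", "wt": "json"})
    return f"{base_url.rstrip('/')}/admin/collections?{query}"


def overseer_node(status: dict) -> str:
    """Return the node name found in the 'leader' field of the response.

    Assumption: the parsed JSON response carries the Overseer's node name
    under the top-level key 'leader'.
    """
    return status["leader"]


def is_overseer(hot_node: str, status: dict) -> bool:
    """True if the node with the high load is the current Overseer."""
    return overseer_node(status) == hot_node


def fetch_overseer_node(base_url: str) -> str:
    """Query a live Solr node (network call) and return the Overseer node."""
    with urlopen(overseer_status_url(base_url)) as response:
        return overseer_node(json.load(response))
```

Calling `fetch_overseer_node("http://server1:8983/solr")` each time the load spike occurs, and comparing the result with the hot node, would show whether there is any correlation.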

Best,
Erick


RE: SolrCloud one server with high load

Gael Jourdan-Weil
Hi Erick,

We are running Solr 7.6.0.
We recently upgraded from 7.2.1, but we already had these issues with Solr 7.2.1.

Is the overseer different from the leader?
In the Solr Admin UI > SolrCloud > Tree > overseer > leader file, I can see that the machine listed as leader is not the one having issues right now.

Kind Regards,
Gaël


Re: SolrCloud one server with high load

ssedume
Reinstall it, if the hardware is the same.

If I have seen further its by standing on shoulders of giants

RE: SolrCloud one server with high load

Gael Jourdan-Weil
Thanks for this information; I didn't know about the notion of Overseer yet.

When I say "high QPS", it's only queries. No documents are being indexed at the time of the issue.

I have two thread dumps taken at a time when we were having the issue:
- one on the server having the issue (CPU~90%): https://pastebin.com/NeeSXj9B
- one on another server not having issues (CPU ~20%): https://pastebin.com/vgExMf4s
Neither of those 2 servers is the Overseer or the collection leader.

Apart from the fact that the server having issues has a lot more threads in RUNNABLE state while the other has its threads in TIMED_WAITING, I don't see anything notable, but I'm not experienced at reading these dumps.
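
The RUNNABLE versus TIMED_WAITING comparison can be made more systematic by counting thread states in each dump. This is a minimal sketch, assuming the dumps are in the usual HotSpot `jstack` text format, where each thread has a `java.lang.Thread.State: ...` line; the function names are illustrative only.

```python
import re
from collections import Counter

# Matches lines such as "   java.lang.Thread.State: TIMED_WAITING (parking)"
STATE_LINE = re.compile(r"^\s*java\.lang\.Thread\.State:\s+([A-Z_]+)", re.MULTILINE)


def count_thread_states(dump_text: str) -> Counter:
    """Count how many threads are in each state in one thread dump."""
    return Counter(STATE_LINE.findall(dump_text))


def compare_dumps(hot_dump: str, calm_dump: str) -> dict:
    """Map each state to a (hot server count, calm server count) pair."""
    hot_counts = count_thread_states(hot_dump)
    calm_counts = count_thread_states(calm_dump)
    states = sorted(set(hot_counts) | set(calm_counts))
    # Counter returns 0 for states that are absent from one of the dumps.
    return {state: (hot_counts[state], calm_counts[state]) for state in states}
```

Taking several dumps a few seconds apart on the hot server and checking which stack frames stay RUNNABLE across all of them is usually more telling than a single snapshot.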

I had a look with a profiler on the server having issues.
During the high CPU period, if I sort the methods by CPU self time, I get:
- org.apache.solr.search.BitDocSet.andNot - 12%
- sun.nio.ch.ServerSocketChannelImpl.accept - 8.5%
- org.apache.solr.uninverting.FieldCacheImpl$LongsFromArray$1.longValue - 5.9%
- org.apache.lucene.util.FixedBitSet.clone - 5.9%
- java.util.HashSet.add - 5.4%
- org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.<init> - 4.1%
- org.apache.lucene.search.DisjunctionDISIApproximation.advance - 3.6%
- org.apache.lucene.codecs.lucene50.ForUtil.readBlock - 2.6%

I'll open a JIRA if you think it could be useful but I don't want to pollute JIRA with issues not fully qualified yet.

Kind Regards,
Gaël




________________________________
From: Erick Erickson <[hidden email]>
Sent: Monday, March 4, 2019 18:53
To: Gael Jourdan-Weil
Subject: Re: SolrCloud one server with high load

Yes, the Overseer is different from the leader.

There is one leader _per shard_, and its job is to coordinate updates to the index for that shard.

The Overseer coordinates the updates to ZooKeeper, and there is one (and only one) per cluster, and a cluster can contain many collections and each collection can have many shards. So there may be an unbounded number of leaders….
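
To make the distinction concrete, the shard leaders can be listed from the cluster state and compared with the Overseer node. This is a minimal sketch, assuming the input is the parsed JSON of a Collections API `CLUSTERSTATUS` response, in which each replica entry has a `node_name` and the leader replica carries `"leader": "true"`; the collection and node names in the usage example are made up.

```python
def shard_leaders(cluster_status: dict) -> dict:
    """Map (collection, shard) to the node name hosting that shard's leader.

    Assumption: `cluster_status` follows the CLUSTERSTATUS layout
    cluster -> collections -> <name> -> shards -> <shard> -> replicas.
    """
    leaders = {}
    collections = cluster_status["cluster"]["collections"]
    for collection_name, collection in collections.items():
        for shard_name, shard in collection["shards"].items():
            for replica in shard["replicas"].values():
                # The leader flag is reported as the string "true".
                if replica.get("leader") == "true":
                    leaders[(collection_name, shard_name)] = replica["node_name"]
    return leaders
```

With 2 collections of 1 shard each, as in the cluster described in this thread, this yields 2 leaders, which may sit on different nodes and on a different node than the single Overseer.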

Right, Solr 7.6 is one of the versions that has some anecdotal comments like this. If at all possible, could you take a thread dump of a node when it’s having this problem? Or, even better, put a profiler on it? It’d be invaluable to see where the time was being spent. If you can do either of those things, please open a JIRA and attach the output.

If you do raise a JIRA, please include the information that the Overseer isn’t the one having the problem.

One other bit of info that’d be useful is whether when you say “high QPS”, is it all queries or are you adding documents too?

Best,
Erick
