Delayed/waiting requests

Delayed/waiting requests

Gael Jourdan-Weil
Hello,

We are experiencing performance issues on a simple SolrCloud cluster of 3 replicas (1 core), and what we found during our analysis seems a bit odd, so we thought the community might have relevant ideas on this.

Load: between 30 and 40 queries per second, constant over the period of analysis.

Symptoms: high response times over short periods, but quite frequently.
We are talking about response times going from 50ms to 5000ms or even worse for less than 5 seconds, then going back to normal.

What we found: just before the response time increases, requests seem to be delayed.
That is, for 2-3 seconds requests pile up and no responses are sent; then all requests are resolved and responses are returned to the clients at the same time.
Very much as if a lock were taken somewhere. But we found no "lock" time at either the JVM or system level.

Can anyone think of something in the way Solr works that could explain this?
Or ideas to track down the root cause?

Solr version is 7.2.1.

Thanks for reading,
 
Gaël Jourdan-Weil    

Re: Delayed/waiting requests

Hullegård, Jimi
Could be caused by garbage collection in the JVM.

https://wiki.apache.org/solr/SolrPerformanceProblems

Go down to the section called “GC pause problems”.
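
If GC logging isn't already enabled (recent Solr versions write a solr_gc.log under server/logs out of the box, as far as I know), something along these lines in solr.in.sh turns it on under Java 8. Treat the flags and path as a sketch:

    GC_LOG_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:/path/to/solr_gc.log"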

/Jimi

Sent from my iPhone


RE: Delayed/waiting requests

Gael Jourdan-Weil
Interesting indeed; we did not see anything with VisualVM, but having a look at the GC logs could give us more info, especially on the pauses.

I will collect data over the weekend and look at it.

Thanks


Re: Delayed/waiting requests

Erick Erickson
Jimi's comment is one of the very common culprits.

Autowarming is another. Are you indexing at the same time? If so, it could well be that you aren't autowarming, and the spikes are caused by using a new IndexSearcher that has to read much of the index off disk when commits happen. The "smoking gun" here would be the spikes correlating with your commits (soft, or hard with openSearcher=true).
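
A crude way to eyeball that correlation, assuming default log locations and log wording (both vary a bit by version and install):

    # timestamps of commits vs. searcher openings in the Solr log
    grep -E "start commit|Opening.*[Ss]earcher" /var/solr/logs/solr.log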

Best,
Erick


RE: Delayed/waiting requests

Gael Jourdan-Weil
I had a look at the GC logs this morning, but I'm not sure how to interpret them.

Over a period of 54 minutes, there are:

- Number of pauses: 2739
- Accumulated pauses: 93s, i.e. 2.86% of the time
- Average pause duration: 0.03s
- Average pause interval: 1.18s
- Accumulated full GCs: 0

I'm not sure whether this is a lot. What do you think?

Looking more closely at the GC logs with GCViewer, I notice that the high response time peaks happen at the same time as GC pauses that take about twice the average (around 0.06s).
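
(In case anyone wants to reproduce this without GCViewer: assuming -XX:+PrintGCApplicationStoppedTime is enabled, which I believe Solr's default GC logging includes, the pauses can be pulled straight from the Java 8 GC log:)

    # print the 5 longest stop-the-world pauses, in seconds
    grep -o "stopped: [0-9.]*" solr_gc.log | awk '{print $2}' | sort -n | tail -5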


Also, we are indeed indexing at the same time, but we have autowarming set.

I don't see any searcher opened at the times we experience slowness.

Nevertheless, our filterCache is set to autowarm 12k entries, which is also its maxSize.

Could this have any downside?


Thanks,

Gaël



Re: Delayed/waiting requests

Erick Erickson
Gael:

bq. Nevertheless, our filterCache is set to autowarm 12k entries, which is also its maxSize

That is far, far, far too many. Let's assume you actually have 12K entries in the filterCache. Every time you open a new searcher, 12K queries are executed _before_ the searcher accepts any new requests. While being able to re-use a filterCache entry is useful, one of the primary purposes is to pre-load index data from disk into memory, which can be the thing that takes the most time.

The queryResultCache has a similar function. I often find that this cache doesn't have a very high hit ratio, but again, executing a _few_ of these queries warms the index from disk.

I think of both caches as a map, where the key is the "thing" (the fq clause in the case of the filterCache, the whole query in the case of the queryResultCache). Autowarming replays the most recently executed N of these entries, essentially just as though they were submitted by a user.

Hypothesis: you're massively over-warming, and when that kicks in you're seeing increased CPU and GC pressure, leading to the anomalies you're seeing. Further, you have so much autowarming going on that it's hard to see the associated messages in the log.

Here's what I'd recommend: set your autowarm counts to something on the order of 16. If the culprit is just excessive autowarming, I'd expect your spikes to be much less severe. It _might_ be that your users see some increased (very temporary) variance in response time. Autowarming configuration is "more art than science"; I can't give you any recommendation other than "start small and increase until you're happy", unfortunately.
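
For illustration, this is the kind of solrconfig.xml entry I mean; the sizes here are placeholders to start from, not tuned values:

    <filterCache class="solr.FastLRUCache"
                 size="512"
                 initialSize="512"
                 autowarmCount="16"/>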

I usually do this with some kind of load tester in a dev lab of course ;).

Finally, if you use the metrics data (see https://lucene.apache.org/solr/guide/7_1/metrics-reporting.html) you can see the autowarm times. Don't get too lost in that page to start; just hit the http://localhost:8983/solr/admin/metrics endpoint and look for "warmupTime", then refine on how to get _only_ the warmup stats ;).
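
Something like this should narrow it down; the group/prefix parameters are from memory, so check them against the metrics page above:

    curl "http://localhost:8983/solr/admin/metrics?group=core&prefix=CACHE.searcher.filterCache"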

Best,
Erick


RE: Delayed/waiting requests

Gael Jourdan-Weil
Hi Erick,


Thank you for your detailed answer; I understand autowarming better now.

We have an autowarming time of ~10s for the filterCache (the queryResultCache is not used at all; hit ratio = 0.02).

We increased the filterCache size from 6k to 12k (with the autowarm size set to the same value) to get a better hit ratio, which is still _only_ around 0.85-0.90.

The thing I don't understand: I should see "Opening new searcher" in the logs every time a new searcher is opened, and thus every time autowarming happens, right?

But I don't see "Opening new searcher" very often, and I don't see it correlated with the response time peaks.
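
(For reference, this is roughly what I'm grepping for; the path is specific to our install:)

    grep -i "opening.*searcher" /var/solr/logs/solr.log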


Also, I didn't mention it earlier, but we have other SolrCloud clusters with similar settings and load (~10s filterCache autowarming, 10k entries) where we don't observe the same behavior.


Regards,


Re: Delayed/waiting requests

Erick Erickson
Well, it was a nice theory anyway.

"Other collections with the same settings" doesn't really mean much unless those other collections are very similar, especially in terms of numbers of docs.

You should only see a new searcher opening when you do a hard commit with openSearcher=true or a soft commit.

So what happens when you just try lowering the autowarm count? I'm assuming you're free to test in some non-prod system.

Focusing on the hit ratio is something of a red herring. Remember that each entry in your filterCache is roughly maxDoc/8 bytes plus a little overhead; the increase in GC pressure has to be balanced against getting the hits from the cache.
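
Back-of-the-envelope with a hypothetical 10M-doc index, just for scale (sparse entries can be stored more compactly, so this is the worst case):

    10,000,000 maxDoc / 8 = 1.25 MB per filterCache entry
    12,000 entries x 1.25 MB = 15 GB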

Now, all that said: if there's no correlation, then you need to put a profiler on the system when you see this kind of thing and find out where the hotspots are; otherwise it's guesswork and I'm out of ideas.

Best,
Erick


Re: Re: Delayed/waiting requests

Branham, Jeremy (Experis)
Hi Gael –

Could you share this information?
- Size of the index
- Server memory available
- Server CPU count
- JVM memory settings

You mentioned a cloud configuration of 3 replicas.
Does that mean you have 1 shard with a replication factor of 3?
Do the pauses occur on all 3 servers?
Is the traffic evenly balanced across those servers?

 
Jeremy Branham
[hidden email]



RE: Re: Delayed/waiting requests

Gael Jourdan-Weil
@Erick:

We will try to lower the autowarm count and run some tests to compare.

If I get your point, a big cache might cause more trouble than it helps if the hit ratio is not high enough, because the cache is constantly evicting/inserting entries?

@Jeremy:

- Index size: ~20G and ~14M documents
- Server memory available: 256G, of which ~30G used and ~100G system cache
- Server CPU count: 32, ~10% usage
- JVM memory settings: -Xms12G -Xmx12G

We have 3 servers and 3 clusters of 3 Solr instances; that is, each server hosts 1 Solr instance for each cluster. And, indeed, each cluster has only 1 shard, with replication factor 3.

Among all these Solr instances, the pauses are observed on only one cluster, but on every server of that cluster, at different times (sometimes on all servers at the same time, but that is very rare).

We do observe that traffic is evenly balanced across the 3 servers, around 30-40 queries per second sent to each server.



Regards,

Gaël



Re: Re: Delayed/waiting requests

Erick Erickson
bq. a big cache might cause more trouble than it helps if the hit ratio is not high enough, because the cache is constantly evicting/inserting entries?

Pretty much, although there are nuances.

Right now, you have a 12K autowarm count. That means your cache will eventually always contain 12K entries, whether or not you ever use the last 11K! I'm simplifying a bit, but it grows like this.

Let's say I start Solr. Initially it has no cache entries. Now I start both querying and indexing. For simplicity, say 100 _new_ fq clauses come in between each commit. The first commit will autowarm 100, the next 200, then 300, etc. Eventually this grows to 12K. So your performance will vary depending on how long Solr has been running.

Worse, it's not clear that you _ever_ re-use those clauses. One example:
fq=date_field:[* TO NOW]
NOW is really a Unix timestamp, so issuing the same fq 1 millisecond after the first one will not re-use the entry. In the worst case, almost all of your autowarming is useless: it neither loads relevant index data into RAM nor is reusable.
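
A common mitigation is date-math rounding; the field name here is made up:

    fq=last_modified:[* TO NOW]         <- new cache entry on every request
    fq=last_modified:[* TO NOW/MINUTE]  <- re-usable for a minute at a time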

Even if you use "date math" to round to, say, a minute, if you run Solr long enough you'll still fill up with useless fq clauses.

Best,
Erick

On Tue, Jan 15, 2019 at 9:33 AM Gael Jourdan-Weil
<[hidden email]> wrote:

>
> @Erick:
>
>
> We will try to lower the autowarm and run some tests to compare.
>
> If I get your point, having a big cache might cause more troubles than help if the cache hit ratio is not high enough because the cache is constantly evicting/inserting entries?
>
>
>
> @Jeremy:
>
>
> Index size: ~20G and ~14M documents
>
> Server memory available: 256G from which ~30G used and ~100G system cache
>
> Server CPU count: 32, ~10% usage
>
> JVM memory settings: -Xms12G -Xmx12G
>
>
> We have 3 servers and 3 clusters of 3 Solr instances.
>
> That is each server hosts 1 Solr instance for each cluster.
>
> And, indeed, each cluster only has 1 shard with replication factor 3.
>
>
> Among all these Solr instances, the pauses are observed on only one single cluster but on every server at different times (sometimes on all servers at the same time but I would say it's very rare).
>
> We do observe the traffic is evenly balanced across the 3 servers, around 30-40 queries per second sent to each server.
>
>
>
> Regards,
>
> Gaël

Re: Delayed/waiting requests

Shawn Heisey-2
In reply to this post by Gael Jourdan-Weil
On 1/15/2019 10:33 AM, Gael Jourdan-Weil wrote:
> Index size: ~20G and ~14M documents
>
> Server memory available: 256G from which ~30G used and ~100G system cache
>
> Server CPU count: 32, ~10% usage
>
> JVM memory settings: -Xms12G -Xmx12G

Can you create a process listing screenshot as described at this URL?
You'll need to use a file sharing website to provide us with a URL to
access the file.  When done properly, the screenshot provides a lot of
useful information.

https://wiki.apache.org/solr/SolrPerformanceProblems#Asking_for_help_on_a_memory.2Fperformance_issue

It would be best if the screenshot is gathered when you're experiencing
the problem.
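
In short, assuming a Linux host (a sketch of the steps that page
describes, not a quote of it):

top          # on the Solr host, while the problem is occurring
             # press shift-M to sort processes by resident memory,
             # then screenshot the entire terminal window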

Thanks,
Shawn


RE: Re: Delayed/waiting requests

Gael Jourdan-Weil
In reply to this post by Erick Erickson
Ok, I get your point.


Do you know if there is a tool to easily view filterCache content?

I know we can see the top entries in the API or the UI but could we see more?


Regards,

Gaël


Re: Re: Delayed/waiting requests

Erick Erickson
I don't know of any tools to inspect the cache. Under the covers,
these are things like Java's ConcurrentHashMap which don't, for
instance, carry along information like last access time IIUC.

I usually have to cull the Solr logs and eyeball the fq clauses to see
if anything jumps out. If you do find any such patterns, you can
always add {!cache=false} to those clauses so they don't use up cache
entries.
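
One way to do that culling, assuming the stock request logging where
the query parameters land on a single line of solr.log (a sketch;
adjust the pattern and path to your setup):

grep -o 'fq=[^&}]*' solr.log | sort | uniq -c | sort -rn | head -20

Clauses that only ever show up once are the candidates for
{!cache=false} or for date-math rounding.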

Best,
Erick

On Wed, Jan 16, 2019 at 7:53 AM Gael Jourdan-Weil
<[hidden email]> wrote:

>
> Ok, I get your point.
>
>
> Do you know if there is a tool to easily view filterCache content?
>
> I know we can see the top entries in the API or the UI but could we see more?
>
>
> Regards,
>
> Gaël

RE: Re: Delayed/waiting requests

Markus Jelsma-2
In reply to this post by Gael Jourdan-Weil
Hello,

There is an essentially undocumented parameter for displaying the cache's contents: set showItems="100" on the filterCache.
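
For example, on the cache declaration in solrconfig.xml (a sketch; the
size and autowarmCount are just the figures already discussed in this
thread):

<filterCache class="solr.FastLRUCache"
             size="12288"
             autowarmCount="16"
             showItems="100"/>

The listed entries should then show up in the cache's stats (the same
data the admin UI and the metrics endpoint expose).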

Regards,
Markus

-----Original message-----

> From:Erick Erickson <[hidden email]>
> Sent: Wednesday 16th January 2019 17:40
> To: solr-user <[hidden email]>
> Subject: Re: Re: Delayed/waiting requests
>
> I don't know of any tools to inspect the cache. Under the covers,
> these are things like Java's ConcurrentHashMap which don't, for
> instance, carry along information like last access time IIUC.
>
> I usually have to cull the Solr logs and eyeball the fq clauses to see
> if anything jumps out. If you do find any such patterns, you can
> always add {!cache=false} to those clauses to not use up cache
> entries....
>
> Best,
> Erick
>
> On Wed, Jan 16, 2019 at 7:53 AM Gael Jourdan-Weil
> <[hidden email]> wrote:
> >
> > Ok, I get your point.
> >
> >
> > Do you know if there is a tool to easily view filterCache content?
> >
> > I know we can see the top entries in the API or the UI but could we see more?
> >
> >
> > Regards,
> >
> > Gaël