Solr 8.2 replicas use only 1 CPU at 100% every solr.autoCommit.maxTime minutes

Vangelis Katsikaros
Hi all

We run Solr 8.2.0
* with Amazon Corretto 11.0.5.10.1 SDK (java arguments shown in [1]),
* on Ubuntu 18.04
* on AWS EC2 m5.2xlarge with 8 CPUs and 32GB of RAM
* with -Xmx16g [1].

We have migrated from Solr 3.5, and on replicas with a big core (16 GB) we have
started to suffer degraded service. The replica's ReplicationHandler config is
in [8] and the master's updateHandler in [9].

Every 5 minutes (the value of solr.autoCommit.maxTime) we notice the
following:
* Solr normally uses all 8 CPUs. Suddenly, for ~30 sec, it uses only 1 CPU at
100% while the rest of the CPUs sit idle (mpstat [6]). In our previous Solr 3
setup we used up to 80% of all CPUs.
* During that time, Solr queries suddenly take more than 1 second, up to
30 sec (or more). The same queries otherwise complete in less than 1 sec.
* The disk does not seem to be a bottleneck (iostat [4]).
* Memory does not seem to be a bottleneck (vmstat [5]).
* CPU (apart from the single CPU issue) does not seem to be a bottleneck
(mpstat [6] & pidstat [3]).
* We are no Java/GC experts, but it does not seem to be GC related [7].

We have tried reducing the heap to 8 GB and to 2 GB with no positive effect.
We have also tested different autoCommit.maxTime values: reducing it to 60
seconds makes things unbearable, and 5 minutes is not significantly different
from 10. Do you have any pointers on how to proceed debugging this issue?
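
For reference, the autoCommit block we are tuning on the master has roughly
the following shape (a sketch only; the exact updateHandler config is in [9],
and the 300000 ms default shown here just illustrates the 5-minute setting):

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <!-- hard commit every solr.autoCommit.maxTime ms (5 min in our case) -->
      <maxTime>${solr.autoCommit.maxTime:300000}</maxTime>
    </autoCommit>
  </updateHandler>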

A detailed example of the problem, which repeats every solr.autoCommit.maxTime
minutes on the replicas:
* From 12:36 to 12:39:04 queries are served fast [2]. Solr consumes CPU on
all 8 CPUs (mpstat [6]). The metric solr.jvm.threads.blocked.count is 0 [2].
* From 12:39:04 to 12:39:25 queries respond slowly [2]. Solr consumes only 1
of the 8 CPUs while the other 7 are idle (mpstat [6]). The metric
solr.jvm.threads.blocked.count grows from 0 to a high two-digit number [2].
* After 12:39:25, and until the replica's next replication poll picks up a
commit, things are normal (the poll configuration is sketched below).
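
For completeness, the replica's replication config has roughly this shape (a
sketch; the master URL here is illustrative and our exact ReplicationHandler
settings are in [8]):

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="slave">
      <!-- master host below is illustrative -->
      <str name="masterUrl">http://solr-master:8983/solr/core</str>
      <!-- how often the replica polls the master for new commits -->
      <str name="pollInterval">00:05:00</str>
    </lst>
  </requestHandler>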

Regards
Vangelis

[1]
https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-solr-info
[2]
https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-slow-queries-and-solr-jvm-threads-blocked-count
[3]
https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-pidstat
[4]
https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-iostat
[5]
https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-vmstat
[6]
https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-mpstat
[7]
https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-gc-logs
[8]
https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-replica-replicationhandler
[9]
https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-master-updatehandler

Re: Solr 8.2 replicas use only 1 CPU at 100% every solr.autoCommit.maxTime minutes

Vangelis Katsikaros
Hi

On Mon, Feb 10, 2020 at 5:05 PM Vangelis Katsikaros <[hidden email]>
wrote:

> [...]

Some additional information. We noticed (through the admin UI's "Thread Dump"
page, /solr/#/~threads) that whenever we see this behavior, all the blocked
threads show the same stacktrace [10] and block at:

org.apache.solr.search.function.FileFloatSource$Cache.get(FileFloatSource.java:198)
org.apache.solr.search.function.FileFloatSource.getCachedFloats(FileFloatSource.java:152)
org.apache.solr.search.function.FileFloatSource.getValues(FileFloatSource.java:95)
org.apache.lucene.queries.function.valuesource.MultiFloatFunction.getValues(MultiFloatFunction.java:76)
org.apache.lucene.queries.function.ValueSource$WrappedDoubleValuesSource.getValues(ValueSource.java:203)
org.apache.lucene.queries.function.FunctionScoreQuery$MultiplicativeBoostValuesSource.getValues(FunctionScoreQuery.java:255)
org.apache.lucene.queries.function.FunctionScoreQuery$FunctionScoreWeight.scorer(FunctionScoreQuery.java:218)
...

The boost files (external_boostvalue) are ~30 MB, and the field is configured
in the schema [11] with:
  <field name="boostvalue" type="fileboost"/>
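
For context, the fileboost type is an ExternalFileField; its exact definition
is in [11], but it is roughly of this shape, with the external_boostvalue file
living in the core's data directory:

  <!-- keyField and defVal below are placeholders, not our real values -->
  <fieldType name="fileboost" class="solr.ExternalFileField"
             keyField="id" defVal="0"/>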

Regards
Vangelis

[10]
https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-stacktrace
[11]
https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-schema-boostfile

Re: Solr 8.2 replicas use only 1 CPU at 100% every solr.autoCommit.maxTime minutes

Edward Ribeiro
Is your autoCommit configured to open new searchers? Did you try to set
openSearcher to false?
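
If it does open a new searcher on every hard commit, something along these
lines might be worth a try (just a sketch; keep whatever maxTime you already
use):

  <autoCommit>
    <maxTime>${solr.autoCommit.maxTime:300000}</maxTime>
    <!-- hard commit without opening a new searcher -->
    <openSearcher>false</openSearcher>
  </autoCommit>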

Edward

On Tue, Feb 11, 2020 at 3:40 PM Vangelis Katsikaros <[hidden email]>
wrote:

> [...]