Periodically 100% cpu and high load/IO


Periodically 100% cpu and high load/IO

Marvin Bredal Lillehaug
Hi,
We have a cluster with five Solr (8.5.1, Java 11) nodes, and sometimes one
or two nodes have Solr running at 100% CPU on all cores, with a load average
over 400 and high IO. It usually lasts five to ten minutes, and the node is
barely responsive.
Does anyone have experience with this kind of behaviour? Is there any
logging other than infoStream that could provide more information?
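
For context, infoStream is Lucene's IndexWriter logging, which Solr can turn
on per core in solrconfig.xml. A minimal sketch, assuming the stock
indexConfig section (the output is very verbose and goes to the regular Solr
log, so it is normally enabled only while investigating):

  <!-- solrconfig.xml: sketch for enabling Lucene IndexWriter infoStream
       logging; set it back to false once the investigation is done. -->
  <indexConfig>
    <infoStream>true</infoStream>
  </indexConfig>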

We managed to capture a thread dump:

> java.base@11.0.6/java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:112)
> org.apache.lucene.util.IOUtils.fsync(IOUtils.java:483)
> org.apache.lucene.store.FSDirectory.fsync(FSDirectory.java:331)
> org.apache.lucene.store.FSDirectory.sync(FSDirectory.java:286)
> org.apache.lucene.store.NRTCachingDirectory.sync(NRTCachingDirectory.java:158)
> org.apache.lucene.store.LockValidatingDirectoryWrapper.sync(LockValidatingDirectoryWrapper.java:68)
> org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4805)
> org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3277)
> org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3445)
> org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3410)
> org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:678)
> org.apache.solr.cloud.RecoveryStrategy.doSyncOrReplicateRecovery(RecoveryStrategy.java:636)
> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:337)
> org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:318)


We are not sure whether this is from the incident itself or from just after
it. It seems strange that an fsync should behave like this.

Swappiness is set to the default for RHEL 7 (Ops have resisted turning it off).

--
Kind regards,
Marvin B. Lillehaug

Re: Periodically 100% cpu and high load/IO

Erick Erickson
One possibility is merging index segments. When this happens, are you actively indexing? And are these NRT replicas or TLOG/PULL? If the latter, are your TLOG leaders on the affected machines?

Best,
Erick

> On Jun 3, 2020, at 3:57 AM, Marvin Bredal Lillehaug <[hidden email]> wrote:


Re: Periodically 100% cpu and high load/IO

Marvin Bredal Lillehaug
Yes, there is light to moderate indexing most of the time.
The setup uses NRT replicas, and the shards are around 45GB each.
Index merging has been the hypothesis for some time, but we haven't dared to
enable infoStream logging.

On Wed, Jun 3, 2020 at 2:34 PM Erick Erickson <[hidden email]> wrote:


--
Kind regards,
Marvin B. Lillehaug

Re: Periodically 100% cpu and high load/IO

Phill Campbell
Can you switch to 8.5.2 and see if it still happens?
In my testing of 8.5.1 I had one of my machines get really hot and bring the
entire system to a crawl. What seemed to cause my issue was memory usage: if
I gave the JVM running Solr less heap, the problem wouldn't manifest.
I haven't seen it with 8.5.2. Just a thought.

> On Jun 3, 2020, at 8:27 AM, Marvin Bredal Lillehaug <[hidden email]> wrote:


Re: Periodically 100% cpu and high load/IO

Marvin Bredal Lillehaug
We have an upgrade to 8.5.2 on its way to production, so we'll see.

We are running with the default merge configuration, and based on the description at
https://lucene.apache.org/solr/guide/8_5/taking-solr-to-production.html#dynamic-defaults-for-concurrentmergescheduler
I don't understand why all CPUs are maxed out.
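
For what it's worth, those dynamic defaults can be overridden by configuring
the merge scheduler explicitly in solrconfig.xml. A minimal sketch, with
placeholder values rather than tuned recommendations:

  <!-- solrconfig.xml: sketch of pinning ConcurrentMergeScheduler settings
       instead of relying on the dynamic defaults; the numbers are
       illustrative only. -->
  <indexConfig>
    <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
      <int name="maxMergeCount">6</int>  <!-- pending merges allowed before indexing threads stall -->
      <int name="maxThreadCount">1</int> <!-- concurrent merge threads per index -->
    </mergeScheduler>
  </indexConfig>

One thing to keep in mind is that these limits apply per core, so a node
hosting several actively merging cores can still use more CPU than a single
scheduler's settings would suggest.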


On Sun, 7 Jun 2020, 16:59 Phill Campbell, <[hidden email]> wrote:


Re: Periodically 100% cpu and high load/IO

Marvin Bredal Lillehaug
This may have been a burst of simultaneous queries rather than index merging.

We managed to get more thread dumps, and they all have several threads like:

> "qtp385739920-1050156" #1050156 prio=5 os_prio=0 cpu=38864.53ms
> elapsed=7414.80s tid=0x00007f62080eb000 nid=0x5f8a runnable
>  [0x00007f6117480000]
>    java.lang.Thread.State: RUNNABLE
> at
> org.apache.lucene.codecs.lucene80.IndexedDISI.advance(IndexedDISI.java:384)
> at
> org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$SparseNumericDocValues.advance(Lucene80DocValuesProducer.java:434)
> at org.apache.lucene.search.ReqExclScorer$1.matches(ReqExclScorer.java:154)
> at
> org.apache.lucene.search.DisjunctionScorer$TwoPhase.matches(DisjunctionScorer.java:159)
> at
> org.apache.lucene.search.ConjunctionDISI$ConjunctionTwoPhaseIterator.matches(ConjunctionDISI.java:345)
> at
> org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:274)
> at org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:218)
> at org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
> at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:661)
> at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:445)
> at
> org.apache.solr.search.DocSetUtil.createDocSetGeneric(DocSetUtil.java:151)
> at org.apache.solr.search.DocSetUtil.createDocSet(DocSetUtil.java:140)
> at
> org.apache.solr.search.SolrIndexSearcher.getDocSetNC(SolrIndexSearcher.java:1191)
> at
> org.apache.solr.search.SolrIndexSearcher.getPositiveDocSet(SolrIndexSearcher.java:831)
> at
> org.apache.solr.search.SolrIndexSearcher.getProcessedFilter(SolrIndexSearcher.java:1039)
> at
> org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1554)
> at
> org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1434)
> at
> org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:581)
> at
> org.apache.solr.handler.component.QueryComponent.doProcessUngroupedSearch(QueryComponent.java:1487)
> at
> org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:399)
> at
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:328)
> at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:211)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:2596)


with queries like

> q=*:*
> &fl=docId,id,typeId,versjonsId,start_date,out_link_*,regulars,spatials,associations
> &fq=(typeId:945 AND exportId:hex0945)
> &fq=docType:object
> &fq=+start_date:[0 TO 20200611] +(end_date:{20200611 TO *] (*:* -end_date:[* TO *])

where all fields except start_date and end_date have docValues.
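
Just to make that concrete, declaring docValues on those two fields would
look roughly like the sketch below in the schema; the field type "pint" is
an assumption (use whatever type the fields actually have), and adding
docValues to an existing field requires a full reindex:

  <!-- managed-schema sketch; "pint" is a placeholder type, and a reindex is
       required after adding docValues to existing fields. -->
  <field name="start_date" type="pint" indexed="true" stored="true" docValues="true"/>
  <field name="end_date" type="pint" indexed="true" stored="true" docValues="true"/>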

So I guess we'll just have to adjust our rate limiter so that Solr is not
overwhelmed.


On Sun, Jun 7, 2020 at 8:21 PM Marvin Bredal Lillehaug <[hidden email]> wrote:


--
Kind regards,
Marvin B. Lillehaug