Possible memory leak after segment merge? (related to DocValues?)


Possible memory leak after segment merge? (related to DocValues?)

Greg Preston
Hello,

I'm loading up our solr cloud with data (from a solrj client) and
running into a weird memory issue.  I can reliably reproduce the
problem.

- Using Solr Cloud 4.4.0 (also replicated with 4.6.0)
- 24 solr nodes (one shard each), spread across 3 physical hosts, each
host has 256G of memory
- index and tlogs on ssd
- Xmx=7G, G1GC
- Java 1.7.0_25
- schema and solrconfig.xml attached

I'm using composite routing to route documents with the same clientId
to the same shard.  After several hours of indexing, I occasionally
see an IndexWriter go OOM.  I think that's a symptom.  When that
happens, indexing continues, and that node's tlog starts to grow.
When I notice this, I stop indexing, and bounce the problem node.
That's where it gets interesting.
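For reference, the routing mentioned above is the stock compositeId scheme: the shard key (our clientId) is prepended to the document id with a "!" separator, and Solr hashes the prefix to pick the shard. A minimal sketch of how such ids are built — the clientId/docId naming is our convention, not anything special in Solr:

```java
// Sketch: building document ids for Solr compositeId routing.
// Solr hashes the portion before the '!' to choose the shard, so
// all documents sharing a clientId land on the same shard.
public class CompositeIds {

    static String compositeId(String clientId, String docId) {
        return clientId + "!" + docId;
    }

    public static void main(String[] args) {
        // e.g. every doc for client42 routes to the same shard
        System.out.println(compositeId("client42", "doc7")); // client42!doc7
    }
}
```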

Upon bouncing, the tlog replays, and then segments merge.  Once the
merging is complete, the heap is fairly full, and forced full GC only
helps a little.  But if I then bounce the node again, the heap usage
goes way down, and stays low until the next segment merge.  I believe
segment merges are also what causes the original OOM.

More details:

Index on disk for this node is ~13G, tlog is ~2.5G.
See attached mem1.png.  This is a jconsole view of the heap during the
following:

(Solr cloud node started at the left edge of this graph)

A) One CPU core pegged at 100%.  Thread dump shows:
"Lucene Merge Thread #0" daemon prio=10 tid=0x00007f5a3c064800
nid=0x7a74 runnable [0x00007f5a41c5f000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.lucene.util.fst.Builder.add(Builder.java:397)
        at org.apache.lucene.codecs.BlockTreeTermsWriter$TermsWriter.finishTerm(BlockTreeTermsWriter.java:1000)
        at org.apache.lucene.codecs.TermsConsumer.merge(TermsConsumer.java:112)
        at org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:72)
        at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:365)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:98)
        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
        at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)

B) One CPU core pegged at 100%.  Manually triggered GC.  Lots of
memory freed.  Thread dump shows:
"Lucene Merge Thread #0" daemon prio=10 tid=0x00007f5a3c064800
nid=0x7a74 runnable [0x00007f5a41c5f000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.lucene.codecs.DocValuesConsumer$1$1.hasNext(DocValuesConsumer.java:127)
        at org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:144)
        at org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:92)
        at org.apache.lucene.codecs.DocValuesConsumer.mergeNumericField(DocValuesConsumer.java:112)
        at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:221)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:119)
        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
        at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)

C) One CPU core pegged at 100%.  Manually triggered GC.  No memory
freed.  Thread dump shows:
"Lucene Merge Thread #0" daemon prio=10 tid=0x00007f5a3c064800
nid=0x7a74 runnable [0x00007f5a41c5f000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.lucene.codecs.DocValuesConsumer$1$1.hasNext(DocValuesConsumer.java:127)
        at org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:108)
        at org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:92)
        at org.apache.lucene.codecs.DocValuesConsumer.mergeNumericField(DocValuesConsumer.java:112)
        at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:221)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:119)
        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
        at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)

D) One CPU core pegged at 100%.  Thread dump shows:
"Lucene Merge Thread #0" daemon prio=10 tid=0x00007f5a3c064800
nid=0x7a74 runnable [0x00007f5a41c5f000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.lucene.codecs.compressing.CompressingTermVectorsReader.get(CompressingTermVectorsReader.java:322)
        at org.apache.lucene.index.SegmentReader.getTermVectors(SegmentReader.java:169)
        at org.apache.lucene.codecs.compressing.CompressingTermVectorsWriter.merge(CompressingTermVectorsWriter.java:789)
        at org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:312)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:130)
        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
        at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)

E) CPU usage drops to nominal levels.  Thread dump shows no Lucene Merge Thread.

F) Manually triggered full GC.  Some memory freed, but much remains.

G) Restarted solr.  Very little memory used.


Throughout all of this, there was no indexing or querying happening.
In order to determine what's using up the memory, I took a
heap dump at point (F) and analyzed it in Eclipse MAT (see attached
screenshot).  This shows 311 instances of Lucene42DocValuesProducer$3,
each holding a large byte[].  By attaching a remote debugger and
re-running, it looks like there is one of these byte[] for each field
in the schema (we have several of the "dim_*" dynamic fields).  And as
far as I know, I'm not using DocValues at all.


Any hints as to what might be going on here would be greatly
appreciated.  It takes me about 10 minutes to reproduce this, so I'm
willing to try things.  I don't know enough about the internals of
solr's memory usage to proceed much further on my own.

Thank you.

-Greg

schema.xml (13K)
solrconfig.xml (15K)

Re: Possible memory leak after segment merge? (related to DocValues?)

Joel Bernstein
Hi Greg,

I have a suspicion that the problem might be related to, or exacerbated by,
overly large tlogs. Can you adjust your autoCommit to 15 seconds? Leave
openSearcher = false, and I would remove the maxDocs setting as well. If you
rerun under those commit settings, it's possible the OOM errors will stop
occurring.
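Concretely, a sketch of the solrconfig.xml change being suggested here (the 15-second value is Joel's suggestion; the element names are standard Solr autoCommit settings):

```xml
<!-- Hard-commit every 15 seconds so the tlog stays small.
     openSearcher=false means commits don't open a new searcher,
     so this only flushes the index and rolls the tlog. -->
<autoCommit>
  <maxTime>15000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
```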

Joel

Joel Bernstein
Search Engineer at Heliosearch


On Mon, Dec 23, 2013 at 3:07 PM, Greg Preston <[hidden email]> wrote:

> [snip]

Re: Possible memory leak after segment merge? (related to DocValues?)

Greg Preston
Hi Joel,

Thanks for the suggestion.  I could see how decreasing autoCommit time
would reduce tlog size, and how that could possibly be related to the
original OOM error.  I'm not seeing how that would make any difference
once a tlog exists, though?

I have a saved off copy of my data dir that has the 13G index and 2.5G
tlog.  So I can reproduce the replay -> merge -> memory usage issue
very quickly.  Changing the autoCommit to possibly avoid the initial
OOM will take a good bit longer to try to reproduce.  I may try that
later in the week.

-Greg


On Mon, Dec 23, 2013 at 12:20 PM, Joel Bernstein <[hidden email]> wrote:

> Hi Greg,
>
> I have a suspicion that the problem might be related to, or exacerbated by,
> overly large tlogs. Can you adjust your autoCommit to 15 seconds? Leave
> openSearcher = false, and I would remove the maxDocs setting as well. If you
> rerun under those commit settings, it's possible the OOM errors will stop
> occurring.
>
> Joel
>
> Joel Bernstein
> Search Engineer at Heliosearch
>
>
> On Mon, Dec 23, 2013 at 3:07 PM, Greg Preston <[hidden email]> wrote:
>
>> [snip]

Re: Possible memory leak after segment merge? (related to DocValues?)

Joel Bernstein
Greg,

There is a memory component to the tlog, which supports realtime gets. This
memory component grows until there is a commit, so it will appear like a
leak. I suspect that replaying a tlog that was big enough to possibly cause
OOM is also problematic.

One thing you might want to try is going to 15 second commits, and then
kill the Solr instance between the commits. Then watch the memory as the
replaying occurs with the smaller tlog.

Joel




Joel Bernstein
Search Engineer at Heliosearch


On Mon, Dec 23, 2013 at 4:17 PM, Greg Preston <[hidden email]> wrote:

> Hi Joel,
>
> Thanks for the suggestion.  I could see how decreasing autoCommit time
> would reduce tlog size, and how that could possibly be related to the
> original OOM error.  I'm not seeing how that would make any difference
> once a tlog exists, though?
>
> I have a saved off copy of my data dir that has the 13G index and 2.5G
> tlog.  So I can reproduce the replay -> merge -> memory usage issue
> very quickly.  Changing the autoCommit to possibly avoid the initial
> OOM will take a good bit longer to try to reproduce.  I may try that
> later in the week.
>
> -Greg
>
>
> On Mon, Dec 23, 2013 at 12:20 PM, Joel Bernstein <[hidden email]> wrote:
>
> > [snip]
Re: Possible memory leak after segment merge? (related to DocValues?)

Greg Preston
Interesting.  In my original post, though, the memory growth (during restart)
happens after the tlog has finished replaying, during the merge itself.

-Greg


On Mon, Dec 23, 2013 at 2:06 PM, Joel Bernstein <[hidden email]> wrote:

> Greg,
>
> There is a memory component to the tlog, which supports realtime gets. This
> memory component grows until there is a commit, so it will appear like a
> leak. I suspect that replaying a tlog that was big enough to possibly cause
> OOM is also problematic.
>
> One thing you might want to try is going to 15 second commits, and then
> kill the Solr instance between the commits. Then watch the memory as the
> replaying occurs with the smaller tlog.
>
> Joel
>
>
>
>
> Joel Bernstein
> Search Engineer at Heliosearch
>
>
> On Mon, Dec 23, 2013 at 4:17 PM, Greg Preston <[hidden email]> wrote:
>
>> Hi Joel,
>>
>> Thanks for the suggestion.  I could see how decreasing autoCommit time
>> would reduce tlog size, and how that could possibly be related to the
>> original OOM error.  I'm not seeing how that would make any difference
>> once a tlog exists, though?
>>
>> I have a saved off copy of my data dir that has the 13G index and 2.5G
>> tlog.  So I can reproduce the replay -> merge -> memory usage issue
>> very quickly.  Changing the autoCommit to possibly avoid the initial
>> OOM will take a good bit longer to try to reproduce.  I may try that
>> later in the week.
>>
>> -Greg
>>
>>
>> On Mon, Dec 23, 2013 at 12:20 PM, Joel Bernstein <[hidden email]>
>> wrote:
>> > Hi Greg,
>> >
>> > I have a suspicion that the problem might be related to, or exacerbated
>> > by, overly large tlogs. Can you adjust your autoCommit to 15 seconds?
>> > Leave openSearcher = false. I would remove the maxDocs setting as well.
>> > If you try rerunning under those commit settings, it's possible the OOM
>> > errors will stop occurring.
>> >
>> > Joel
>> >
>> > Joel Bernstein
>> > Search Engineer at Heliosearch
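Joel's suggested commit settings would look roughly like this in solrconfig.xml. This is only a sketch following the standard Solr 4.x `<updateHandler>` syntax; the 15-second value comes from his suggestion, and the comments are illustrative:

```xml
<!-- Inside <updateHandler> in solrconfig.xml -->
<autoCommit>
  <maxTime>15000</maxTime>           <!-- hard commit every 15 seconds -->
  <openSearcher>false</openSearcher> <!-- don't open a new searcher on hard commit -->
  <!-- no <maxDocs> element, per the suggestion to remove it -->
</autoCommit>
```

More frequent hard commits keep the tlog small, since a hard commit truncates it; with openSearcher=false the commits don't churn searchers or caches.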
>> >
Re: Possible memory leak after segment merge? (related to DocValues?)

Joel Bernstein
Yeah, it sounds like a leak might be there. The huge tlog might just have
magnified its impact.

Joel Bernstein
Search Engineer at Heliosearch


On Mon, Dec 23, 2013 at 5:17 PM, Greg Preston <[hidden email]> wrote:

> Interesting.  In my original post, the memory growth (during restart)
> occurs after the tlog is done replaying, but during the merge.
>
> -Greg
>
Re: Possible memory leak after segment merge? (related to DocValues?)

Greg Preston
In reply to this post by Greg Preston
Does anybody with knowledge of solr internals know why I'm seeing
instances of Lucene42DocValuesProducer when I don't have any fields
that are using DocValues?  Or am I misunderstanding what this class is
for?

-Greg


On Mon, Dec 23, 2013 at 12:07 PM, Greg Preston
<[hidden email]> wrote:

> Hello,
>
> I'm loading up our solr cloud with data (from a solrj client) and
> running into a weird memory issue.  I can reliably reproduce the
> problem.
>
> - Using Solr Cloud 4.4.0 (also replicated with 4.6.0)
> - 24 solr nodes (one shard each), spread across 3 physical hosts, each
> host has 256G of memory
> - index and tlogs on ssd
> - Xmx=7G, G1GC
> - Java 1.7.0_25
> - schema and solrconfig.xml attached
>
> I'm using composite routing to route documents with the same clientId
> to the same shard.  After several hours of indexing, I occasionally
> see an IndexWriter go OOM.  I think that's a symptom.  When that
> happens, indexing continues, and that node's tlog starts to grow.
> When I notice this, I stop indexing, and bounce the problem node.
> That's where it gets interesting.
>
> Upon bouncing, the tlog replays, and then segments merge.  Once the
> merging is complete, the heap is fairly full, and forced full GC only
> helps a little.  But if I then bounce the node again, the heap usage
> goes way down, and stays low until the next segment merge.  I believe
> segment merges are also what causes the original OOM.
>
> More details:
>
> Index on disk for this node is ~13G, tlog is ~2.5G.
> See attached mem1.png.  This is a jconsole view of the heap during the
> following:
>
> (Solr cloud node started at the left edge of this graph)
>
> A) One CPU core pegged at 100%.  Thread dump shows:
> "Lucene Merge Thread #0" daemon prio=10 tid=0x00007f5a3c064800
> nid=0x7a74 runnable [0x00007f5a41c5f000]
>    java.lang.Thread.State: RUNNABLE
>         at org.apache.lucene.util.fst.Builder.add(Builder.java:397)
>         at org.apache.lucene.codecs.BlockTreeTermsWriter$TermsWriter.finishTerm(BlockTreeTermsWriter.java:1000)
>         at org.apache.lucene.codecs.TermsConsumer.merge(TermsConsumer.java:112)
>         at org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:72)
>         at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:365)
>         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:98)
>         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
>         at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
>
> B) One CPU core pegged at 100%.  Manually triggered GC.  Lots of
> memory freed.  Thread dump shows:
> "Lucene Merge Thread #0" daemon prio=10 tid=0x00007f5a3c064800
> nid=0x7a74 runnable [0x00007f5a41c5f000]
>    java.lang.Thread.State: RUNNABLE
>         at org.apache.lucene.codecs.DocValuesConsumer$1$1.hasNext(DocValuesConsumer.java:127)
>         at org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:144)
>         at org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:92)
>         at org.apache.lucene.codecs.DocValuesConsumer.mergeNumericField(DocValuesConsumer.java:112)
>         at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:221)
>         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:119)
>         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
>         at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
>
> C) One CPU core pegged at 100%.  Manually triggered GC.  No memory
> freed.  Thread dump shows:
> "Lucene Merge Thread #0" daemon prio=10 tid=0x00007f5a3c064800
> nid=0x7a74 runnable [0x00007f5a41c5f000]
>    java.lang.Thread.State: RUNNABLE
>         at org.apache.lucene.codecs.DocValuesConsumer$1$1.hasNext(DocValuesConsumer.java:127)
>         at org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:108)
>         at org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:92)
>         at org.apache.lucene.codecs.DocValuesConsumer.mergeNumericField(DocValuesConsumer.java:112)
>         at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:221)
>         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:119)
>         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
>         at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
>
> D) One CPU core pegged at 100%.  Thread dump shows:
> "Lucene Merge Thread #0" daemon prio=10 tid=0x00007f5a3c064800
> nid=0x7a74 runnable [0x00007f5a41c5f000]
>    java.lang.Thread.State: RUNNABLE
>         at org.apache.lucene.codecs.compressing.CompressingTermVectorsReader.get(CompressingTermVectorsReader.java:322)
>         at org.apache.lucene.index.SegmentReader.getTermVectors(SegmentReader.java:169)
>         at org.apache.lucene.codecs.compressing.CompressingTermVectorsWriter.merge(CompressingTermVectorsWriter.java:789)
>         at org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:312)
>         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:130)
>         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
>         at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
>
> E) CPU usage drops to nominal levels.  Thread dump shows no Lucene Merge Thread.
>
> F) Manually triggered full GC.  Some memory freed, but much remains.
>
> G) Restarted solr.  Very little memory used.
>
>
> Throughout all of this, there was no indexing or querying happening.
> In order to determine what's using up the memory, I took a
> heap dump at point (F) and analyzed it in Eclipse MAT (see attached
> screenshot).  This shows 311 instances of Lucene42DocValuesProducer$3,
> each holding a large byte[].  By attaching a remote debugger and
> re-running, it looks like there is one of these byte[] for each field
> in the schema (we have several of the "dim_*" dynamic fields).  And as
> far as I know, I'm not using DocValues at all.
>
>
> Any hints as to what might be going on here would be greatly
> appreciated.  It takes me about 10 minutes to reproduce this, so I'm
> willing to try things.  I don't know enough about the internals of
> solr's memory usage to proceed much further on my own.
>
> Thank you.
>
> -Greg
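For reference, the composite routing described in the quoted post works by prefixing each document id with a shard key: SolrCloud's CompositeIdRouter hashes the part before the `!`, so ids sharing a prefix land on the same shard. A minimal sketch (the clientId/docId values here are hypothetical, not from Greg's data):

```java
// Sketch of composite-id construction for SolrCloud's CompositeIdRouter.
// Documents whose ids share the prefix before '!' hash to the same shard.
public class CompositeIdExample {
    static String routeKey(String clientId, String docId) {
        return clientId + "!" + docId;
    }

    public static void main(String[] args) {
        // Both ids share the "client42" prefix, so both documents
        // route to the same shard.
        System.out.println(routeKey("client42", "doc-001")); // client42!doc-001
        System.out.println(routeKey("client42", "doc-002")); // client42!doc-002
    }
}
```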
Re: Possible memory leak after segment merge? (related to DocValues?)

Michael McCandless-2
Likely this is for field norms, which use doc values under the hood.

Mike McCandless

http://blog.mikemccandless.com
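If norms really are what is holding the memory, fields that don't need length normalization for scoring can disable them. A schema.xml sketch; the field type and other attributes here are assumptions, not taken from Greg's attached schema:

```xml
<!-- Hypothetical schema.xml entry: omitNorms="true" stops Lucene from
     writing (and later loading) norms for matching fields. -->
<dynamicField name="dim_*" type="string" indexed="true" stored="true"
              omitNorms="true"/>
```

Note that changing omitNorms only affects newly indexed segments; existing segments keep their norms until they are rewritten.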


On Thu, Dec 26, 2013 at 5:03 PM, Greg Preston
<[hidden email]> wrote:

> Does anybody with knowledge of solr internals know why I'm seeing
> instances of Lucene42DocValuesProducer when I don't have any fields
> that are using DocValues?  Or am I misunderstanding what this class is
> for?
>
> -Greg
>
>>         at org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:92)
>>         at org.apache.lucene.codecs.DocValuesConsumer.mergeNumericField(DocValuesConsumer.java:112)
>>         at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:221)
>>         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:119)
>>         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
>>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
>>         at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
>>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
>>
>> D) One CPU core pegged at 100%.  Thread dump shows:
>> "Lucene Merge Thread #0" daemon prio=10 tid=0x00007f5a3c064800
>> nid=0x7a74 runnable [0x00007f5a41c5f000]
>>    java.lang.Thread.State: RUNNABLE
>>         at org.apache.lucene.codecs.compressing.CompressingTermVectorsReader.get(CompressingTermVectorsReader.java:322)
>>         at org.apache.lucene.index.SegmentReader.getTermVectors(SegmentReader.java:169)
>>         at org.apache.lucene.codecs.compressing.CompressingTermVectorsWriter.merge(CompressingTermVectorsWriter.java:789)
>>         at org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:312)
>>         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:130)
>>         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
>>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
>>         at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
>>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
>>
>> E) CPU usage drops to nominal levels.  Thread dump shows no Lucene Merge Thread.
>>
>> F) Manually triggered full GC.  Some memory freed, but much remains.
>>
>> G) Restarted solr.  Very little memory used.
>>
>>
>> Throughout all of this, there was no indexing or querying happening.
>> In order to determine what's using up the memory, I took a
>> heap dump at point (F) and analyzed it in Eclipse MAT (see attached
>> screenshot).  This shows 311 instances of Lucene42DocValuesProducer$3,
>> each holding a large byte[].  By attaching a remote debugger and
>> re-running, it looks like there is one of these byte[] for each field
>> in the schema (we have several of the "dim_*" dynamic fields).  And as
>> far as I know, I'm not using DocValues at all.
>>
>>
>> Any hints as to what might be going on here would be greatly
>> appreciated.  It takes me about 10 minutes to reproduce this, so I'm
>> willing to try things.  I don't know enough about the internals of
>> solr's memory usage to proceed much further on my own.
>>
>> Thank you.
>>
>> -Greg

Re: Possible memory leak after segment merge? (related to DocValues?)

Greg Preston
Interesting.  I'm not using score at all (all searches have an
explicit sort defined).  I'll try setting omitNorms="true" on all my fields
and see if I can reproduce.
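
For reference, a sketch of what that schema.xml change might look like. The field names `clientId` and `dim_*` come from earlier in the thread; the `type` values here are illustrative assumptions, not taken from the attached schema:

```xml
<!-- Illustrative schema.xml snippet (field types are assumed).
     omitNorms="true" drops the per-document length-normalization data,
     which in Lucene 4.2 is stored via the doc-values machinery and is
     what gets merged (and held on-heap) during segment merges. -->
<field name="clientId" type="string" indexed="true" stored="true"
       omitNorms="true"/>
<dynamicField name="dim_*" type="string" indexed="true" stored="true"
              omitNorms="true"/>
```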

Thanks.

-Greg


On Fri, Dec 27, 2013 at 4:25 AM, Michael McCandless
<[hidden email]> wrote:

> Likely this is for field norms, which use doc values under the hood.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Dec 26, 2013 at 5:03 PM, Greg Preston
> <[hidden email]> wrote:
>> Does anybody with knowledge of solr internals know why I'm seeing
>> instances of Lucene42DocValuesProducer when I don't have any fields
>> that are using DocValues?  Or am I misunderstanding what this class is
>> for?
>>
>> -Greg

Re: Possible memory leak after segment merge? (related to DocValues?)

Greg Preston
In reply to this post by Michael McCandless-2
That was it.  Setting omitNorms="true" on all fields fixed my problem.
I left it indexing all weekend, and heap usage still looks great.

I'm still not clear why bouncing the solr instance freed up memory,
unless the in-memory structure for this norms data is lazily loaded
somehow.

Anyway, thank you very much for the suggestion.

-Greg


On Fri, Dec 27, 2013 at 4:25 AM, Michael McCandless
<[hidden email]> wrote:

> Likely this is for field norms, which use doc values under the hood.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Dec 26, 2013 at 5:03 PM, Greg Preston
> <[hidden email]> wrote:
>> Does anybody with knowledge of solr internals know why I'm seeing
>> instances of Lucene42DocValuesProducer when I don't have any fields
>> that are using DocValues?  Or am I misunderstanding what this class is
>> for?
>>
>> -Greg

Re: Possible memory leak after segment merge? (related to DocValues?)

Michael McCandless-2
On Mon, Dec 30, 2013 at 1:22 PM, Greg Preston
<[hidden email]> wrote:
> That was it.  Setting omitNorms="true" on all fields fixed my problem.
>  I left it indexing all weekend, and heap usage still looks great.

Good!

> I'm still not clear why bouncing the solr instance freed up memory,
> unless the in-memory structure for this norms data is lazily loaded
> somehow.

In fact it is lazily loaded: the norms are decoded the first time a
search (well, the Similarity) needs them for scoring.
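
That lazy loading would explain the restart behavior seen above: nothing is
resident until the first scored search touches the norms, and once decoded
they stay referenced for the life of the reader.  As a rough illustration of
the pattern (plain Java, not actual Lucene code):

```java
import java.util.function.Supplier;

// Illustrative only: a memoizing holder, similar in spirit to how a reader
// can defer decoding norms until a Similarity first asks for them.  After a
// restart nothing is loaded; after the first access, the decoded byte[]
// remains referenced until the holder itself is released.
final class Lazy<T> {
    private final Supplier<T> loader;
    private T value;  // stays null (cheap) until the first get()

    Lazy(Supplier<T> loader) { this.loader = loader; }

    synchronized T get() {
        if (value == null) {
            value = loader.get();  // e.g. decode norms into a large byte[]
        }
        return value;
    }

    synchronized boolean isLoaded() { return value != null; }
}

public class LazyNormsDemo {
    public static void main(String[] args) {
        Lazy<byte[]> norms = new Lazy<>(() -> new byte[16]);
        System.out.println(norms.isLoaded());  // nothing resident yet
        norms.get();                           // first "scored search"
        System.out.println(norms.isLoaded());  // resident from now on
    }
}
```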

> Anyway, thank you very much for the suggestion.

You're welcome.

Mike McCandless

http://blog.mikemccandless.com