CommitScheduler Thread blocked due to excessive number of Merging Threads


CommitScheduler Thread blocked due to excessive number of Merging Threads

yasoobhaider
Hi all

My team has tasked me with upgrading Solr from the version we are using (5.4) to the latest stable version, 6.6. I have been stuck on the indexing part for a few days now. 

First I'll list the requirements, then all the configuration settings I have tried. 

So in total I'm indexing about 2.5 million documents. The average document size is ~5KB. I have 10 (PHP) workers running in parallel, hitting Solr at ~1K docs/minute (this sometimes goes up to ~3K docs/minute). 

System specifications: 
RAM: 120GB 
Processors: 16 

Solr configuration: 
Heap size: 80GB 

------------------------------------------------------------------------------------------------------------ 
solrconfig.xml: (Relevant parts; please let me know if there's anything else you would like to look at) 

<autoCommit>
      <maxDocs>10000</maxDocs>
      <maxTime>3800000</maxTime>
      <openSearcher>true</openSearcher>
</autoCommit>

<autoSoftCommit>
      <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
</autoSoftCommit>

<ramBufferSizeMB>5000</ramBufferSizeMB>
<maxBufferedDocs>10000</maxBufferedDocs>

<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
      <int name="maxMergeAtOnce">30</int>
      <int name="segmentsPerTier">30</int>
</mergePolicyFactory>

<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
      <int name="maxMergeCount">8</int>
      <int name="maxThreadCount">7</int>
</mergeScheduler>

------------------------------------------------------------------------------------------------------------ 

The main problem: 

When I start indexing, everything is good until I reach about 2 million docs, which takes ~10 hours. But then the commitScheduler thread gets blocked. It is stuck at doStall() in ConcurrentMergeScheduler (CMS). Looking at the logs from InfoStream, I found a "too many merges; stalling" message from the commitScheduler thread, after which it gets stuck in the while loop forever. 

Here's the check that's stalling our commitScheduler thread. 

while (writer.hasPendingMerges() && mergeThreadCount() >= maxMergeCount) {
  ..
  ..
  if (verbose() && startStallTime == 0) {
    message("    too many merges; stalling...");
  }
  startStallTime = System.currentTimeMillis();
  doStall();
}

This is why I have set maxMergeCount and maxThreadCount explicitly in my solrconfig. I thought increasing the number of threads would ensure there is always one spare thread for a commit to go through. But now that I have increased the allowed number of threads, Lucene just spawns that many "Lucene Merge Thread"s and leaves none for when a commit comes along and triggers a merge. And then it gets stuck forever. 

Well, not really forever: I'm guessing that once one of the merge threads is removed (via removeMergeThread() in CMS) the commit will go through, but for some reason the merging is so slow that this doesn't happen (I gave it a couple of hours, but the commit thread was still stuck). Which brings us to the second problem. 
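To make the interplay concrete, here is a tiny standalone sketch (my own simplification, not the actual Lucene source) of the stall condition quoted above:

```java
// Simplified sketch of the CMS stall condition (paraphrased, not real Lucene
// code): a thread that triggers a merge stalls whenever all maxMergeCount
// merge "slots" are occupied and merges are still backlogged.
public class StallCheck {
    static boolean wouldStall(int liveMergeThreads, boolean hasPendingMerges, int maxMergeCount) {
        return hasPendingMerges && liveMergeThreads >= maxMergeCount;
    }

    public static void main(String[] args) {
        // With maxMergeCount=8, eight in-flight merges (seven running under
        // maxThreadCount=7, one queued) are enough to stall the commit thread.
        System.out.println(wouldStall(8, true, 8));  // all slots taken -> stall
        System.out.println(wouldStall(7, true, 8));  // one slot free -> no stall
    }
}
```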

------------------------------------------------------------------------------------------------------------ 

The second problem: 
Merging is extremely slow. I'm not sure what I'm missing here. Maybe there's a change in the 6.x line that has significantly hampered merging speed. From the thread dump, what I can see is that the "Lucene Merge Thread"s are in the RUNNABLE state, at a TreeMap.getEntry() call. Is this normal? 

Another thing I noticed was that disk IO is throttled at ~20MB/s. But I'm not sure whether this can actually hamper merging. 

My index size was ~10GB; I left it overnight (~6 hours) and almost no merging happened. 

Here's another InfoStream message from the logs. Just putting it here in case it helps. 

----- 

2017-09-06 14:11:07.921 INFO  (qtp834133664-115) [   x:collection1] o.a.s.u.LoggingInfoStream [MS][qtp834133664-115]: updateMergeThreads ioThrottle=true targetMBPerSec=23.6 MB/sec 
merge thread Lucene Merge Thread #4 estSize=5116.1 MB (written=4198.1 MB) runTime=8100.1s (stopped=0.0s, paused=142.5s) rate=19.7 MB/sec 
  now change from 19.7 MB/sec to 23.6 MB/sec 
merge thread Lucene Merge Thread #7 estSize=1414.3 MB (written=0.0 MB) runTime=0.0s (stopped=0.0s, paused=0.0s) rate=23.6 MB/sec 
  leave running at 23.6 MB/sec 
merge thread Lucene Merge Thread #5 estSize=1014.4 MB (written=427.2 MB) runTime=6341.9s (stopped=0.0s, paused=12.3s) rate=19.7 MB/sec 
  now change from 19.7 MB/sec to 23.6 MB/sec 
merge thread Lucene Merge Thread #3 estSize=752.8 MB (written=362.8 MB) runTime=8100.1s (stopped=0.0s, paused=12.4s) rate=19.7 MB/sec 
  now change from 19.7 MB/sec to 23.6 MB/sec 
merge thread Lucene Merge Thread #2 estSize=312.5 MB (written=151.9 MB) runTime=8100.7s (stopped=0.0s, paused=8.7s) rate=19.7 MB/sec 
  now change from 19.7 MB/sec to 23.6 MB/sec 
merge thread Lucene Merge Thread #6 estSize=87.7 MB (written=63.0 MB) runTime=3627.8s (stopped=0.0s, paused=0.9s) rate=19.7 MB/sec 
  now change from 19.7 MB/sec to 23.6 MB/sec 
merge thread Lucene Merge Thread #1 estSize=57.3 MB (written=21.7 MB) runTime=8101.2s (stopped=0.0s, paused=0.2s) rate=19.7 MB/sec 
  now change from 19.7 MB/sec to 23.6 MB/sec 
merge thread Lucene Merge Thread #0 estSize=4.6 MB (written=0.0 MB) runTime=8101.0s (stopped=0.0s, paused=0.0s) rate=unlimited 
  leave running at Infinity MB/sec 

----- 

I also increased my maxMergeAtOnce and segmentsPerTier from 10 to 20 and then to 30, in the hope that fewer merge threads would run at once, but that just results in more segments being created (I'm not sure why). I also tried going the other way by reducing them to 5, but that experiment failed quickly (the commit thread blocked). 

I increased ramBufferSizeMB to 5000MB so that fewer flushes happen, fewer segments are created, and therefore fewer merges are needed (I haven't dug deep here, so please correct me if this is something I should revert; our current (5.x) config has this set at 324MB). 

The autoCommit and autoSoftCommit settings look good to me, as I've turned off soft commits, and I am auto-committing every 10000 docs (every 5-10 minutes), which finishes smoothly unless it gets stuck in the first problem described above. 
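For what it's worth, here is the back-of-envelope arithmetic behind that commit interval (my own sketch, using the ingest rates from the start of the thread):

```java
// Rough arithmetic (my own, not output from Solr): how often an autoCommit
// with maxDocs=10000 fires at the stated ingest rates of ~1K-3K docs/minute.
public class CommitInterval {
    public static void main(String[] args) {
        int maxDocs = 10_000;
        for (int docsPerMin : new int[] {1_000, 3_000}) {
            double minutes = (double) maxDocs / docsPerMin;
            System.out.printf("%d docs/min -> commit every %.1f min%n", docsPerMin, minutes);
        }
    }
}
```

At the peak ~3K docs/minute the commits land a bit more often than every 5 minutes, but it is the same ballpark.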


Questions: 
1a. Why is Lucene spawning so many merging threads? 
1b. How can I make sure that there's always room for the Commit thread to go through? 
1c. Is it normal for all merge threads to be in the RUNNABLE state at TreeMap.getEntry()? 

2a. Is merging slower in 6.x than 5.x? 
2b. What can I do to make it go faster? 
2c. Could disk IO throttling be an issue? If so, how can I resolve it? I tried providing ioThrottle=false in solrconfig but that just throws an error. 

I have been trying different things for a week now. Please let me know if there's anything else I can read/try other than the things I have pointed above. 

Regards 
Yasoob Haider

Re: CommitScheduler Thread blocked due to excessive number of Merging Threads

yasoobhaider
So I did a little more digging around why the merging is taking so long, and
it looks like merging postings is the culprit.

On the 5.4 version, merging 500 docs is taking approximately 100 msec, while
on the 6.6 version, it is taking more than 3000 msec. The difference seems
to get worse when more docs are being merged.

Any ideas why this may be the case?

Yasoob



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: CommitScheduler Thread blocked due to excessive number of Merging Threads

Shawn Heisey
On 9/7/2017 4:25 AM, yasoobhaider wrote:
> So I did a little more digging around why the merging is taking so
> long, and it looks like merging postings is the culprit. On the 5.4
> version, merging 500 docs is taking approximately 100 msec, while on
> the 6.6 version, it is taking more than 3000 msec. The difference
> seems to get worse when more docs are being merged. Any ideas why this
> may be the case?

The rest of this thread is completely lost here, I only found the info
by going to Nabble, which is a mirror of the mailing list in forum
format.  The mailing list is the canonical repository.

Setting the ramBufferSizeMB to nearly 5 gigabytes is only going to be
helpful if the docs you are indexing into Solr are enormous -- many
megabytes of text data in each one.  Testing by Solr developers has
shown that values above about 128MB do not typically provide any
performance advantage with normal sized documents.  The commit
characteristics will have more to do with how large each segment is than
the ramBufferSizeMB.  The default ramBufferSizeMB value in modern Solr
versions is 100.
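A quick back-of-envelope check (my own arithmetic, using the ~5KB average document size stated earlier in the thread) shows why ramBufferSizeMB rarely matters here: the doc-count commit fires long before even the default 100MB buffer fills.

```java
// Back-of-envelope sketch (my own numbers, assuming the ~5KB average doc
// size from earlier in this thread): which flush trigger fires first,
// the default 100MB RAM buffer or an autoCommit at maxDocs=10000?
public class FlushTrigger {
    public static void main(String[] args) {
        double docSizeMB = 5.0 / 1024;                            // ~5KB per doc
        int ramBufferMB = 100;                                    // modern Solr default
        long docsToFillBuffer = (long) (ramBufferMB / docSizeMB); // ~20480 docs
        long autoCommitMaxDocs = 10_000;
        System.out.println("docs to fill 100MB buffer: " + docsToFillBuffer);
        System.out.println("autoCommit fires first: " + (autoCommitMaxDocs < docsToFillBuffer));
    }
}
```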

Assuming we are dealing with relatively small documents, I would
recommend these settings in indexConfig (removing ramBufferSizeMB,
mergePolicyFactory, and maxBufferedDocs entirely):

<autoCommit>
      <maxTime>60000</maxTime>
      <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
      <maxTime>600000</maxTime>
</autoSoftCommit>

<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
      <int name="maxMergeCount">6</int>
      <int name="maxThreadCount">1</int>
</mergeScheduler>

If your data is on standard disks, then you want maxThreadCount at one. 
If it's on SSD, then you can raise it a little bit, but I wouldn't go
beyond about 2 or 3.  On standard disks with many threads writing merged
segments, the disk will begin thrashing excessively and I/O will slow to
a crawl.

If the documents are huge, then you can raise ramBufferSizeMB, but five
gigabytes is REALLY BIG and will require a very large heap.

If there is good reason to increase the values in mergePolicy, then this
is what I would recommend:

<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
      <int name="maxMergeAtOnce">30</int>
      <int name="segmentsPerTier">30</int>
      <int name="maxMergeAtOnceExplicit">90</int>
</mergePolicyFactory>

The settings I've described here may help, or it may do nothing.  If it
doesn't help, then the problems may be memory-related, which is a whole
separate discussion.

When Lucene says "too many merges; stalling" it means there are
many merges scheduled at the same time, which usually means that there
are multiple *levels* of merging scheduled -- one that combines a bunch
of initial-level segments into one second-level segment, one that
combines multiple second-level segments into third-level segments, and
so on.  The "stalling" means that the *indexing* thread is paused until
the number of merges drops below maxMergeCount.  If this is happening
with maxMergeCount at eight, it is likely because of the current
autoCommit maxDocs setting of 10000 -- each of the initial segments is
very small, so there are a LOT of segments that need merging.  The
autoCommit and autoSoftCommit settings that I provided will hopefully
make that less of a problem.

Merging segments goes slower than the speed of your disks.  This is
because Lucene must collect a lot of information from each source
segment and combine it in memory to write a new segment.  The gathering
and combining is much slower than modern disk speeds.

Thanks,
Shawn


Re: CommitScheduler Thread blocked due to excessive number of Merging Threads

yasoobhaider
Hi Shawn

Thanks for putting the settings in context. This definitely helps.

Before I put these settings, I was doing a bit more digging, and wanted to
really understand why the merging was so slow. Looking at the thread dump of
Solr 6.6 and Solr 5.4, I found that in the merging process, the merging of
"postings" is taking a lot of time on 6.6 compared to 5.4.

In a merge containing 500 docs, it took on average 100msec on 5.4, vs
3500msec on 6.6.

I compared the source code for the two versions and found that different
merge functions were being used to merge the postings. In 5.4, the default
merge method of FieldsConsumer class was being used. While in 6.6, the
PerFieldPostingsFormat's merge method is being used. I checked and it looks
like this change went in Solr 6.3. So I replaced the 6.6 instance with 6.2.1
and re-indexed all the data, and it is working very well, even with the
settings I had initially used.

This is the issue that prompted the change:
https://issues.apache.org/jira/browse/LUCENE-7456

I plan to experiment with the settings provided by you and see if it further
helps our case. But out of curiosity I wanted to understand what is the
change in the two algorithms that it has such drastic effect on the merging
speed.

Yasoob


