High CPU and Physical Memory Usage in solr with 4000 user load


High CPU and Physical Memory Usage in solr with 4000 user load

rubi.hali
(The author has deleted this message.)

Re: High CPU and Physical Memory Usage in solr with 4000 user load

Shawn Heisey-2
On 2/8/2018 12:41 AM, rubi.hali wrote:
> We are using Solr 6.6.2 with one master and 4 slaves, each having 4 CPU cores
> and 16GB RAM. We were doing load testing with 4000 users and 800-odd search
> keywords, which resulted in 95% CPU usage in less than 3 minutes and
> affected our query responses. There was also a spike in physical memory, which
> was not going down even after we stopped sending load.

How many queries per second is that sending?  Thousands of queries per
second may need more than five servers.

> Our JVM heap given to Solr is 8GB, which still leaves 8GB for the OS.
>
> We have 4 lakh (400,000) documents in Solr.

How big (disk space) is all the index data being handled by one of your
Solr servers?  Does all that index data add up to 400000 documents, or
is that just the document count for one index out of multiple?

> Our Cache Configurations done in SOLR are
>
> <documentCache class="solr.LRUCache"
>                    size="50000"
>                    initialSize="30000"
>                    autowarmCount="0"/>

This is fairly large.  It may not be a problem, but that will depend on
how big each document is.

>  <filterCache class="solr.FastLRUCache"
>                  size="4096"
>                  initialSize="512"
>                  autowarmCount="20"/>

This is quite large for a filterCache.  It probably needs to be
smaller.  Be careful with autowarming on the filterCache -- sometimes
filters can execute very slowly.  I had to reduce my autowarmCount on
the filterCache to *four* in order for commits to be fast enough.

> <queryResultCache class="solr.LRUCache"
>                      size="256"
>                      initialSize="256"
>                      autowarmCount="0"/>

I would probably use a non-zero autowarmCount here.  But don't make it
TOO large.
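Putting those three suggestions together, here is a sketch of what the revised cache section of solrconfig.xml could look like. The sizes below are starting points, not values verified for this index; check the cache hit ratios in the admin UI under load before settling on anything.

```xml
<!-- Sketch only: sizes are assumptions to be validated against
     cache statistics, not tested recommendations. -->
<documentCache class="solr.LRUCache"
               size="50000"
               initialSize="30000"
               autowarmCount="0"/>

<!-- Much smaller filterCache; a tiny autowarmCount keeps commits
     fast even when individual filters execute slowly. -->
<filterCache class="solr.FastLRUCache"
             size="512"
             initialSize="512"
             autowarmCount="4"/>

<!-- Modest non-zero autowarm so frequent queries stay warm after a
     new searcher opens -- but not TOO large. -->
<queryResultCache class="solr.LRUCache"
                  size="256"
                  initialSize="256"
                  autowarmCount="16"/>
```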

> We have enabled autocommit
>      <autoCommit>  
>        <maxDocs>${solr.autoCommit.maxDocs:25000}</maxDocs>
>   <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
>        <openSearcher>true</openSearcher> (This is true only in case of
> slaves)
>      </autoCommit>

I would recommend that you remove maxDocs and let it just be
time-based.  Also, *all* servers should have openSearcher set to false
on autoCommit.

>  We are also doing softcommit
>  <autoSoftCommit>
>        <maxTime>${solr.autoSoftCommit.maxTime:300000}</maxTime>
>  </autoSoftCommit>

This is good.  You could probably lower it to two minutes instead of five.
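Applying both recommendations, the commit settings would look something like the sketch below (the 60-second hard commit interval is kept from the original config; the two-minute soft commit is the suggested change):

```xml
<!-- Time-based hard commit only (maxDocs removed), with
     openSearcher=false on ALL servers. Hard commits then just
     flush segments to disk without opening a new searcher. -->
<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<!-- Soft commit controls when new documents become visible;
     two minutes instead of five. -->
<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:120000}</maxTime>
</autoSoftCommit>
```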

> Our queries use grouping, so the query result cache doesn't get used.
> But still, under heavy load we are seeing this behaviour, which is
> resulting in high response times.
>
> Please suggest if there is any configuration mismatch or OS issue we
> should resolve to bring down our high response times.

If your query load is going to be high, you probably need to add more
servers.

To check whether things are optimized well on each server:  What OS is
it running on?  I would like to get some more information from your
install, but the exact method for obtaining that information will vary
depending on the OS.

Can you also provide a large solr_gc.log file so that can be analyzed?
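If Solr was started with the included bin/solr script, the GC log is written alongside the other logs. A sketch for locating it, assuming the default logs directory (the path is an assumption; adjust it if SOLR_LOGS_DIR was changed in your install):

```shell
# Default logs directory is an assumption -- override with
# SOLR_LOGS_DIR if your install puts logs elsewhere.
SOLR_LOGS="${SOLR_LOGS_DIR:-/opt/solr/server/logs}"
ls -lh "$SOLR_LOGS"/solr_gc.log* 2>/dev/null \
  || echo "no GC logs found under $SOLR_LOGS"
```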

Thanks,
Shawn


Re: High CPU and Physical Memory Usage in solr with 4000 user load

rubi.hali
(The author has deleted this message.)

Re: High CPU and Physical Memory Usage in solr with 4000 user load

Shawn Heisey-2
On 2/9/2018 12:48 AM, rubi.hali wrote:
> As threads are blocked, our CPU is reaching 90% even when heap or
> OS memory is not getting used at all.
>
> One of the BLOCKED thread snippets is attached below. Most of the
> threads are blocked either on the Jetty connector or on FieldCacheImpl
> for docValues.  Slave1ThreadDump5.txt
> <http://lucene.472066.n3.nabble.com/file/t493418/Slave1ThreadDump5.txt>
> Slave1ThreadDump4.txt
> <http://lucene.472066.n3.nabble.com/file/t493418/Slave1ThreadDump4.txt>

Threads being blocked shouldn't cause high CPU.  It would generally be
caused by whatever is causing the threads to be blocked.

I uploaded the Slave1ThreadDump5.txt file from the first link to the
http://fastthread.io website.

The analysis noticed that 45 threads were blocked by another thread.  At
the end of this message is the stacktrace for the thread blocking the
others.

What jumps out at me is the mention of a custom plugin for Solr in the
stacktrace.  Here's the method mentioned:

com.ril.solr.request.handler.RILMultipleQuerySearchHandler.handleRequestBody

Could there be a bug in that handler causing problems?  I can't say for
sure, but this IS something out of the ordinary.

This thread is blocking other threads, but it is itself blocked.  It
says it's waiting for 0x00007f24d881e000, but that entry does not appear
anywhere else in the thread dump, so I can't tell what's actually
blocking it.

I also tried putting the thread dump into this page:

https://spotify.github.io/threaddump-analyzer/

The analysis there says that qtp1205044462-196 is "inconsistent", and
the mouseover for that says the thread is blocked on "BLOCKED (on object
monitor)" without waiting for anything.

I can't find anything definitive, but it MIGHT be blocked by Java's own
internal operations: things like memory allocation, GC pauses, etc.

I noticed that the stacktrace includes mention of Solr's facet classes,
suggesting that these queries include facets or grouping.  Facets and
grouping can require very large amounts of heap, especially if the
fields being used do not have docValues enabled.  You may need to make
your heap larger so that there are fewer garbage collection events.
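The FieldCacheImpl frames in the stacktrace at the end of this message mean Solr is uninverting the field in heap at query time. A hedged schema.xml sketch of the docValues fix (the field name here is hypothetical, and changing docValues requires a full reindex):

```xml
<!-- Hypothetical facet/group field. With docValues="true", faceting
     and grouping read column-oriented data from the index files
     instead of building a FieldCache entry in heap.
     Requires a full reindex after the change. -->
<field name="category" type="string" indexed="true" stored="true"
       docValues="true"/>
```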

I have no idea what the custom search handler is going to do to Solr's
memory requirements.

Solr should create GC logs if you have started it with the included
script.  Can you share a GC log that includes the time the load testing
is underway and the problem appears?

Can you log onto the Linux server, run the "top" program (not htop or
any other variant), press shift-M to sort the list by memory, grab a
screenshot, and share it?
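For reference, a non-interactive equivalent of that, assuming a procps-ng top that supports the -o sort flag (on older versions, run top and press shift-M interactively instead):

```shell
# Batch mode (-b), single iteration (-n 1), sorted by resident
# memory -- the same view as pressing shift-M in interactive top.
top -b -n 1 -o %MEM | head -n 20
```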

Thanks,
Shawn


========================
"qtp1205044462-196" #196 prio=5 os_prio=0 tid=0x00007f2374002800
nid=0x5171 waiting for monitor entry [0x00007f24d881e000]
    java.lang.Thread.State: BLOCKED (on object monitor)
        at
org.apache.solr.uninverting.FieldCacheImpl$Cache.get(FieldCacheImpl.java:167)
        - locked <0x0000000642cd7568> (a java.util.WeakHashMap)
        at
org.apache.solr.uninverting.FieldCacheImpl.getTermsIndex(FieldCacheImpl.java:767)
        at
org.apache.solr.uninverting.FieldCacheImpl.getTermsIndex(FieldCacheImpl.java:747)
        at
org.apache.solr.uninverting.UninvertingReader.getSortedDocValues(UninvertingReader.java:319)
        at
org.apache.lucene.index.FilterLeafReader.getSortedDocValues(FilterLeafReader.java:448)
        at org.apache.lucene.index.DocValues.getSorted(DocValues.java:262)
        at
org.apache.lucene.search.grouping.term.TermGroupFacetCollector$SV.doSetNextReader(TermGroupFacetCollector.java:128)
        at
org.apache.lucene.search.SimpleCollector.getLeafCollector(SimpleCollector.java:33)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:659)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:472)
        at
org.apache.solr.request.SimpleFacets.getGroupedCounts(SimpleFacets.java:692)
        at
org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:476)
        at
org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:405)
        at
org.apache.solr.request.SimpleFacets.lambda$getFacetFieldCounts$0(SimpleFacets.java:803)
        at
org.apache.solr.request.SimpleFacets$$Lambda$207/1284277177.call(Unknown
Source)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at org.apache.solr.request.SimpleFacets$3.execute(SimpleFacets.java:742)
        at
org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:818)
        at
org.apache.solr.handler.component.FacetComponent.getFacetCounts(FacetComponent.java:330)
        at
org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:274)
        at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:296)
        at
com.ril.solr.request.handler.RILMultipleQuerySearchHandler.handleRequestBody(RILMultipleQuerySearchHandler.java:39)
        at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477)
        at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)
        at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529)
        at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
        at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)
        at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
        at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
        at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
        at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
        at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
        at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
        at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
        at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
        at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
        at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
        at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
        at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
        at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
        at
org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
        at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
        at org.eclipse.jetty.server.Server.handle(Server.java:534)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
        at
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
        at
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
        at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
        at
org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
        at
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
        at
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
        at
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
        at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
        at
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
        at java.lang.Thread.run(Thread.java:748)

    Locked ownable synchronizers:
        - None
========================