Solr - very long core admin timeouts


Solr - very long core admin timeouts

Richard Goodman
Hi there,

I went to a Solr meetup in London a month or two ago and asked this question there; people weren't entirely sure what the underlying issue was, but were curious to find out.

- Background of our tech stack

Where I work, we run just under 30 Solr clusters in our live environment:
  • Most of our clusters have an upper capacity limit of around 100TB; typical used capacity per cluster averages around 55%.
  • Most of our clusters run 96 instances, if not more.
  • Each instance has 1TB of storage.
  • Most of our clusters have 322 collections, with each collection split into 16 shards at 3-way replication.
  • Each core is 20GB on average, with some going up to 40GB.
  • We are currently running Solr 7.7.2.

We've been dealing with an issue running SolrCloud at this scale for quite some time. While it has never caused a catastrophic failure, it has become quite a blocker now that we want to improve our monitoring.

- The issue

We use the core admin API a lot to get core-level metrics. However, if any cores are recovering, the API locks up and slows down massively, sometimes taking minutes to return a response.

I worked on deploying the Solr Prometheus exporter to give us visibility into our clusters; however, because of the core admin timeouts, the application reports tons of errors and I am unable to collect core-level metrics. Similarly, if we use the Metrics API for node-level and core-level metrics, it takes a very long time to return results.
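As a stopgap on our side, one option is to bound each core admin scrape with a client-side timeout, so a single recovering core can't stall a whole collection cycle. A minimal sketch (the URL and timeout value are placeholders, not taken from the exporter itself):

```python
import urllib.request

def scrape_cores(url, timeout=5.0):
    """Fetch CoreAdmin status with a hard timeout.

    A recovering core can hold the call for minutes; bounding the
    request lets a monitoring loop skip this scrape instead of hanging.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8")
    except OSError:  # covers URLError, connection refused, and timeouts
        return None  # treat a slow/unreachable node as "no data this cycle"
```

This doesn't fix the underlying lock, but it keeps the scraper itself responsive.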

- Any findings so far

We discovered that if you supply "&indexInfo=false" to the core admin API call, it returns instantly, though of course you then miss some information that is quite useful. We have potentially identified where in a jstack the issue is occurring; as quoted from a colleague:

"Had a look at the code from my stack trace, as we expected it's waiting for the index to be opened to populate the info about numDocs etc, for the Core admin API call"
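For monitoring calls that don't need the per-index stats, the fast path can be wired in by building the status URL with that flag set. A small sketch (host and port are placeholders):

```python
from urllib.parse import urlencode

def core_status_url(host="localhost", port=8080, index_info=False):
    """Build a CoreAdmin STATUS URL.

    indexInfo=false skips opening a searcher per core, which is what
    makes the call return instantly even while cores are recovering.
    """
    params = {"action": "STATUS", "wt": "json"}
    if not index_info:
        params["indexInfo"] = "false"
    return f"http://{host}:{port}/solr/admin/cores?{urlencode(params)}"
```

The trade-off is losing numDocs, segment, and size information from the response.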

However, we're not entirely sure what the best approach to fixing this is. We were also curious whether other people are seeing a similar issue.

- Evidence

I created a small dev cluster in our tech stack and loaded a 30GB backup, where each core is around 2GB on average. I then timed a curl of the core admin API:

Call before a replica being added:
root @ rich-solr0.dev-cluster.net ~ # time curl -s "http://localhost:8080/solr/admin/cores" > core_admin_pre.out
real	0m0.070s
user	0m0.008s
sys	0m0.008s

Call during a replica being added:
root @ rich-solr1.dev-cluster.net ~ # time curl -s "http://localhost:8080/solr/admin/cores" > core_admin_during.out
real	0m8.130s
user	0m0.016s
sys	0m0.008s

Call after a replica being added:
root @ rich-solr1.dev-cluster.net ~ # time curl -s "http://localhost:8080/solr/admin/cores" > core_admin_done.out
real	0m0.037s
user	0m0.008s
sys	0m0.012s

I know the difference here is only 8 seconds, but that is with 2GB cores; in our live environment, where cores are much larger, the delay is significantly worse.

I have also collected a number of jstacks taken before, during, and after this occurs, and a colleague pointed to the following as a possible root cause:

"qtp511717113-1006" #1006 prio=5 os_prio=0 tid=0x00007f465800a000 nid=0xc709 in Object.wait() [0x00007f6b2f954000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:502)
        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2206)
        - locked <0x00000004d1d75af0> (a java.lang.Object)
        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1994)
        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1868)
        at org.apache.solr.handler.admin.CoreAdminOperation.getCoreStatus(CoreAdminOperation.java:339)
        at org.apache.solr.handler.admin.StatusOp.execute(StatusOp.java:46)
        at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:360)
        at org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:396)
        at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:180)

The stacktraces can be found here.

If anyone has any advice, that would be greatly appreciated.

Cheers,
--

Richard Goodman    


Re: Solr - very long core admin timeouts

david.w.smiley@gmail.com
Thanks for sharing this with us and debugging this reasonably far.

If this happens while a replica is being added, then this makes sense to me, because the replica isn't ready yet to respond with a searcher. If the replica is in recovery, then maybe don't ask for the indexInfo?
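One way to read that suggestion, on the client side: gate the indexInfo request on the replica's published state. A sketch (the state strings match those Solr reports in cluster status; the helper itself is hypothetical):

```python
# Replica states under which SolrCore.getSearcher() may block the
# status call; for these, request the cheap status (indexInfo=false).
BLOCKING_STATES = {"recovering", "down", "recovery_failed"}

def want_index_info(replica_state):
    """Only ask for index info when the replica can serve a searcher."""
    return replica_state.lower() == "active"
```

This keeps full stats for healthy replicas while avoiding the blocking path for the rest.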

I can see room for improvement in Solr here. The core admin stats don't actually need an IndexSearcher; they need a DirectoryReader, which is much cheaper to get. Even SolrCore.getIndexSize creates/closes one on the fly. I filed an issue:

~ David Smiley
Apache Lucene/Solr Search Developer


On Fri, Mar 13, 2020 at 8:50 AM Richard Goodman <[hidden email]> wrote: