Solr Prometheus Exporter failing with "Connection pool shut down" on large cluster

Alex Jablonski
Hi there!

My colleague and I have run into an issue that seems to appear when running the Solr Prometheus exporter in SolrCloud mode against a large (> 100 node) cluster. The symptoms we're observing are "connection pool shut down" exceptions in the logs and the inability to collect metrics from more than 100 nodes in the cluster.

We think we've traced the issue to lucene-solr/solr/contrib/prometheus-exporter/src/java/org/apache/solr/prometheus/scraper/SolrCloudScraper.java. In that class, hostClientCache is a cache of HttpSolrClients (currently with a fixed maximum size of 100) that closes the evicted client's connection whenever a client is evicted from the cache. hostClientCache is used in createHttpSolrClients to return a map of base URLs to HttpSolrClients.

Given, say, 300 base URLs, createHttpSolrClients will happily add those base URLs to the cache, and the cache's "get" method will happily return the new additions. But on adding the 101st HttpSolrClient, the first one gets evicted and closed. This repeats until the only open clients left are those for base URLs 201 through 300; clients for the first 200 base URLs are still returned, but they have already been closed. When we later use the result of createHttpSolrClients to collect metrics, expecting valid, open HttpSolrClients, any request through one of those already-closed clients fails, leading to the "Connection pool shut down" exception and to no metrics being collected from those nodes.
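To make that concrete, here's a simplified, self-contained sketch of the pattern as we understand it (not the exporter's exact code; the 300 made-up base URLs and the clientsByBaseUrl map are just for illustration):

import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.RemovalListener;

import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class EvictionSketch {
  public static void main(String[] args) throws Exception {
    // On eviction, the client is closed and its connection pool shuts down.
    RemovalListener<String, HttpSolrClient> closeOnEviction = notification -> {
      try {
        notification.getValue().close();
      } catch (IOException e) {
        // ignore: best-effort close
      }
    };

    Cache<String, HttpSolrClient> hostClientCache = CacheBuilder.newBuilder()
        .maximumSize(100) // fixed size, as in the current code
        .removalListener(closeOnEviction)
        .build();

    // Build a map of base URL -> client for 300 hypothetical nodes,
    // analogous to what createHttpSolrClients returns.
    Map<String, HttpSolrClient> clientsByBaseUrl = new LinkedHashMap<>();
    for (int i = 1; i <= 300; i++) {
      String baseUrl = "http://solr-" + i + ":8983/solr";
      clientsByBaseUrl.put(baseUrl,
          hostClientCache.get(baseUrl, () -> new HttpSolrClient.Builder(baseUrl).build()));
    }

    // clientsByBaseUrl still holds 300 client references, but only the most
    // recently added ~100 survive in the cache; the rest were evicted and
    // closed, so using them later throws "Connection pool shut down".
  }
}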

Our idea for a fix was to change the existing cache to use `expireAfterAccess` with a timeout that's a multiple of the scrape interval (twice the scrape interval?) instead of a fixed maximum size. We wanted to confirm a few things:

1. Has this issue been reported before, and if so, is there another fix in progress already?
2. Does this approach seem desirable?
3. If so, are there any opinions on what the cache timeout should be besides just double the scrape interval?

We'll also open a PR shortly with the changes we're proposing and link here. Please let me know if any of the above is unclear or incorrect.
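For concreteness, here's a rough sketch of the kind of change we have in mind, still assuming a Guava-based cache (SCRAPE_INTERVAL_SECONDS is just a placeholder for the configured scrape interval, and the multiplier is exactly the open question above):

import java.io.IOException;
import java.util.concurrent.TimeUnit;

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.RemovalListener;

import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class ProposedCacheSketch {

  // Placeholder: in the exporter this would come from the configured scrape interval.
  private static final long SCRAPE_INTERVAL_SECONDS = 60;

  static Cache<String, HttpSolrClient> buildHostClientCache() {
    RemovalListener<String, HttpSolrClient> closeOnEviction = notification -> {
      try {
        notification.getValue().close();
      } catch (IOException e) {
        // best-effort close of an expired client
      }
    };

    return CacheBuilder.newBuilder()
        // Time-based expiry instead of .maximumSize(100): a client is only closed
        // once it hasn't been used for a couple of scrape cycles.
        .expireAfterAccess(2 * SCRAPE_INTERVAL_SECONDS, TimeUnit.SECONDS)
        .removalListener(closeOnEviction)
        .build();
  }
}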

Thanks!
Alex Jablonski

Re: Solr Prometheus Exporter failing with "Connection pool shut down" on large cluster

Alex Jablonski

Thanks!
Alex Jablonski


Re: Solr Prometheus Exporter failing with "Connection pool shut down" on large cluster

Richard Goodman
Hi Alex,

This makes me really happy to see an email about this. I've been working for a little while on setting up the Prometheus exporter for our clusters. I spent a good amount of time setting up the config, and started getting some really decent graphs in Grafana for metrics we've never been able to collect before.

For our stage environment this worked like a charm, so I shortly rolled it out to our live environment. That's when I started to run into trouble.

I too was hitting the exact problem you were facing. I then decided to split out my config so I had one config dedicated to JVM metric collection, one dedicated to node-level metrics, and so on, but I was still getting loads of errors coming through, which confused me.

Our clusters are typically 96 nodes, so based on your report I'm not sure how I would be hitting this issue. One theory I had was timeouts on the core admin API: our indexes range between 5 GB and 20 GB each, and our clusters are typically tens of TB in total. Because of this, whenever there is a replica state change we see significant delays in /solr/admin/cores, which can sometimes take a few minutes to return.

Because of this, I think there is a strong connection to the core admin API being the problem here. The reason is that we have one unique cluster that typically stores 30 days' worth of data in its collections: when a new day comes along we create a collection for that day, and any collections older than 30 days get dropped. Documents in this cluster typically don't change either, so there's rarely any state change, which makes the cluster significantly more reliable for us, whereas our other main group of clusters goes through a significant amount of change every day.

I'm currently applying your patch to our build, and will deploy it and keep you updated on whether it helps. At the moment I'm also looking at whether there is a way to default indexInfo=false on the core admin API, which could help us here (using it makes the response time as fast as usual, but it does remove some statistics).
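As a rough illustration of the kind of request I mean, here's a hand-rolled SolrJ call (not anything the exporter does today; host and port are placeholders):

import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.GenericSolrRequest;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.util.NamedList;

public class CoreStatusWithoutIndexInfo {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
      ModifiableSolrParams params = new ModifiableSolrParams();
      params.set("action", "STATUS");
      params.set("indexInfo", "false"); // skip per-core index stats; much faster on large indexes
      NamedList<Object> status =
          client.request(new GenericSolrRequest(SolrRequest.METHOD.GET, "/admin/cores", params));
      System.out.println(status);
    }
  }
}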

That said, it's very experimental and I'm not sure it's the best approach, but you have to start somewhere, right?

I'd be keen to look into this issue with you, as it's been a problem for us also.

I'll reply again with any results I find from applying your patch.

Cheers,

--

Richard Goodman    |    Data Infrastructure engineer


Re: Solr Prometheus Exporter failing with "Connection pool shut down" on large cluster

Alex Jablonski
Hey Richard,

I'd definitely love to hear whether this improves things for you. According to Guava's documentation, the cache can start evicting items when it gets close to the limit (https://github.com/google/guava/wiki/CachesExplained#size-based-eviction), not just when it reaches it, so if this does end up helping you out, that could be the reason. I haven't dug into the implementation of "close to the maximum" that Guava's cache uses, or whether that would happen in the course of building up the maps to ping all of the nodes (which is where the issue for us seemed to be), but it's at least a possible avenue to explore further.
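If you want to poke at that behaviour directly, a quick standalone probe along these lines (plain Guava, nothing exporter-specific) should show whether evictions fire before the cache ever holds the full 100 entries:

import java.util.concurrent.atomic.AtomicInteger;

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.RemovalListener;

public class NearLimitEvictionProbe {
  public static void main(String[] args) {
    AtomicInteger evictions = new AtomicInteger();
    RemovalListener<String, String> countEvictions = notification -> evictions.incrementAndGet();

    Cache<String, String> cache = CacheBuilder.newBuilder()
        .maximumSize(100)
        .removalListener(countEvictions)
        .build();

    // Insert exactly maximumSize entries; if Guava evicts "close to the limit",
    // some removal notifications will fire even though we never exceed 100 keys.
    for (int i = 0; i < 100; i++) {
      cache.put("host-" + i, "client-" + i);
    }
    cache.cleanUp(); // flush any pending eviction notifications

    System.out.println("evictions while inserting 100 entries: " + evictions.get());
  }
}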

In any case, thanks for trying this out!


Re: Solr Prometheus Exporter failing with "Connection pool shut down" on large cluster

Richard Goodman
Hey Serj,

So I've applied your patch, deployed the latest build of the solr-prometheus exporter, and left it running for a little bit. It's had some interesting results.

As I mentioned, I split my single monolithic config into separate configs and have an exporter instance running for each one. They are split by the groups defined in the /solr/admin/metrics endpoint, as follows:
  • Core 
  • Node
  • JVM
  • Jetty
I was getting the connection pool shut down error for every single type; however, this is no longer the case.
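For reference, each of those per-group instances essentially boils down to querying the metrics endpoint with a group filter; a hand-rolled SolrJ equivalent, with host and port as placeholders, looks roughly like this:

import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.GenericSolrRequest;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.util.NamedList;

public class MetricsGroupProbe {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
      ModifiableSolrParams params = new ModifiableSolrParams();
      params.set("group", "jvm"); // one exporter instance per group: core, node, jvm, jetty
      NamedList<Object> metrics =
          client.request(new GenericSolrRequest(SolrRequest.METHOD.GET, "/admin/metrics", params));
      System.out.println(metrics);
    }
  }
}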

Core
Whilst I am no longer getting "Connection pool shut down", I am now getting errors like:
ERROR - 2019-11-22 16:44:17.522; org.apache.solr.prometheus.scraper.SolrScraper; failed to request: /admin/metrics Timeout occured while waiting response from server at: http://127.0.0.1:8085/solr

Because of this, I am now no longer collecting any core metrics at all. I've yet to figure out where in the code this error is reported, and whether anything can help with it.

Node
I did get an NPE that I've never seen before; however, it appeared to be a one-off, and this instance is now also reporting zero connection pool shut down issues, which I was getting previously. I'll attach a dump of the node NPE below as an "appendix".

JVM
This seems to completely resolve the issue; I have not been getting it any more.

Jetty
Unfortunately, here I am still getting connection pool shut down errors; again, I'll include a dump of the error in the "appendix".

Whilst this doesn't appear to completely fix the issue, it's definitely a step in the right direction. As I said, for Node and JVM I am no longer getting this and have seen an improvement in the number of metrics being collected, but it looks like there is still some work we can do.

--------
Appendix

Node NPE Error
ERROR - 2019-11-22 16:46:38.594; org.apache.solr.prometheus.scraper.SolrScraper; failed to request: /admin/metrics Timeout occured while waiting response from server at: http://127.0.0.1:8083/solr
WARN  - 2019-11-22 16:46:59.105; org.apache.solr.prometheus.scraper.Async; Error occurred during metrics collection
java.util.concurrent.ExecutionException: java.lang.NullPointerException
        at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) ~[?:1.8.0_141]
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) ~[?:1.8.0_141]
        at org.apache.solr.prometheus.scraper.Async.lambda$null$1(Async.java:45) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
        at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) ~[?:1.8.0_141]
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[?:1.8.0_141]
        at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374) ~[?:1.8.0_141]
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[?:1.8.0_141]
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[?:1.8.0_141]
        at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) ~[?:1.8.0_141]
        at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) ~[?:1.8.0_141]
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_141]
        at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) ~[?:1.8.0_141]
        at org.apache.solr.prometheus.scraper.Async.lambda$waitForAllSuccessfulResponses$3(Async.java:43) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
        at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870) ~[?:1.8.0_141]
        at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852) ~[?:1.8.0_141]
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_141]
        at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1595) ~[?:1.8.0_141]
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209) ~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_141]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_141]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]
Caused by: java.lang.NullPointerException
        at org.apache.solr.prometheus.scraper.SolrScraper.request(SolrScraper.java:112) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
        at org.apache.solr.prometheus.scraper.SolrCloudScraper.lambda$metricsForAllHosts$6(SolrCloudScraper.java:121) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
        at org.apache.solr.prometheus.scraper.SolrScraper.lambda$null$0(SolrScraper.java:81) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
        at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590) ~[?:1.8.0_141]
        ... 4 more


Jetty Connection Pool Shut down
WARN  - 2019-11-22 16:41:57.505; org.apache.solr.prometheus.scraper.Async; Error occurred during metrics collection
java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Connection pool shut down
        at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) ~[?:1.8.0_141]
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) ~[?:1.8.0_141]
        at org.apache.solr.prometheus.scraper.Async.lambda$null$1(Async.java:45) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
        at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) ~[?:1.8.0_141]
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[?:1.8.0_141]
        at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374) ~[?:1.8.0_141]
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[?:1.8.0_141]
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[?:1.8.0_141]
        at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) ~[?:1.8.0_141]
        at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) ~[?:1.8.0_141]
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_141]
        at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) ~[?:1.8.0_141]
        at org.apache.solr.prometheus.scraper.Async.lambda$waitForAllSuccessfulResponses$3(Async.java:43) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
        at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870) ~[?:1.8.0_141]
        at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852) ~[?:1.8.0_141]
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_141]
        at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1595) ~[?:1.8.0_141]
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209) ~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_141]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_141]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]
Caused by: java.lang.IllegalStateException: Connection pool shut down
        at org.apache.http.util.Asserts.check(Asserts.java:34) ~[httpcore-4.4.10.jar:4.4.10]
        at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:191) ~[httpcore-4.4.10.jar:4.4.10]
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.requestConnection(PoolingHttpClientConnectionManager.java:267) ~[httpclient-4.5.6.jar:4.5.6]
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:176) ~[httpclient-4.5.6.jar:4.5.6]
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) ~[httpclient-4.5.6.jar:4.5.6]
        at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[httpclient-4.5.6.jar:4.5.6]
        at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[httpclient-4.5.6.jar:4.5.6]
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[httpclient-4.5.6.jar:4.5.6]
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[httpclient-4.5.6.jar:4.5.6]
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[httpclient-4.5.6.jar:4.5.6]
        at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:542) ~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:255) ~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:244) ~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
        at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1260) ~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
        at org.apache.solr.prometheus.scraper.SolrScraper.request(SolrScraper.java:102) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
        at org.apache.solr.prometheus.scraper.SolrCloudScraper.lambda$metricsForAllHosts$6(SolrCloudScraper.java:121) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
        at org.apache.solr.prometheus.scraper.SolrScraper.lambda$null$0(SolrScraper.java:81) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
        at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590) ~[?:1.8.0_141]
        ... 4 more
INFO  - 2019-11-22 16:41:57.657; org.apache.solr.prometheus.collector.SchedulerMetricsCollector; Completed metrics collection


Re: Solr Prometheus Exporter failing with "Connection pool shut down" on large cluster

Richard Goodman
Sorry, I meant to say, Hey Alex

It's because this time around I did reply all, and Serj's name was highlighted in my email!
