Solr Star Burst - SolrCloud Performance / Scale

Solr Star Burst - SolrCloud Performance / Scale

Mark Miller-3
I've always said I wanted to focus on performance and scale for SolrCloud, but for a long time that really just involved focusing on stability.

Now things have started to get pretty stable. Some things that made me cringe about SolrCloud no longer do in 7.3/7.4.

Weeks back I found myself yet again chasing spurious, ugly issues around fragile connections that cause recovery headaches and random request failures. Again I made a change that should bring big improvements, like many times before.

I've had just about enough of that. Just about enough of broken connection reuse. Just about enough of countless wasteful threads and connections lurking and creaking all over. Just about enough of poor single update performance and weaknesses in batch updates. Just about enough of the painful ConcurrentUpdateSolrClient.

So much inefficiency hiding in plain sight. Stuff I always thought we would overcome, but always far enough in the distance to keep me from feeling bad that I didn't know quite how we would get there. Solr was a container-agnostic web application before Solr 5, for god's sake. Even a relatively simple change like upgrading our HTTP client from version 3 to 4 was a huge amount of work for very incremental improvements.

If I'm going to be excited about this system after all these years all of that has to change.

I started looking into using HTTP/2 and a new HttpClient that can do non-blocking async IO requests.

I thought upgrading Apache HttpClient from 3 to 4 was long, tedious, and difficult. Going to a fully different client has made me reconsider that. I did a lot of the work, but a good amount remains (security, finish SSL, tuning ...).

I wrote a new Http2SolrClient that can replace HttpSolrClient and plug into CloudSolrClient and LBHttpSolrClient. I added some early async APIs. Non-blocking async IO is about as oversold as "schemaless", but it's a great tool to have available as well.
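
As a rough illustration of the non-blocking async pattern (sketched here with the JDK 11+ java.net.http client rather than the Jetty HttpClient the branch actually uses; the URL is just a placeholder):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class AsyncHttpSketch {
    public static void main(String[] args) {
        // One long-lived client; connections are pooled (and multiplexed
        // under HTTP/2) instead of being created per request.
        HttpClient client = HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_2)
                .build();

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8983/solr/admin/info/system"))
                .GET()
                .build();

        // sendAsync returns immediately; no thread blocks per in-flight
        // request, the callback fires when the response (or error) arrives.
        client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                .thenAccept(resp -> System.out.println("status: " + resp.statusCode()))
                .exceptionally(t -> { // no server running is fine for the sketch
                    System.out.println("request failed: " + t.getMessage());
                    return null;
                })
                .join(); // block here only because main() would otherwise exit
    }
}
```

The point is that the send returns immediately and the callback fires on completion, so a node can have many requests in flight without dedicating a blocked thread to each.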

I'm now working in a much more efficient world, aiming for 1 connection per CoreContainer per remote destination. Connections are no longer fragile. The transfer protocol is no longer text based.

Yonik should be pleased with the state of reordered updates from leader to replica.

I replaced our CUSC usage for distributing updates with Http2SolrClient and async calls.

I played with optionally using the async calls in the HttpShardHandler as well.

I replaced all HttpSolrClient usage with Http2SolrClient.

I started to get control of threads. I had control of connections.

I added early efficient external request throttling.
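
By throttling I mean capping concurrent in-flight requests rather than letting threads pile up. A minimal standalone sketch of the idea (illustrative names only, not the actual filter, which in the branch is along the lines of Jetty's QoSFilter):

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Illustrative concurrency throttle: at most maxConcurrent requests run
 *  at once; extras wait briefly, then get shed (HTTP 503-style). */
public class RequestThrottle {
    private final Semaphore permits;
    private final long waitMs;

    public RequestThrottle(int maxConcurrent, long waitMs) {
        this.permits = new Semaphore(maxConcurrent, true); // fair: FIFO waiters
        this.waitMs = waitMs;
    }

    /** Returns true if the request ran, false if it was shed. */
    public boolean handle(Runnable request) throws InterruptedException {
        if (!permits.tryAcquire(waitMs, TimeUnit.MILLISECONDS)) {
            return false; // over capacity: shed load instead of piling up threads
        }
        try {
            request.run();
            return true;
        } finally {
            permits.release();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        RequestThrottle throttle = new RequestThrottle(2, 50);
        boolean ran = throttle.handle(() -> System.out.println("handled"));
        System.out.println("ran: " + ran);
    }
}
```

The design choice is that overload turns into fast, explicit rejection at the edge rather than unbounded queues and threads inside the node.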

I started tuning resource pools.

I started removing sleep polling loops. They are horrible, and they slow tests down especially; we already have a replacement that we are hardly using.
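
For contrast, the shape of the fix (a bare-bones sketch; in real code the wake-up would come from a watcher or listener callback, not a hand-rolled latch):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class WaitNotPoll {
    public static void main(String[] args) throws InterruptedException {
        // Bad: sleep-poll. Every check wastes up to the poll interval in
        // latency, and tests pay that tax on every single wait:
        //   while (!ready()) Thread.sleep(500);

        // Better: block on a notification and wake immediately.
        CountDownLatch ready = new CountDownLatch(1);

        Thread worker = new Thread(() -> {
            // ... do the work the waiter cares about ...
            ready.countDown(); // wake waiters the instant we're done
        });
        worker.start();

        // Returns as soon as countDown fires, not on the next poll tick.
        boolean ok = ready.await(10, TimeUnit.SECONDS);
        System.out.println(ok ? "ready" : "timed out");
    }
}
```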

I did some other related stuff. I'm just fixing the main things I hate along these communication/resource-usage/scale/perf themes.

I'm calling this whole effort Star Burst: https://github.com/markrmiller/starburst

I've done a ton. Mostly very late at night, it's not all perfect yet, some of it may be exploratory. There is a lot to do to wrap it up with a bow. This touches a lot of spots, our surface area of features is just huge now.

Basically I have a high-performance Solr fork at the moment (only set up for tests, not actually running standalone Solr). I don't know how or when (or to be completely honest, if) it comes home. I'm going to do what I can, but it's likely to require more than me to be successful in a reasonable time frame.

I have a couple JIRA issues open for HTTP/2 and the new SolrClient.

Mark



Re: Solr Star Burst - SolrCloud Performance / Scale

Mark Miller-3
Some of the fallout of this should be huge improvements to our tests. Right now, some of them take so long because no one even notices when a change makes the situation worse, and it's hard to monitor resource usage as we develop when that usage is already fairly unbounded.

On master right now, on a lucky run (no tlog replica type, for sure), BasicDistributedZkTest takes 76 seconds on my 6-core machine from 2012. Depending on how hard test injection hits, I've seen a few minutes and anywhere in between.

Setting the tlog replica issue aside (I've disabled it for the moment, but I have fixed that issue by changing how distrib commits work), on the starburst branch, resource usage with multiple parallel tests running is going to be much, much better. For single cloud tests, performance is mostly about removing naive polling and carefree resource usage. The branch has big improvements for single and parallel tests already.

I don't know how much there is left to fix, but already, on starburst, BasicDistributedZkTest takes 45 seconds vs master's best case of 76.

- Mark

On Wed, May 30, 2018 at 1:52 PM Mark Miller <[hidden email]> wrote:

Re: Solr Star Burst - SolrCloud Performance / Scale

Varun Thacker-4
Hi Mark,

I've started glancing at the repo, and some of the issues you are addressing here will make things a lot more stable under high loads. I'll look at it in more detail in the coming days.

The key would be how to isolate the work into discrete chunks to then go and make JIRAs for. SOLR-12405 is the first thing that caught my eye that's an isolated JIRA and can be tackled without the HTTP/2 client, etc.

On Wed, May 30, 2018 at 4:13 PM, Mark Miller <[hidden email]> wrote:
Re: Solr Star Burst - SolrCloud Performance / Scale

Mark Miller-3


On Wed, May 30, 2018 at 10:18 PM Varun Thacker <[hidden email]> wrote:

Yeah, anything that does not depend on the Jetty HttpClient or HTTP/2 can likely be brought in independently.

The Http2SolrClient can also come in without HTTP/2 or replacing HttpSolrClient, and still offer non-blocking async IO as a new HTTP/1.1-capable user client.

I guess I have maybe 3 JIRA issues filed: Http2SolrClient w/ Jetty HttpClient, HTTP/2, and the QoSFilter. That covers the foundation.

As I have gained access to these features, though, all of a sudden it becomes easier to debug and solve other issues. I also learn and discover by pushing down the road. If I just very slowly put it in piece by piece and tried to pre-think out every step, the results would be pretty dreary. I would not be anywhere near the current state or have the same understanding of what still needs to be done. Like SolrCloud originally, the scope of change is just too large for standard procedure. We had to fork that too, and the merge back was huge and scary, but it also would have only been on master.

So I'll do what I can to keep the branch up to date, and we will have to pull off bite-sized pieces, with both HTTP/2 and the Jetty HttpClient just being big and invasive no matter what, but almost all for the better :)

As soon as anyone is ready to collaborate concretely on code, let me know and I'll finish getting a base set of tests passing and move the branch to Apache.

- Mark 
Re: Solr Star Burst - SolrCloud Performance / Scale

Mark Miller-3
> If I just very slowly put it in piece by piece and tried to pre-think out every step, the results would be pretty dreary.

To elaborate on that, there probably would not have been results from me :)

I almost quit in the middle of the Jetty HttpClient work. I relearned, six times over, every mistake I made trying to do the proxy the first time, and then made some new ones. The security and SSL parts are still going to take some grunt work.

I almost quit in the middle of HTTP/2. I hadn't signed up for this. But I was in too far by then, with too much invested.

By the time I got to the QoSFilter it was a nice change of pace, but it's just an early prototype.

It's one of those things that just doesn't happen until some idiot bites off more than he can chew. Painful to break up much initially, too general to pull in lots of paid devs, too much for one dev.

I've been hunting down thread pools and bad resource use in general as well (still clearing out sleeps, focusing on non-test code first, but some test code too). I'd like to get that in shape and then start enforcing checks and tests around it. A lot of that can probably come in independently.

- Mark

