BadApple


BadApple

Erick Erickson
With all the discussion about how broken Solr is, I’m not sure what the value of these is at this point, but they’re quick to do, so what the heck. I probably won’t be changing annotations until that discussion reaches a conclusion, although I’ll continue to send these out.

********Failures in Hoss' reports for the last 4 rollups.

There were 139 unannotated tests that failed in Hoss' rollups. Ordered by the date I downloaded the rollup file, newest->oldest. See above for the dates the files were collected.
These tests were NOT BadApple'd or AwaitsFix'd


Failures in the last 4 reports..
   Report   Pct     runs    fails           test
     0123   3.5      912     48      BasicAuthIntegrationTest.testBasicAuth
     0123   1.2      934     10      DimensionalRoutedAliasUpdateProcessorTest.testCatTime
     0123   0.6      934     12      DimensionalRoutedAliasUpdateProcessorTest.testTimeCat
     0123   3.4      939     71      LegacyCloudClusterPropTest.testCreateCollectionSwitchLegacyCloud
     0123   0.6      906      6      MathExpressionTest.testGammaDistribution
     0123   7.1       86      5      ShardSplitTest.testSplitWithChaosMonkey
     0123   6.1      922     48      TestCloudSearcherWarming.testRepFactor1LeaderStartup
     0123   6.1      936     75      TestModelManagerPersistence.testFilePersistence
     0123   5.2      936     74      TestModelManagerPersistence.testWrapperModelPersistence
     0123   2.8      903     20      TestSkipOverseerOperations.testSkipLeaderOperations
************ Will BadApple all tests above this line except ones listed at the top**************




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

HOSS-2019-11-04.csv (8K) Download Attachment

Re: BadApple

Mark Miller-3
You can still make progress this way. For example, you can fix most of those auth-type tests by not closing connections in SolrDispatchFilter, by waiting for collection creates, and maybe with some kind of ZK watcher on the security file.

You can still make progress this way, you just probably won’t ever get to the end of the road, per se.

On Mon, Nov 4, 2019 at 6:48 AM Erick Erickson <[hidden email]> wrote:
With all the discussion about how broken Solr is, I’m not sure what the value of these is at this point, but they’re quick to do, so what the heck. I probably won’t be changing annotations until that discussion reaches a conclusion, although I’ll continue to send these out.


Re: BadApple

Mark Miller-3
Like, one of the biggest things those auth guys do is just a raw response sendError. Override our response where it does the close shield, and make it so those sendErrors actually just send a normal response format back to Solr with the proper error code, instead of doing a raw sendError and closing the connection. Just that will help those security tests a lot.
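
A minimal sketch of that idea, under loud assumptions: `SimpleResponse`, `StructuredErrorResponse` and `RecordingResponse` are hypothetical stand-ins invented for illustration, not Solr or servlet classes, and the JSON body shape is made up. It only shows the pattern of a wrapper that turns the raw error path into an ordinary response carrying the same status code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, minimal response abstraction (a stand-in for a servlet response).
interface SimpleResponse {
    void setStatus(int code);
    void setHeader(String name, String value);
    void write(String text);
    void sendError(int code, String message); // raw error path, assumed to close the connection
}

// Wrapper that converts sendError into a normal response with the same status code.
final class StructuredErrorResponse implements SimpleResponse {
    private final SimpleResponse delegate;

    StructuredErrorResponse(SimpleResponse delegate) { this.delegate = delegate; }

    public void setStatus(int code) { delegate.setStatus(code); }
    public void setHeader(String name, String value) { delegate.setHeader(name, value); }
    public void write(String text) { delegate.write(text); }

    public void sendError(int code, String message) {
        // Keep the status code, write an ordinary body, never touch the raw error path.
        delegate.setStatus(code);
        delegate.setHeader("Content-Type", "application/json");
        delegate.write("{\"error\":{\"code\":" + code + ",\"msg\":\"" + escape(message) + "\"}}");
    }

    private static String escape(String s) {
        return s == null ? "" : s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}

// Recording implementation, used only to observe what the wrapper does.
final class RecordingResponse implements SimpleResponse {
    int status;
    boolean rawErrorSent;
    final StringBuilder body = new StringBuilder();
    final Map<String, String> headers = new LinkedHashMap<>();

    public void setStatus(int code) { status = code; }
    public void setHeader(String name, String value) { headers.put(name, value); }
    public void write(String text) { body.append(text); }
    public void sendError(int code, String message) { status = code; rawErrorSent = true; }
}
```

With a real servlet response the same shape would presumably be a response wrapper that overrides `sendError`, but that wiring is not shown here.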

On Wed, Nov 6, 2019 at 9:19 AM Mark Miller <[hidden email]> wrote:
You can still make progress this way. For example, you can fix most of those auth-type tests by not closing connections in SolrDispatchFilter, by waiting for collection creates, and maybe with some kind of ZK watcher on the security file.

You can still make progress this way, you just probably won’t ever get to the end of the road, per se.


Re: BadApple

Mark Miller-3
I mean honestly, if you just ignore me and fix the smaller list of critical things that affect pretty much everything, that will make the system at least look like it’s almost good.

On Wed, Nov 6, 2019 at 9:22 AM Mark Miller <[hidden email]> wrote:
Like, one of the biggest things those auth guys do is just a raw response sendError. Override our response where it does the close shield, and make it so those sendErrors actually just send a normal response format back to Solr with the proper error code, instead of doing a raw sendError and closing the connection. Just that will help those security tests a lot.


Re: BadApple

Mark Miller-3
Really though, if you want to fix tests, start fixing the performance bottlenecks.

Like being able to say solr.Class or just class in configs. That costs you your life, especially if you use more than one core. That speed trap alone hides enough bugs and grubs to make Pumbaa happy.
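
One way to read this, offered as an assumption rather than a description of Solr's loader: resolving a short name means trying several candidate packages, and every miss costs a failed class lookup. If that is the cost, caching the resolved class per short name pays for it only once. `ShortNameResolver` and its package list are hypothetical names used for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical resolver: tries each package prefix in order and caches the first hit.
final class ShortNameResolver {
    private final String[] packagePrefixes;
    private final Map<String, Class<?>> cache = new ConcurrentHashMap<>();
    private final AtomicInteger attempts = new AtomicInteger(); // Class.forName calls made so far

    ShortNameResolver(String... packagePrefixes) { this.packagePrefixes = packagePrefixes; }

    Class<?> resolve(String shortName) {
        Class<?> cached = cache.get(shortName);
        if (cached != null) return cached; // fast path: no class loading at all
        for (String prefix : packagePrefixes) {
            attempts.incrementAndGet();
            try {
                Class<?> found = Class.forName(prefix + shortName);
                cache.put(shortName, found);
                return found;
            } catch (ClassNotFoundException miss) {
                // Not in this package; fall through to the next candidate.
            }
        }
        throw new IllegalArgumentException("Cannot resolve class: " + shortName);
    }

    int attempts() { return attempts.get(); }
}
```

For example, a resolver built with the prefixes `"java.lang."` and `"java.util."` needs two lookups the first time it resolves `ArrayList` and none on later calls.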

On Wed, Nov 6, 2019 at 9:31 AM Mark Miller <[hidden email]> wrote:
I mean honestly, if you just ignore me and fix the smaller list of critical things that affect pretty much everything, that will make the system at least look like it’s almost good.


Re: BadApple

Mark Miller-3
Just like go look. Like your dog is limping and you check it out. You will find fleas everywhere first, and then ticks, and then maybe some infection. And it won't be hard at all. Just you caring for your dog and checking him out cause he's limping. You will end up eating days at the vet, huge bills, but it's no mind game to get there.

On Wed, Nov 6, 2019 at 9:42 AM Mark Miller <[hidden email]> wrote:
Really though, if you want to fix tests, start fixing the performance bottlenecks.

Like being able to say solr.Class or just class in configs. That costs you your life, especially if you use more than one core. That speed trap alone hides enough bugs and grubs to make Pumbaa happy.


Re: BadApple

Mark Miller-3
Another easy, powerful one.

When you wait for state, you have to consider all the states you can actually see.

So like, if you are waiting for replicas and the shard is not even there yet, you have to check getSlices==null and then keep waiting. You can't miss little things when waiting for the state you want to see. Go fix that and, again, the return will be huge for the effort.
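
A minimal sketch of such a wait, assuming simplified stand-in types: `CollectionView`, `ShardView` and `StateWaiter` are hypothetical and not Solr's cluster-state classes. The point is only that every "not there yet" shape (no collection, null slices, too few shards, too few replicas) means "keep waiting" rather than an error or a false positive.

```java
import java.util.Map;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical snapshot of one shard.
final class ShardView {
    final int activeReplicas;
    ShardView(int activeReplicas) { this.activeReplicas = activeReplicas; }
}

// Hypothetical snapshot of one collection; slices may be null while it is being created.
final class CollectionView {
    final Map<String, ShardView> slices;
    CollectionView(Map<String, ShardView> slices) { this.slices = slices; }
}

final class StateWaiter {
    // True only once every expected shard exists and has enough active replicas.
    static Predicate<CollectionView> allShardsActive(int expectedShards, int expectedReplicas) {
        return view -> {
            if (view == null || view.slices == null) return false;   // nothing published yet
            if (view.slices.size() < expectedShards) return false;   // shards still missing
            for (ShardView shard : view.slices.values()) {
                if (shard == null || shard.activeReplicas < expectedReplicas) return false;
            }
            return true;
        };
    }

    // Polls until the predicate holds or the timeout elapses; returns false on timeout or interrupt.
    static boolean waitFor(Supplier<CollectionView> state, Predicate<CollectionView> ready,
                           long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (ready.test(state.get())) return true;
            if (System.nanoTime() >= deadline) return false;
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

Polling is used here only to keep the sketch short; a watcher-driven wait would apply the same predicate to each state update.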

On Wed, Nov 6, 2019 at 10:34 AM Mark Miller <[hidden email]> wrote:
Just like go look. Like your dog is limping and you check it out. You will find fleas everywhere first, and then ticks, and then maybe some infection. And it won't be hard at all. Just you caring for your dog and checking him out cause he's limping. You will end up eating days at the vet, huge bills, but it's no mind game to get there.


Re: BadApple

Mark Miller-3
This BadApple stuff would have more value after more valuable work though.

I can't stress it enough - you have to make this fast to fix it.

I'll give you some more items to consider:

* Our XML parsing is deathly slow and blocking. All blocking stuff when cores start is death to multicore. You can use a non-blocking, modern, fast parser to parse our docs and config.
* You can also find various statics that are expensive to init and block - moving some of those to init right away can help multicore a lot as well. Getting multicore to be better than deathly slow is a big, big help for finding stuff.
* Making the encryption key stuff is slow and blocks - don't make it for every test and every core when it's not needed.
* The metrics stuff is sllooow at startup and shutdown. Do that stuff in parallel.
* SolrCoreState has issues where it doesn't always clean up - I think that can hurt reload the most.
* Reload has lots of holes, especially in failure cases. I don't know - make more tests.
* CoreAware stuff and listeners can be multithreaded - all that being single-threaded is no good - like, modern hardware, man.
* Most of the stuff people get wrong can be pulled into easy-to-use util classes.
* We need to allow Jetty time to stop for good startup and shutdown - you have to fix other stuff first - things like the Overseer make shutdown a nightmare in tests.
* With the current Overseer it's best to reorg tests to try and shut it down last. I know this sucks, fix that too.
* One small help: the system doesn’t properly wait on shutdown, like it tries to, for the Overseer to run its queue.
* A lot of close and shutdown is slow, in the wrong order, and gnarly.
* We need to have a cluster shutdown if it will ever actually be clean - how about writing to a control znode to trigger it?
* How about creating our znodes for a cluster up front in like an install process? Right now there are many races around this. Often (or more than often?) the config you specify in tests is not the one you think.
* We throw a lot of already-closed exceptions and stuff where we should not - this is to get around our broken shutdown - they are bad, so fix shutdown and remove them - they should usually only exist where something is trying to start a resource, not use it.
* There are also concurrency issues in SolrCores. Plus I'd speed a lot of that locking up. There are searcher leaks in SolrCore as well.

hmmm... lots more, but even that is a nice dent. Mostly make things fast, the tests will start to whisper the secrets.
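
To make the "do that stuff in parallel" item concrete, here is a rough sketch using only the JDK. `ParallelCloser` is a hypothetical helper, not an existing Solr class, and it assumes the resources being closed are independent of each other; anything with ordering constraints (the Overseer case above, for instance) would still need to be closed separately.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper that closes independent resources concurrently.
final class ParallelCloser {
    // Returns the number of resources whose close() threw.
    static int closeAll(List<? extends AutoCloseable> resources, int threads) {
        if (resources.isEmpty()) return 0;
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, threads));
        try {
            List<Callable<Boolean>> tasks = new ArrayList<>();
            for (AutoCloseable resource : resources) {
                tasks.add(() -> {
                    try {
                        resource.close();
                        return true;
                    } catch (Exception e) {
                        return false; // record the failure, keep closing the rest
                    }
                });
            }
            int failures = 0;
            // invokeAll blocks until every close() has finished.
            for (Future<Boolean> result : pool.invokeAll(tasks)) {
                try {
                    if (!result.get()) failures++;
                } catch (ExecutionException e) {
                    failures++;
                }
            }
            return failures;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return resources.size(); // interrupted: conservatively report everything as failed
        } finally {
            pool.shutdown();
        }
    }
}
```

The total time then tends toward the slowest single close rather than the sum of all of them, which is the effect the list above is after.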

Re: BadApple

Mark Miller-3
Let's see, as long as it's top of mind...

SolrDispatchFilter should wait for a core if it's loading.
Proxying remote requests can be crazy - maybe less crazy if you fix other things, but see my starburst branch (that's missing so much good stuff :( ) for a better impl that uses HTTP/2.
Get all our SolrCloud tests off that SolrTestCaseJ4 (or whatever it's called) base class. It wasn't designed for that, and it causes all sorts of little issues.
The way we track objects in maps in SolrResourceLoader - there is a nasty bug where we use the wrong collection field name - but that concurrency is also slow.
ZkSolrResourceLoader SHOULD NOT fall back to SolrResourceLoader - a lot of this type of crap also hides bugs.
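
For the first item, a hedged sketch of what "wait for a core if it's loading" could look like. `LoadingCoreRegistry` and its method names are hypothetical and are not Solr's CoreContainer API; the idea is just that a request can tell "still loading" apart from "unknown" and block for a bounded time in the first case.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical registry keyed by core name; C is whatever type represents a loaded core.
final class LoadingCoreRegistry<C> {
    private final Map<String, CompletableFuture<C>> cores = new ConcurrentHashMap<>();

    // Called when loading starts, so later requests know the core exists but is not ready.
    void markLoading(String name) {
        cores.putIfAbsent(name, new CompletableFuture<>());
    }

    // Called when loading finishes; wakes up any request waiting on this core.
    void markLoaded(String name, C core) {
        cores.computeIfAbsent(name, n -> new CompletableFuture<>()).complete(core);
    }

    // Returns the core, waiting up to timeoutMs if it is still loading.
    // Empty means the core is unknown, failed to load, or did not load in time.
    Optional<C> awaitCore(String name, long timeoutMs) {
        CompletableFuture<C> pending = cores.get(name);
        if (pending == null) return Optional.empty();
        try {
            return Optional.ofNullable(pending.get(timeoutMs, TimeUnit.MILLISECONDS));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        } catch (ExecutionException | TimeoutException e) {
            return Optional.empty();
        }
    }
}
```

A request handler would call `awaitCore` with a short timeout instead of failing immediately when the core is not yet registered as loaded.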

On Wed, Nov 6, 2019 at 11:48 AM Mark Miller <[hidden email]> wrote:
This BadApple stuff would have more value after more valuable work though.


Re: BadApple

Mark Miller-3
Some of these are hard to trigger without doing a lot of other things. Like, you have to make the Overseer much faster. Much as I dislike that thing, you can make it much, much, much faster as it is, and that will help. Many, many bugs hide because we crawl.

On Wed, Nov 6, 2019 at 12:19 PM Mark Miller <[hidden email]> wrote:
Let's see, as long as it's top of mind...

SolrDispatchFilter should wait for a core if it's loading.
Proxying remote requests can be crazy - maybe less crazy if you fix other things, but see my starburst branch (that's missing so much good stuff :( ) for a better impl that uses HTTP/2.
Get all our SolrCloud tests off that SolrTestCaseJ4 (or whatever it's called) base class. It wasn't designed for that, and it causes all sorts of little issues.
The way we track objects in maps in SolrResourceLoader - there is a nasty bug where we use the wrong collection field name - but that concurrency is also slow.
ZkSolrResourceLoader SHOULD NOT fall back to SolrResourceLoader - a lot of this type of crap also hides bugs.
