Intermittent NullPointerException in ZkController on Solr 8.x

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Intermittent NullPointerException in ZkController on Solr 8.x

Colvin Cowie
Hello,

I've only seen this very occasionally on our automated tests so I've never really dug into it.
Running on Solr 8.3.1

2020-05-19 03:44:40.220 INFO  (main) [   ] o.a.s.c.ZkContainer Zookeeper client=XXXXXXXXX:9983/xxxx_cluster
2020-05-19 03:44:40.238 INFO  (main) [   ] o.a.s.c.c.SolrZkClient Using ZkCredentialsProvider: xxxxxxx.zookeeper.auth.internal.EncodedZkCredentialsProvider
2020-05-19 03:44:40.241 INFO  (main) [   ] o.a.s.c.c.ConnectionManager Waiting for client to connect to ZooKeeper
2020-05-19 03:44:40.245 INFO  (zkConnectionManagerCallback-11-thread-1) [   ] o.a.s.c.c.ConnectionManager zkClient has connected
2020-05-19 03:44:40.245 INFO  (main) [   ] o.a.s.c.c.ConnectionManager Client is connected to ZooKeeper
2020-05-19 03:44:40.359 INFO  (main) [   ] o.a.s.c.c.ConnectionManager Waiting for client to connect to ZooKeeper
2020-05-19 03:44:40.361 INFO  (zkConnectionManagerCallback-13-thread-1) [   ] o.a.s.c.c.ConnectionManager zkClient has connected
2020-05-19 03:44:40.361 INFO  (main) [   ] o.a.s.c.c.ConnectionManager Client is connected to ZooKeeper
2020-05-19 03:44:40.417 INFO  (main) [   ] o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (0) -> (1)
2020-05-19 03:44:56.606 INFO  (zkCallback-12-thread-2) [   ] o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (1) -> (0)
2020-05-19 03:44:56.614 ERROR (main) [   ] o.a.s.s.SolrDispatchFilter Could not start Solr. Check solr/home property and the logs
2020-05-19 03:44:56.639 ERROR (main) [   ] o.a.s.c.SolrCore null:java.lang.NullPointerException
at org.apache.solr.cloud.ZkController.lambda$registerLiveNodesListener$10(ZkController.java:1020)
at org.apache.solr.common.cloud.ZkStateReader.registerLiveNodesListener(ZkStateReader.java:880)
at org.apache.solr.cloud.ZkController.registerLiveNodesListener(ZkController.java:1035)
at org.apache.solr.cloud.ZkController.init(ZkController.java:917)
at org.apache.solr.cloud.ZkController.<init>(ZkController.java:473)
at org.apache.solr.core.ZkContainer.initZooKeeper(ZkContainer.java:115)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:631)
at xxxxxxx.solr.servlet.RobustSolrDispatchFilter.createCoreContainer(RobustSolrDispatchFilter.java:71)

I couldn't find any bug reports in JIRA for the NPE.

Here's the full log


byte[] json = Utils.toJSON(Collections.singletonMap("timestamp", cloudManager.getTimeSource().getEpochTimeNs()));
so I don't know whether it's the cloudManager or the time source that's null.
That bit of the ZkController was added by https://issues.apache.org/jira/browse/SOLR-13072 and I see it is only hit if
zkStateReader.getAutoScalingConfig().hasTriggerForEvents(TriggerEventType.NODELOST);

We have never (knowingly) configured autoscaling and we don't use it, but I see the autoscaling files are present in ZK. Is the autoscaling.json etc created by default when it is absent in ZooKeeper?


The interesting bit of the log above, aside from the NPE, is this:
2020-05-19 03:44:40.417 INFO  (main) [   ] o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (0) -> (1)
2020-05-19 03:44:56.606 INFO  (zkCallback-12-thread-2) [   ] o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (1) -> (0)
Which suggests to me that there's either a race condition or the problem is caused by some zookeeper outage during startup. Since there's a 16 second gap between those messages.

It's possible that the problem is in some way caused by our own code in xxxxxxx.solr.servlet.RobustSolrDispatchFilter which wraps the SolrDispatchFIlter, and creates a SolrZkClient in a try/with resources (so should be autoclosed), but that all happens before createCoreContainer is called.

Any thoughts?

Thanks,
Colvin
Reply | Threaded
Open this post in threaded view
|

Re: Intermittent NullPointerException in ZkController on Solr 8.x

Ishan Chattopadhyaya
Hi Colvin,
Off topic: but did you consider contributing your robust SDF changes upstream? It you need it, others might as well. And if everyone is using it, the tracking down of bugs in supported code becomes easier as well. 🙂
Regards,
Ishan

On Wed, 20 May, 2020, 7:39 pm Colvin Cowie, <[hidden email]> wrote:
Hello,

I've only seen this very occasionally on our automated tests so I've never really dug into it.
Running on Solr 8.3.1

2020-05-19 03:44:40.220 INFO  (main) [   ] o.a.s.c.ZkContainer Zookeeper client=XXXXXXXXX:9983/xxxx_cluster
2020-05-19 03:44:40.238 INFO  (main) [   ] o.a.s.c.c.SolrZkClient Using ZkCredentialsProvider: xxxxxxx.zookeeper.auth.internal.EncodedZkCredentialsProvider
2020-05-19 03:44:40.241 INFO  (main) [   ] o.a.s.c.c.ConnectionManager Waiting for client to connect to ZooKeeper
2020-05-19 03:44:40.245 INFO  (zkConnectionManagerCallback-11-thread-1) [   ] o.a.s.c.c.ConnectionManager zkClient has connected
2020-05-19 03:44:40.245 INFO  (main) [   ] o.a.s.c.c.ConnectionManager Client is connected to ZooKeeper
2020-05-19 03:44:40.359 INFO  (main) [   ] o.a.s.c.c.ConnectionManager Waiting for client to connect to ZooKeeper
2020-05-19 03:44:40.361 INFO  (zkConnectionManagerCallback-13-thread-1) [   ] o.a.s.c.c.ConnectionManager zkClient has connected
2020-05-19 03:44:40.361 INFO  (main) [   ] o.a.s.c.c.ConnectionManager Client is connected to ZooKeeper
2020-05-19 03:44:40.417 INFO  (main) [   ] o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (0) -> (1)
2020-05-19 03:44:56.606 INFO  (zkCallback-12-thread-2) [   ] o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (1) -> (0)
2020-05-19 03:44:56.614 ERROR (main) [   ] o.a.s.s.SolrDispatchFilter Could not start Solr. Check solr/home property and the logs
2020-05-19 03:44:56.639 ERROR (main) [   ] o.a.s.c.SolrCore null:java.lang.NullPointerException
at org.apache.solr.cloud.ZkController.lambda$registerLiveNodesListener$10(ZkController.java:1020)
at org.apache.solr.common.cloud.ZkStateReader.registerLiveNodesListener(ZkStateReader.java:880)
at org.apache.solr.cloud.ZkController.registerLiveNodesListener(ZkController.java:1035)
at org.apache.solr.cloud.ZkController.init(ZkController.java:917)
at org.apache.solr.cloud.ZkController.<init>(ZkController.java:473)
at org.apache.solr.core.ZkContainer.initZooKeeper(ZkContainer.java:115)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:631)
at xxxxxxx.solr.servlet.RobustSolrDispatchFilter.createCoreContainer(RobustSolrDispatchFilter.java:71)

I couldn't find any bug reports in JIRA for the NPE.

Here's the full log


byte[] json = Utils.toJSON(Collections.singletonMap("timestamp", cloudManager.getTimeSource().getEpochTimeNs()));
so I don't know whether it's the cloudManager or the time source that's null.
That bit of the ZkController was added by https://issues.apache.org/jira/browse/SOLR-13072 and I see it is only hit if
zkStateReader.getAutoScalingConfig().hasTriggerForEvents(TriggerEventType.NODELOST);

We have never (knowingly) configured autoscaling and we don't use it, but I see the autoscaling files are present in ZK. Is the autoscaling.json etc created by default when it is absent in ZooKeeper?


The interesting bit of the log above, aside from the NPE, is this:
2020-05-19 03:44:40.417 INFO  (main) [   ] o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (0) -> (1)
2020-05-19 03:44:56.606 INFO  (zkCallback-12-thread-2) [   ] o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (1) -> (0)
Which suggests to me that there's either a race condition or the problem is caused by some zookeeper outage during startup. Since there's a 16 second gap between those messages.

It's possible that the problem is in some way caused by our own code in xxxxxxx.solr.servlet.RobustSolrDispatchFilter which wraps the SolrDispatchFIlter, and creates a SolrZkClient in a try/with resources (so should be autoclosed), but that all happens before createCoreContainer is called.

Any thoughts?

Thanks,
Colvin
Reply | Threaded
Open this post in threaded view
|

Re: Intermittent NullPointerException in ZkController on Solr 8.x

Colvin Cowie
Hi Ishan, thanks for prompting me to look at it again.
The robust SDF was something I added to our product before https://issues.apache.org/jira/browse/SOLR-5129 was implemented, which should achieve the same effect (unless my understanding of the intention behind it is wrong).
However when I tested it previously it didn't work... I can see why it doesn't work though, so I will create an issue for that and can put a patch in.


On Wed, 20 May 2020 at 16:18, Ishan Chattopadhyaya <[hidden email]> wrote:
Hi Colvin,
Off topic: but did you consider contributing your robust SDF changes upstream? It you need it, others might as well. And if everyone is using it, the tracking down of bugs in supported code becomes easier as well. 🙂
Regards,
Ishan

On Wed, 20 May, 2020, 7:39 pm Colvin Cowie, <[hidden email]> wrote:
Hello,

I've only seen this very occasionally on our automated tests so I've never really dug into it.
Running on Solr 8.3.1

2020-05-19 03:44:40.220 INFO  (main) [   ] o.a.s.c.ZkContainer Zookeeper client=XXXXXXXXX:9983/xxxx_cluster
2020-05-19 03:44:40.238 INFO  (main) [   ] o.a.s.c.c.SolrZkClient Using ZkCredentialsProvider: xxxxxxx.zookeeper.auth.internal.EncodedZkCredentialsProvider
2020-05-19 03:44:40.241 INFO  (main) [   ] o.a.s.c.c.ConnectionManager Waiting for client to connect to ZooKeeper
2020-05-19 03:44:40.245 INFO  (zkConnectionManagerCallback-11-thread-1) [   ] o.a.s.c.c.ConnectionManager zkClient has connected
2020-05-19 03:44:40.245 INFO  (main) [   ] o.a.s.c.c.ConnectionManager Client is connected to ZooKeeper
2020-05-19 03:44:40.359 INFO  (main) [   ] o.a.s.c.c.ConnectionManager Waiting for client to connect to ZooKeeper
2020-05-19 03:44:40.361 INFO  (zkConnectionManagerCallback-13-thread-1) [   ] o.a.s.c.c.ConnectionManager zkClient has connected
2020-05-19 03:44:40.361 INFO  (main) [   ] o.a.s.c.c.ConnectionManager Client is connected to ZooKeeper
2020-05-19 03:44:40.417 INFO  (main) [   ] o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (0) -> (1)
2020-05-19 03:44:56.606 INFO  (zkCallback-12-thread-2) [   ] o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (1) -> (0)
2020-05-19 03:44:56.614 ERROR (main) [   ] o.a.s.s.SolrDispatchFilter Could not start Solr. Check solr/home property and the logs
2020-05-19 03:44:56.639 ERROR (main) [   ] o.a.s.c.SolrCore null:java.lang.NullPointerException
at org.apache.solr.cloud.ZkController.lambda$registerLiveNodesListener$10(ZkController.java:1020)
at org.apache.solr.common.cloud.ZkStateReader.registerLiveNodesListener(ZkStateReader.java:880)
at org.apache.solr.cloud.ZkController.registerLiveNodesListener(ZkController.java:1035)
at org.apache.solr.cloud.ZkController.init(ZkController.java:917)
at org.apache.solr.cloud.ZkController.<init>(ZkController.java:473)
at org.apache.solr.core.ZkContainer.initZooKeeper(ZkContainer.java:115)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:631)
at xxxxxxx.solr.servlet.RobustSolrDispatchFilter.createCoreContainer(RobustSolrDispatchFilter.java:71)

I couldn't find any bug reports in JIRA for the NPE.

Here's the full log


byte[] json = Utils.toJSON(Collections.singletonMap("timestamp", cloudManager.getTimeSource().getEpochTimeNs()));
so I don't know whether it's the cloudManager or the time source that's null.
That bit of the ZkController was added by https://issues.apache.org/jira/browse/SOLR-13072 and I see it is only hit if
zkStateReader.getAutoScalingConfig().hasTriggerForEvents(TriggerEventType.NODELOST);

We have never (knowingly) configured autoscaling and we don't use it, but I see the autoscaling files are present in ZK. Is the autoscaling.json etc created by default when it is absent in ZooKeeper?


The interesting bit of the log above, aside from the NPE, is this:
2020-05-19 03:44:40.417 INFO  (main) [   ] o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (0) -> (1)
2020-05-19 03:44:56.606 INFO  (zkCallback-12-thread-2) [   ] o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (1) -> (0)
Which suggests to me that there's either a race condition or the problem is caused by some zookeeper outage during startup. Since there's a 16 second gap between those messages.

It's possible that the problem is in some way caused by our own code in xxxxxxx.solr.servlet.RobustSolrDispatchFilter which wraps the SolrDispatchFIlter, and creates a SolrZkClient in a try/with resources (so should be autoclosed), but that all happens before createCoreContainer is called.

Any thoughts?

Thanks,
Colvin
Reply | Threaded
Open this post in threaded view
|

Re: Intermittent NullPointerException in ZkController on Solr 8.x

Colvin Cowie
(Patch for that issue available on https://issues.apache.org/jira/browse/SOLR-14503)

On Wed, 20 May 2020 at 17:06, Colvin Cowie <[hidden email]> wrote:
Hi Ishan, thanks for prompting me to look at it again.
The robust SDF was something I added to our product before https://issues.apache.org/jira/browse/SOLR-5129 was implemented, which should achieve the same effect (unless my understanding of the intention behind it is wrong).
However when I tested it previously it didn't work... I can see why it doesn't work though, so I will create an issue for that and can put a patch in.


On Wed, 20 May 2020 at 16:18, Ishan Chattopadhyaya <[hidden email]> wrote:
Hi Colvin,
Off topic: but did you consider contributing your robust SDF changes upstream? It you need it, others might as well. And if everyone is using it, the tracking down of bugs in supported code becomes easier as well. 🙂
Regards,
Ishan

On Wed, 20 May, 2020, 7:39 pm Colvin Cowie, <[hidden email]> wrote:
Hello,

I've only seen this very occasionally on our automated tests so I've never really dug into it.
Running on Solr 8.3.1

2020-05-19 03:44:40.220 INFO  (main) [   ] o.a.s.c.ZkContainer Zookeeper client=XXXXXXXXX:9983/xxxx_cluster
2020-05-19 03:44:40.238 INFO  (main) [   ] o.a.s.c.c.SolrZkClient Using ZkCredentialsProvider: xxxxxxx.zookeeper.auth.internal.EncodedZkCredentialsProvider
2020-05-19 03:44:40.241 INFO  (main) [   ] o.a.s.c.c.ConnectionManager Waiting for client to connect to ZooKeeper
2020-05-19 03:44:40.245 INFO  (zkConnectionManagerCallback-11-thread-1) [   ] o.a.s.c.c.ConnectionManager zkClient has connected
2020-05-19 03:44:40.245 INFO  (main) [   ] o.a.s.c.c.ConnectionManager Client is connected to ZooKeeper
2020-05-19 03:44:40.359 INFO  (main) [   ] o.a.s.c.c.ConnectionManager Waiting for client to connect to ZooKeeper
2020-05-19 03:44:40.361 INFO  (zkConnectionManagerCallback-13-thread-1) [   ] o.a.s.c.c.ConnectionManager zkClient has connected
2020-05-19 03:44:40.361 INFO  (main) [   ] o.a.s.c.c.ConnectionManager Client is connected to ZooKeeper
2020-05-19 03:44:40.417 INFO  (main) [   ] o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (0) -> (1)
2020-05-19 03:44:56.606 INFO  (zkCallback-12-thread-2) [   ] o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (1) -> (0)
2020-05-19 03:44:56.614 ERROR (main) [   ] o.a.s.s.SolrDispatchFilter Could not start Solr. Check solr/home property and the logs
2020-05-19 03:44:56.639 ERROR (main) [   ] o.a.s.c.SolrCore null:java.lang.NullPointerException
at org.apache.solr.cloud.ZkController.lambda$registerLiveNodesListener$10(ZkController.java:1020)
at org.apache.solr.common.cloud.ZkStateReader.registerLiveNodesListener(ZkStateReader.java:880)
at org.apache.solr.cloud.ZkController.registerLiveNodesListener(ZkController.java:1035)
at org.apache.solr.cloud.ZkController.init(ZkController.java:917)
at org.apache.solr.cloud.ZkController.<init>(ZkController.java:473)
at org.apache.solr.core.ZkContainer.initZooKeeper(ZkContainer.java:115)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:631)
at xxxxxxx.solr.servlet.RobustSolrDispatchFilter.createCoreContainer(RobustSolrDispatchFilter.java:71)

I couldn't find any bug reports in JIRA for the NPE.

Here's the full log


byte[] json = Utils.toJSON(Collections.singletonMap("timestamp", cloudManager.getTimeSource().getEpochTimeNs()));
so I don't know whether it's the cloudManager or the time source that's null.
That bit of the ZkController was added by https://issues.apache.org/jira/browse/SOLR-13072 and I see it is only hit if
zkStateReader.getAutoScalingConfig().hasTriggerForEvents(TriggerEventType.NODELOST);

We have never (knowingly) configured autoscaling and we don't use it, but I see the autoscaling files are present in ZK. Is the autoscaling.json etc created by default when it is absent in ZooKeeper?


The interesting bit of the log above, aside from the NPE, is this:
2020-05-19 03:44:40.417 INFO  (main) [   ] o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (0) -> (1)
2020-05-19 03:44:56.606 INFO  (zkCallback-12-thread-2) [   ] o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (1) -> (0)
Which suggests to me that there's either a race condition or the problem is caused by some zookeeper outage during startup. Since there's a 16 second gap between those messages.

It's possible that the problem is in some way caused by our own code in xxxxxxx.solr.servlet.RobustSolrDispatchFilter which wraps the SolrDispatchFIlter, and creates a SolrZkClient in a try/with resources (so should be autoclosed), but that all happens before createCoreContainer is called.

Any thoughts?

Thanks,
Colvin
Reply | Threaded
Open this post in threaded view
|

Re: Intermittent NullPointerException in ZkController on Solr 8.x

Colvin Cowie
Found the problem in ZkController, I've opened https://issues.apache.org/jira/browse/SOLR-14504 and will add a patch

On Wed, 20 May 2020 at 18:58, Colvin Cowie <[hidden email]> wrote:
(Patch for that issue available on https://issues.apache.org/jira/browse/SOLR-14503)

On Wed, 20 May 2020 at 17:06, Colvin Cowie <[hidden email]> wrote:
Hi Ishan, thanks for prompting me to look at it again.
The robust SDF was something I added to our product before https://issues.apache.org/jira/browse/SOLR-5129 was implemented, which should achieve the same effect (unless my understanding of the intention behind it is wrong).
However when I tested it previously it didn't work... I can see why it doesn't work though, so I will create an issue for that and can put a patch in.


On Wed, 20 May 2020 at 16:18, Ishan Chattopadhyaya <[hidden email]> wrote:
Hi Colvin,
Off topic: but did you consider contributing your robust SDF changes upstream? It you need it, others might as well. And if everyone is using it, the tracking down of bugs in supported code becomes easier as well. 🙂
Regards,
Ishan

On Wed, 20 May, 2020, 7:39 pm Colvin Cowie, <[hidden email]> wrote:
Hello,

I've only seen this very occasionally on our automated tests so I've never really dug into it.
Running on Solr 8.3.1

2020-05-19 03:44:40.220 INFO  (main) [   ] o.a.s.c.ZkContainer Zookeeper client=XXXXXXXXX:9983/xxxx_cluster
2020-05-19 03:44:40.238 INFO  (main) [   ] o.a.s.c.c.SolrZkClient Using ZkCredentialsProvider: xxxxxxx.zookeeper.auth.internal.EncodedZkCredentialsProvider
2020-05-19 03:44:40.241 INFO  (main) [   ] o.a.s.c.c.ConnectionManager Waiting for client to connect to ZooKeeper
2020-05-19 03:44:40.245 INFO  (zkConnectionManagerCallback-11-thread-1) [   ] o.a.s.c.c.ConnectionManager zkClient has connected
2020-05-19 03:44:40.245 INFO  (main) [   ] o.a.s.c.c.ConnectionManager Client is connected to ZooKeeper
2020-05-19 03:44:40.359 INFO  (main) [   ] o.a.s.c.c.ConnectionManager Waiting for client to connect to ZooKeeper
2020-05-19 03:44:40.361 INFO  (zkConnectionManagerCallback-13-thread-1) [   ] o.a.s.c.c.ConnectionManager zkClient has connected
2020-05-19 03:44:40.361 INFO  (main) [   ] o.a.s.c.c.ConnectionManager Client is connected to ZooKeeper
2020-05-19 03:44:40.417 INFO  (main) [   ] o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (0) -> (1)
2020-05-19 03:44:56.606 INFO  (zkCallback-12-thread-2) [   ] o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (1) -> (0)
2020-05-19 03:44:56.614 ERROR (main) [   ] o.a.s.s.SolrDispatchFilter Could not start Solr. Check solr/home property and the logs
2020-05-19 03:44:56.639 ERROR (main) [   ] o.a.s.c.SolrCore null:java.lang.NullPointerException
at org.apache.solr.cloud.ZkController.lambda$registerLiveNodesListener$10(ZkController.java:1020)
at org.apache.solr.common.cloud.ZkStateReader.registerLiveNodesListener(ZkStateReader.java:880)
at org.apache.solr.cloud.ZkController.registerLiveNodesListener(ZkController.java:1035)
at org.apache.solr.cloud.ZkController.init(ZkController.java:917)
at org.apache.solr.cloud.ZkController.<init>(ZkController.java:473)
at org.apache.solr.core.ZkContainer.initZooKeeper(ZkContainer.java:115)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:631)
at xxxxxxx.solr.servlet.RobustSolrDispatchFilter.createCoreContainer(RobustSolrDispatchFilter.java:71)

I couldn't find any bug reports in JIRA for the NPE.

Here's the full log


byte[] json = Utils.toJSON(Collections.singletonMap("timestamp", cloudManager.getTimeSource().getEpochTimeNs()));
so I don't know whether it's the cloudManager or the time source that's null.
That bit of the ZkController was added by https://issues.apache.org/jira/browse/SOLR-13072 and I see it is only hit if
zkStateReader.getAutoScalingConfig().hasTriggerForEvents(TriggerEventType.NODELOST);

We have never (knowingly) configured autoscaling and we don't use it, but I see the autoscaling files are present in ZK. Is the autoscaling.json etc created by default when it is absent in ZooKeeper?


The interesting bit of the log above, aside from the NPE, is this:
2020-05-19 03:44:40.417 INFO  (main) [   ] o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (0) -> (1)
2020-05-19 03:44:56.606 INFO  (zkCallback-12-thread-2) [   ] o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (1) -> (0)
Which suggests to me that there's either a race condition or the problem is caused by some zookeeper outage during startup. Since there's a 16 second gap between those messages.

It's possible that the problem is in some way caused by our own code in xxxxxxx.solr.servlet.RobustSolrDispatchFilter which wraps the SolrDispatchFIlter, and creates a SolrZkClient in a try/with resources (so should be autoclosed), but that all happens before createCoreContainer is called.

Any thoughts?

Thanks,
Colvin