How to prevent solr from deleting cores when getting an empty config from zookeeper


Koen De Groote
Hello,

I recently ran into the following scenario:

Solr, version 7.5, in a docker container, running as cloud, with an
external zookeeper ensemble of 3 zookeepers. Instructions were followed to
create a root first, and this was set correctly, as could be seen from the
solr logs outputting the connect info.

The root command was: "bin/solr zk mkroot /solr -z <address>"

For a yet undetermined reason, the zookeeper ensemble had some kind of
split-brain occur. At a later point, Solr was restarted and then suddenly
all its directories were gone.

By which I mean: the directories containing the configuration and the data.
The stopwords, the schema, the solr config, the "shard1_replica_n2"
directories, those directories.

Those were gone without a trace.

As far as I can tell, solr started, asked zookeeper for its config,
zookeeper returned an empty config and consequently "made it so".

I am by no means very knowledgeable about solr internals. Can anyone chime
in as to what happened here and how to prevent it? Is more info needed?

Ideally, if something like this were to happen, I'd like for either solr to
not delete folders or if that's not possible, add some kind of pre-startup
check that stops solr from going any further if things go wrong.

Regards,
Koen

Re: How to prevent solr from deleting cores when getting an empty config from zookeeper

Gus Heck
Deleting data on a zookeeper hiccup does sound bad if it's really solr's
fault. Can you work up a set of steps to reproduce? Something like install
solr, index tech products example, shut down solr, perform some editing to
zk, start solr, observe data gone (but with lots of details about exact
configurations/commands/edits etc)?

"some sort of split brain" is nebulous and nobody will know if they've
solved your problem unless that can be quantified and the problem
replicated.

-Gus



--
http://www.the111shift.com

Re: How to prevent solr from deleting cores when getting an empty config from zookeeper

Koen De Groote
Attached to this mail is a tar.gz with instructions to reproduce.

It contains 3 text files with commands and comments. Be sure to check the actual commands before executing.
This was tested on a Ubuntu 18.04 VM, with docker installed on it.

The order of execution is:

- zookeeper instructions.txt
- solr instructions.txt
- after setup.txt

The behaviour was reproduced several times, and it was found that the solr/java process itself was issuing the deletion of the files and directories.

The basic steps are: set up zookeeper, set up solr root, set up solr. Create dummy collection with example data. Stop the containers. Delete the zookeeper 'version-2' folder. Recreate zookeeper container. Redo the mkroot, recreate solr container. At this point, solr will start complaining about the cores after a bit and then the data folders will be deleted.
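In sketch form, the outline above looks like this. This is only a dry-run that prints each step; the image names and paths are illustrative, and the attached instruction files remain the authoritative commands:

```shell
# Dry-run sketch of the reproduction outline. Nothing here touches docker;
# it only prints the sequence, so the real commands can be reviewed first.
step() { echo "STEP: $*"; }

step "start a 3-node zookeeper ensemble in docker"
step "create the chroot: bin/solr zk mkroot /solr -z <address>"
step "start solr 7.5 in cloud mode pointed at the ensemble"
step "create a dummy collection and index example data"
step "stop the solr and zookeeper containers"
step "delete zookeeper's 'version-2' data directory"
step "recreate the zookeeper container and redo the mkroot"
step "recreate the solr container and watch the core directories vanish"
```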

Hope this is clear and complete.

Regards,
Koen


Attachment: instructions.tar.gz (4K)

Re: How to prevent solr from deleting cores when getting an empty config from zookeeper

Shawn Heisey-2
On 4/11/2019 3:17 AM, Koen De Groote wrote:
> The basic steps are: set up zookeeper, set up solr root, set up solr.
> Create dummy collection with example data. Stop the containers. Delete
> the zookeeper 'version-2' folder. Recreate zookeeper container. Redo the
> mkroot, recreate solr container. At this point, solr will start
> complaining about the cores after a bit and then the data folders will
> be deleted.

By deleting the "version-2" folder, you deleted the entire ZooKeeper
database.  All of the information that makes up your entire SolrCloud
cluster is *gone*.

We are trying as hard as we can to move to a "ZooKeeper As Truth" model.
  Right now, the truth of a SolrCloud cluster is a combination of what's
in ZooKeeper and what actually exists on disk, rather than just what's
in ZK.

It surprises me greatly that Solr is deleting data.  I would expect it
to simply ignore cores during startup if there is no corresponding data
in ZooKeeper.  In the past, I have seen evidence of it doing exactly that.

So although it sounds like we do have a bug that needs fixing (SolrCloud
should never delete data unless it has been explicitly asked to do so),
you created this problem yourself by deleting all of your ZooKeeper data
-- in essence, deleting the entire cluster.

Thanks,
Shawn

Re: How to prevent solr from deleting cores when getting an empty config from zookeeper

Koen De Groote
Shawn,

Apologies, I should have explained this more clearly.

To clarify: manually deleting the 'version-2' directory is never something
that happened when I first observed this behavior.
The reason I did it in this example is that it's the fastest and simplest
way to demonstrate the behavior.

What I have experienced is attempting to add a zookeeper to the ensemble
and, despite the container starting, it not being joined to the ensemble,
probably because of iptables/firewall rules. So this zookeeper is "empty"
because it could not sync with the others.
However, the deploy doesn't stop, because it sees the container running and
thinks everything is OK. The deploy then continues and restarts the solr
container (because the new ZK_HOST configuration has to be provided; as far
as I know this is not possible dynamically).
At this time, for some reason, solr connects to that specific zookeeper
first, gets an empty configuration, and deletes the folders. This I have
seen happen.

This is the "split brain thing" I referred to in my first email.

I have also seen a move from native to container-based zookeepers where the
zookeeper 'version-2' data folder was not properly mounted. The native
zookeepers had a different location than what was provided.
The automated deploy process checks that the folder exists, creates an
empty one if it doesn't and just continues to the next step.
This also resulted in solr connecting to an "empty" zookeeper and deleting
all the folders.


So what I needed to simulate was a Solr that had cores, connecting to an
empty zookeeper, and then losing those cores.
Manually doing the zookeeper delete is a far simpler and more easily
reproducible way of demonstrating this.
I could have started a second zookeeper with a different data mount, then
updated the ZK_HOST to point only to that one, I suppose.
It's just an example of how to arrive at these events.

That being explained, am I right in understanding that currently there is
no way of configuring Solr so that it won't delete the folders, in this
event?

I'm in the process of writing a script that basically does "docker exec
<zookeeper container> bin/zkCli.sh ls <path>" to every single known
zookeeper container and if they don't all return what I expect, the deploy
stops right before starting the solr container. That should be a safeguard,
for now, I suppose?
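A minimal sketch of that safeguard might look like the following. The container names, chroot path, and the "empty children list" check are assumptions about my setup; the docker call is commented out here so the check logic itself can be reviewed in isolation:

```shell
# Pre-deploy safeguard sketch: refuse to start Solr if any known
# zookeeper reports an empty chroot.

ZK_PATH="/solr"   # the chroot created with "bin/solr zk mkroot"

# Decide, from the final line of a zkCli "ls" output, whether the chroot
# looks populated. An empty child list "[]" means an empty ZooKeeper.
zk_looks_populated() {
  case "$1" in
    ""|"[]") return 1 ;;   # no children: stop before starting Solr
    *)       return 0 ;;
  esac
}

# In the real script, each known container would be queried like this
# (commented out because it needs a running docker daemon):
#   out="$(docker exec zk1 bin/zkCli.sh ls "$ZK_PATH" | tail -n 1)"
#   zk_looks_populated "$out" || { echo "abort deploy" >&2; exit 1; }

# Demonstration with canned zkCli outputs:
zk_looks_populated "[configs, collections]" && echo "zk1 OK"
zk_looks_populated "[]" || echo "zk2 EMPTY - deploy stopped"
```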

Side note in general for anyone reading this later in the archives: the
instructions tar.gz in my previous message contains the output of an audit
rule that was put on the data folder.
This output shows that the process performing the deletes is, in fact, the
solr (java) process, performing the syscalls "unlink" and "rmdir" on the
specific files and directories.
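For anyone who wants to reproduce that audit evidence: the watch rule looks roughly like the commented lines below (the data path and key name are placeholders, and auditctl needs root), and a small filter can pull the syscall name out of a decoded audit line:

```shell
# Audit-rule sketch. The actual rule and path depend on your install;
# these two commands are shown as comments only (they need root/auditd):
#   auditctl -w /var/solr/data -p wa -k solr-data
#   ausearch -k solr-data -i | grep -E 'unlink|rmdir'

# Extract the syscall name from a decoded audit line containing
# "syscall=NAME".
extract_syscall() {
  sed -n 's/.*syscall=\([a-z_]*\).*/\1/p' <<<"$1"
}

extract_syscall 'type=SYSCALL ... syscall=rmdir ... comm="java"'  # prints: rmdir
```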

Regards,
Koen


Re: How to prevent solr from deleting cores when getting an empty config from zookeeper

Shawn Heisey-2
On 4/11/2019 2:40 PM, Koen De Groote wrote:
> That being explained, am I right in understanding that currently there is
> no way of configuring Solr so that it won't delete the folders, in this
> event?

In my opinion, Solr should never delete cores unless it has been
explicitly *ASKED* to do so with some kind of delete request.

I found this issue, which might explain it:

https://issues.apache.org/jira/browse/SOLR-12066

I think I understand what the goal was with that issue, but if I
understand it correctly, it basically will cause SolrCloud to delete any
core from disk on startup which is not referenced in ZooKeeper.

Can you share a solr.log file from a Solr instance that is deleting
data?  You'll need to use a file sharing site -- attachments are almost
always stripped by the mailing list software.  It will be helpful to
know what directories and core names were deleted, so that info can be
checked in the log.

I do not know whether Solr logs anything when it deletes a core.  If
not, it should.

Here's what I think should happen instead:  If a core that SolrCloud finds
during startup is not in the ZooKeeper database, it should simply not
start, with a helpful message at WARN in the log.  The data should never
be deleted automatically on startup.

Thanks,
Shawn

Re: How to prevent solr from deleting cores when getting an empty config from zookeeper

Koen De Groote
I gathered a solr log from 7.6.0 at TRACE level.

Then I replicated the experiment with 6.6.5 and with that version, the
directories were not deleted. Log also included.

The audit log is from solr7. The deletes start at 01:51:48, which
translates to 23:51:48 UTC, which you'll be able to find in the solr7 log.
The directories were deleted, you can see the calls in the audit logs, but
I can't identify in the solr7 log if a delete is being called somewhere.
Could be that it's not logged at all.

I zipped it all and put it on dropbox:
https://www.dropbox.com/s/fei2td3zdh92i67/research.zip?dl=0

The order of the setup is:

- zookeeper
- solr
- after setup (contains audit log output of a previous attempt)

Anyone trying to replicate, be careful not to blindly copy the commands.
This was done on a fresh Ubuntu 18.04 VM, which I suggest for anyone
wanting to test this.

Regards,
Koen




Re: How to prevent solr from deleting cores when getting an empty config from zookeeper

Shawn Heisey-2
On 4/11/2019 6:44 PM, Koen De Groote wrote:

> I gathered a solr log from 7.6.0 at TRACE level.
>
> Then I replicated the experiment with 6.6.5 and with that version, the
> directories were not deleted. Log also included.
>
> The audit log is from solr7. The deletes start at 01:51:48, which
> translates to 23:51:48 UTC, which you'll be able to find in the solr7 log.
> The directories were deleted, you can see the calls in the audit logs, but
> I can't identify in the solr7 log if a delete is being called somewhere.
> Could be that it's not logged at all.

I think that SOLR-12066 is indeed the cause of the problem.  The intent
with that issue was to eliminate cores that had been deleted while the
node was down ... but in practice, it serves to delete any core data
that isn't in the clusterstate.

It's certainly true that a well-designed ZooKeeper ensemble with a
minimum of three nodes is extremely unlikely to lose its database, but
somebody might use the wrong ZKHOST setting and accidentally point their
SolrCloud install at an ensemble that exists but has no data.  Problems
with a chroot are most likely, I think -- forgetting to add the chroot
seems like a good way to cause this issue.
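To illustrate that chroot pitfall concretely (host names and ports below are placeholders): the chroot has to appear both when it is created and in the ZK_HOST string Solr is started with, and forgetting it in the latter points Solr at the ensemble root, which may well be empty:

```shell
# The chroot is created once against the bare ensemble:
#   bin/solr zk mkroot /solr -z zk1:2181,zk2:2181,zk3:2181

# Correct: Solr reads and writes under /solr.
GOOD_ZK_HOST="zk1:2181,zk2:2181,zk3:2181/solr"

# Risky: without the chroot, Solr talks to the ensemble root, which may
# have no SolrCloud data at all -- the "empty config" case in this thread.
BAD_ZK_HOST="zk1:2181,zk2:2181,zk3:2181"

# A trivial sanity check a deploy script could run:
case "$GOOD_ZK_HOST" in
  */solr) echo "chroot present" ;;
  *)      echo "chroot missing" ;;
esac
```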

I have opened an issue for the problem.

https://issues.apache.org/jira/browse/SOLR-13396

Thanks,
Shawn

Re: How to prevent solr from deleting cores when getting an empty config from zookeeper

Yogendra Kumar Soni
I faced a similar situation with solr cloud.
http://lucene.472066.n3.nabble.com/Solr-Cloud-wiping-all-cores-when-restart-without-proper-zookeeper-directories-td4420598.html

Solr is deleting only the folders containing solr cores; any other folder is
intact.



--
Thanks and Regards,
Yogendra Kumar Soni

Re: How to prevent solr from deleting cores when getting an empty config from zookeeper

Koen De Groote
In reply to this post by Shawn Heisey-2
Thanks for that, and for your time.

Kind regards,
Koen
