Bug? Documents not visible after successful commit - chaos testing


Bug? Documents not visible after successful commit - chaos testing

Michael Frank
Hi All,

In our SolrCloud cluster (8.4.1), committed documents are sometimes not
visible to subsequent requests sent after an apparently successful
commit(waitFlush=true, waitSearcher=true). This behaviour does not happen
while all nodes are stable, but it will happen eventually if we kill off
random nodes using a chaos-monkey script.

According to Solr's documentation, a commit with openSearcher=true,
waitSearcher=true and waitFlush=true only returns once everything is
persisted AND the new searcher is visible.

To me this sounds like any subsequent request after a successful
commit MUST hit the new searcher and is guaranteed to see the committed
changes, regardless of node failures or restarts.

Is this assumption of strong consistency for commits with
openSearcher=true, waitSearcher=true and waitFlush=true correct?

If so, we discovered a bug.

TestSetup:
============
Infrastructure

   - 3 Solr (8.4.1) nodes in Docker containers
   - Each Solr node on its own host (the same hosts run the 3 ZooKeeper nodes)
   - A persistent host volume is mounted inside the Docker container
   - Solr instances are pinned to their host
   - A test collection with 1 shard and 2 NRT replicas
   - SolrJ (8.4.1) and CloudSolrClient for communication
   - Containers are automatically restarted on errors
   - autoCommit maxDocs=10000, openSearcher=false
   - autoSoftCommit -never-
   - (We fairly often commit ourselves; the commit settings above are sketched below)
   - the solrconfig.xml <https://pastebin.com/kJpGh3yj>
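
For reference, the autoCommit / autoSoftCommit items correspond roughly to
this updateHandler fragment (a sketch; the linked pastebin is authoritative):

   <autoCommit>
     <maxDocs>10000</maxDocs>
     <openSearcher>false</openSearcher>
   </autoCommit>
   <autoSoftCommit>
     <maxTime>-1</maxTime>
   </autoSoftCommit>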


Scenario
After adding an initial batch of documents we perform multiple
"transactions".
Each "transaction" adds, modifies and deletes documents, and we ensure that
each response has an "rf=2" (achieved replication factor = 2) attribute.
A transaction has to become visible atomically, or not at all.
We achieve this by storing a CurrentVersion counter attribute in each
document.
This makes it easier to verify this corner case, as we can search for
and count all documents having a specific transaction-id counter value.
After a "transaction" has completed without errors we first send a
hardCommit and then a softCommit, both with waitFlush=true and
waitSearcher=true, and ensure they both return without errors.
Only after everything has happened without errors do we start to verify
visibility and correctness of the committed "transaction" by sending
counting queries against Solr, filtering on our transaction-id counter.
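
In SolrJ the commit + verification step boils down to something like this
(a sketch -- "client", "collection", the field name "currentVersion" and
the variable "txId" are placeholders for what we actually use):

 // hard commit with waitFlush=true / waitSearcher=true, fails loudly on error
 new UpdateRequest()
         .setAction(ACTION.COMMIT, true /*waitFlush*/, true /*waitSearcher*/)
         .process(client, collection);
 // then count the documents belonging to the just-committed "transaction"
 QueryResponse rsp = client.query(collection,
         new SolrQuery("*:*").addFilterQuery("currentVersion:" + txId).setRows(0));
 long visible = rsp.getResults().getNumFound(); // expected: the size of the transaction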
This works fine as long as all nodes are stable. However ..

ErrorCase
If we periodically kill (SIGTERM) random Solr nodes every 30, eventually
the aforementioned visibility guarantees after commit(waitFlush=true,
waitSearcher=true) break, and documents that should be there/visible are not.
Sometimes this happens after minutes, sometimes it takes hours to hit this
case.

In the error case the verification counting queries return with ZERO hits.

We suspect that commits do not reach all replicas, or that commits are
lost/ignored.
Unfortunately, commit requests do not include the "rf" attribute in their
responses, which would allow us to assert the achieved replication factor.

We hope someone has an idea or clue why this happens or how to fix it, as
this is a showstopper for us and we require strong-consistency guarantees.
(Once a commit was successful at time T, ALL subsequent requests after T
MUST see the new documents.)


Some notes:

   - the observed errors can be reproduced regardless of these settings in
   the solrconfig.xml <https://pastebin.com/kJpGh3yj>
      - useColdSearcher=true/false
      - caches' autowarmCount=0 or any other value
   - errors appear to happen more frequently under higher load (more
   collections running the same test)


Cheers,
Michael
Re: Bug? Documents not visible after successful commit - chaos testing

Chris Hostetter-3

I may be misunderstanding something in your setup, and/or I may be
misremembering things about Solr, but I think the behavior you are
seeing is because *search* in Solr is "eventually consistent" -- while
"RTG" (ie: using the "/get" handler) is (IIRC) "strongly consistent"

ie: there's a reason it's called "Near Real Time Searching" and "NRT
Replica" ... not "RT Replica"

When you kill a node hosting a replica, then send an update which the leader
accepts but can't send to that replica, that replica is now "out of sync"
and will continue to be out of sync when it comes back online and starts
responding to search requests as it recovers from the leader/tlog --
eventually search will have consistent results across all replicas,
but during the recovery period this isn't guaranteed.

If however you use the /get request handler, then it (again, IIRC)
consults the tlog for the latest version of the doc even if it's
mid-recovery and the index itself isn't yet up to date.

So for the purposes of testing Solr as a "strongly consistent" document
store, using /get?id=foo to check the "current" data in the document is
more appropriate than /select?q=id:foo
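
In SolrJ terms the difference is roughly this (a sketch, assuming a
CloudSolrClient named "client" and a collection name "collection"):

 SolrDocument doc = client.getById(collection, "foo");   // RTG: consults the tlog, sees not-yet-searchable updates
 QueryResponse rsp = client.query(collection, new SolrQuery("id:foo")); // search: only sees what the current searcher sees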

Some more info here...

https://lucene.apache.org/solr/guide/8_4/solrcloud-resilience.html
https://lucene.apache.org/solr/guide/8_4/realtime-get.html


A few other things that jumped out at me in your email that seemed weird
or worthy of comment...

: According to Solr's documentation, a commit with openSearcher=true,
: waitSearcher=true and waitFlush=true only returns once everything is
: persisted AND the new searcher is visible.
:
: To me this sounds like any subsequent request after a successful
: commit MUST hit the new searcher and is guaranteed to see the committed
: changes, regardless of node failures or restarts.

that is true for *single* node Solr, or a "healthy" cluster, but as I
mentioned, if a node is down when the "commit" happens it won't have the
document yet -- nor is it alive to process the commit.  The document
update -- and the commit -- are in the tlog that still needs to replay
when the replica comes back online.

:    - A test-collection with 1 Shard and 2 NRT Replicas.

I'm guessing, since you said you were using 3 nodes, that what you
mean here is a single shard with a total of 3 replicas which are all NRT
-- remember the "leader" is itself an NRT replica.

(i know, i know ... i hate the terminology)

This is a really important point to clarify in your testing because of how
you are using 'rf' ... seeing exactly how you create your collection is
important to make sure we're talking about the same thing.

: Each "transaction" adds, modifys and deletes documents and we ensure that
: each response has a "rf=2" (achieved replication factor=2) attribute.

So to be clear: 'rf=2' means a total of 2 replicas confirmed the update --
that includes the leader replica.  'rf=1' means the leader accepted the
doc, but all other replicas are down.

if you want to be 100% certain that every replica received the update,
then you should be confirming rf=3
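
ie: something like this on the SolrJ side (a sketch from memory -- "docs",
"client" and "collection" are placeholders, and i'm assuming the achieved
factor is reported in the response header under "rf", so double check that
against your Solr version):

 UpdateResponse resp = new UpdateRequest().add(docs).process(client, collection);
 Object rf = resp.getResponseHeader().get("rf");
 if (rf == null || ((Number) rf).intValue() < 3) {
   // not every replica (leader included) confirmed this update
 }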

: After a "transaction" was performed without errors we send first a
: hardCommit and then a softCommit, both with waitFlush=true,
: waitSearcher=true and ensure they both return without errors.

FYI: there is no need to send a softCommit after a hardCommit -- a hard
commit with openSearcher=true (the default) is a superset of a soft
commit.
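
ie: in SolrJ terms (a sketch, assuming a SolrClient named "client"):

 client.commit(collection, true /*waitFlush*/, true /*waitSearcher*/);       // hard commit, already opens a new searcher
 client.commit(collection, true /*waitFlush*/, true /*waitSearcher*/, true); // soft commit -- redundant right after the above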



-Hoss
http://www.lucidworks.com/
Re: Bug? Documents not visible after successful commit - chaos testing

Michael Frank
Hi Chris,
thank you for your detailed answer!

We are aware that SolrCloud is eventually consistent, and in our
application that's fine in most cases.
However, what is really important for us is that we get "Read Your
Writes" for a clear point in time - which, in our understanding, should be
after hard commits with waitSearcher=true return successfully from all
replicas. Is that correct?
The client that indexes new documents performs a hard commit with
waitSearcher=true, and after that was successful we expect the documents to
be visible on all replicas.
This seems to work as expected while the cluster is in a healthy state.
If we shut down nodes while updating documents and committing, we observe
that commits somehow get lost.
The documents are neither visible on the leader nor on any replica, even
after all nodes and replicas are up again.
And we don't get any error or exception from the SolrJ client.
Is there any way to make sure that a commit is executed successfully on
_every_ replica (and fail if a replica is currently down or recovering)?
Or to get notified that the commit could not be executed because the
cluster is in an unhealthy state?
If we could confirm and verify this in our indexing client, we could detect
failures and recover.

I don't think the /get request handler is an option for us, because it
only accepts document IDs and not search queries, which we rely on heavily.
Is that correct?


: FYI: there is no need to send a softCommit after a hardCommit
Agreed, that was just us experimenting and trying stuff.

: So to be clear: 'rf=2' means a total of 2 replicas confirmed the update
-- that includes the leader replica.  'rf=1' means the leader accepted the
doc, but all other replicas are down.
: if you want to be 100% certain that every replica received the update,
then you should be confirming rf=3
Agreed, we should have been clearer. We have multiple test scenarios: some
with 2 replicas (1 leader, 1 replica) and some with 3 (1 leader, 2 replicas).
In the first mail I just picked the simplest test setup that failed,
consisting of one leader and one replica - so technically we can
reproduce the error in a two-node cluster.

Cheers,
Michael

Re: Bug? Documents not visible after successful commit - chaos testing

Michael Frank
Hi Group, Hi Chris,
---
We found the Issue and a Workaround
---

We traced the problem down to DistributedUpdateProcessor.doLocalCommit()
<https://github.com/apache/lucene-solr/blob/3ae820424809baa4e5a85d8bbdb0294cf6fe5b9b/solr/core/src/java/org/apache/solr/update/processor/DistributedUpdateProcessor.java#L1081>
which *silently* drops all commits while the replica is inactive and
replaying, returns immediately and still reports status=0.
This behaviour is inconsistent with the documented behaviour of a commit's
"waitSearcher" parameter, which "[..] Blocks until a new searcher is opened
and registered as the main query searcher, making the changes visible."
https://lucene.apache.org/solr/guide/7_7/uploading-data-with-index-handlers.html#xml-update-commands

The issue we have is the "silent" part. If, upon receiving a commit request,
the replica
  - would either wait to become healthy and then commit and return,
honoring waitSearcher=true (which is what we expected from reading the
documentation),
  - or at least behave consistently with all other UpdateRequests and
report back the achieved replication factor via the "rf" response parameter,
we could easily detect the degraded cluster state in the client and keep
re-trying the commit until "rf" matches the number of replicas.

We think this is a bug (silently dropping commits even if the client
requested "waitSearcher"), or at least a missing feature (commits being
the only UpdateRequests not reporting the achieved RF), which should be
worth a JIRA ticket.
While I personally would prefer not dropping commits, we are fine as long
as the client can detect this behaviour, e.g. via an "rf" response
parameter in CommitResponse.

Client Side Workaround
==================
Define a custom updateRequestProcessorChain used only for commits, bypassing
the DistributedUpdateProcessor:

<updateRequestProcessorChain name="custom-force-commit-chain">
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.NoOpDistributingUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

..and manually send the following commit to each replica of the target
collection using the custom chain:

 UpdateRequest commit = new UpdateRequest();
 commit.setAction(ACTION.COMMIT, true, true, softCommit);
 commit.getParams().set(CommonParams.DISTRIB, false); // we are the aggregator and take care of distribution
 commit.getParams().set(UpdateParams.UPDATE_CHAIN, "custom-force-commit-chain");

By bypassing the DistributedUpdateProcessor we get the desired
behaviour: either a successful commit OR a detectable error in the
client on which we can retry. Nice thing: most errors are already
detected by HttpSolrClient, which automatically retries on stale state.

All thrown together (not including the
retryOnException(()->forceCommitAllReplicas(col,false)) part):

    private void forceCommitAllReplicas(String collection, boolean softCommit) throws IOException {
        ZkStateReader zkStateReader = solrClientCache.getCloudSolrClient(zkHost).getZkStateReader();
        Set<String> replicaRoutes = Arrays.stream(CloudSolrStream.getSlices(collection, zkStateReader, true))
                .flatMap(s -> s.getReplicas().stream())
                .map(ZkCoreNodeProps::getCoreUrl)
                .collect(Collectors.toSet());

        UpdateRequest commit = new UpdateRequest();
        commit.setAction(ACTION.COMMIT, true /*waitFlush*/, true /*waitSearcher*/, softCommit);
        commit.getParams().set(CommonParams.DISTRIB, false); // we are the aggregator and take care of distribution
        commit.getParams().set(UpdateParams.UPDATE_CHAIN, "custom-force-commit-chain");

        int achievedRF = replicaRoutes.stream()
                .map(solrClientCache::getHttpSolrClient)
                .mapToInt(client -> {
                    try {
                        UpdateResponse resp = commit.process(client);
                        return resp.getStatus() == 0 ? 1 : 0; // counts towards the achieved replication factor
                    } catch (Exception e) {
                        return 0; // commit did not succeed on this replica
                    }
                })
                .sum();

        if (achievedRF < replicaRoutes.size()) {
            throw new SolrException(ErrorCode.INVALID_STATE,
                    "Cluster is in degraded state - not all replicas acknowledged the commit: "
                    + achievedRF + "/" + replicaRoutes.size());
        }
    }
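
For completeness, the retryOnException(..) part is nothing special -- a
hypothetical sketch (the interface name ThrowingRunnable and the fixed 1s
backoff are made up; our real version also has a retry limit):

    @FunctionalInterface
    interface ThrowingRunnable { void run() throws Exception; }

    private void retryOnException(ThrowingRunnable action) throws InterruptedException {
        while (true) {
            try {
                action.run();
                return; // commit acknowledged by all replicas
            } catch (InterruptedException e) {
                throw e;
            } catch (Exception e) {
                Thread.sleep(1000); // cluster still degraded -> back off and retry
            }
        }
    }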

Cheers
Michael


Re: Bug? Documents not visible after successful commit - chaos testing

Chris Hostetter-3

: We think this is a bug (silently dropping commits even if the client
: requested "waitSearcher"), or at least a missing feature (commits being
: the only UpdateRequests not reporting the achieved RF), which should be
: worth a JIRA ticket.

Thanks for your analysis Michael -- I agree something better should be
done here, and have filed SOLR-14262 for subsequent discussion...

https://issues.apache.org/jira/browse/SOLR-14262

I believe the reason the local commit is ignored during replay is to
ensure a consistent view of the index -- if the tlog being
replayed contains COMMIT1,A,B,C,COMMIT2,D,... we should never open a new
searcher containing just A, or just A+B without C, if a COMMIT3 comes along
during replay -- but I agree with you 100% that commit should support
'rf', making it obvious that this commit didn't succeed (which would also
be important & helpful if the node was still down when the client sends
the commit) ... *AND* ... we should consider making the commit block until
replay is finished.

...BUT... there are probably other nuances I don't understand ...
hopefully other folks more familiar with the current implementation will
chime in on the JIRA.




-Hoss
http://www.lucidworks.com/
Re: Bug? Documents not visible after successful commit - chaos testing

Michael Frank
Hi Chris,
thanks for opening the ticket. I have found some possibly related issues:
Open:
https://issues.apache.org/jira/browse/SOLR-3888  - "need beter handling of
external add/commit requests during tlog recovery"

Closed:
https://issues.apache.org/jira/browse/SOLR-12011
https://issues.apache.org/jira/browse/SOLR-9366

Cheers,
Michael
