[jira] [Commented] (SOLR-11216) Make PeerSync more robust


JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-11216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16511004#comment-16511004 ]

Cao Manh Dat commented on SOLR-11216:

After spending a day adding more tests and debugging, I think that with the current IndexFingerprint implementation we can't go with Solution 3.
Firstly, to go with Solution 3, we must compute the fingerprint of the index up to a specified version. But we can't do that just by looking at the current index. For example:
The leader:
- with updates: doc1(v=0), doc2(v=1), doc3(v=3), delete(doc3, v=4), doc2(v=5).
- its index will be: doc1(v=0), doc2(v=5)

The replica:
- with index: doc1(v=0), doc2(v=1)

Case 1:
The replica asks for updates and a fingerprint up to and including v=3. The leader will return the update doc3(v=3).
- the leader's fingerprint will be the hash of doc1(v=0) only (it skips doc2, since its version 5 > the specified version 3, and doc3 has already been deleted from its index)
- the replica's fingerprint will be the hash of doc1(v=0), doc2(v=1), doc3(v=3)
-> the fingerprints differ, so the comparison fails.
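
Case 1 can be reproduced with a small simulation. This is only a sketch: `apply_updates`, `versions_in_fingerprint` and `fingerprint` are hypothetical helpers, and the hash is a toy stand-in rather than Solr's actual IndexFingerprint code. The only property it assumes is the one described above: the fingerprint covers the versions currently visible in the index that are <= the specified maximum version.

```python
import hashlib

def apply_updates(updates):
    """Replay (op, doc_id, version) updates into a toy index: doc_id -> latest version."""
    index = {}
    for op, doc_id, version in updates:
        if op == "add":
            index[doc_id] = version      # an add overwrites the older version of the doc
        elif op == "delete":
            index.pop(doc_id, None)      # a delete removes the doc from the index
    return index

def versions_in_fingerprint(index, max_version):
    """Versions visible in the index that are <= max_version."""
    return sorted(v for v in index.values() if v <= max_version)

def fingerprint(index, max_version):
    """Toy fingerprint (assumption): sum of per-version hashes, modulo 2**64."""
    total = 0
    for v in versions_in_fingerprint(index, max_version):
        digest = hashlib.sha256(str(v).encode()).digest()
        total = (total + int.from_bytes(digest[:8], "big")) % (1 << 64)
    return total

leader_log = [
    ("add", "doc1", 0),
    ("add", "doc2", 1),
    ("add", "doc3", 3),
    ("delete", "doc3", 4),
    ("add", "doc2", 5),
]
leader_index = apply_updates(leader_log)       # doc1(v=0), doc2(v=5)

replica_log = leader_log[:2]                   # replica has doc1(v=0), doc2(v=1)
max_version = 3
# Updates the replica is missing, up to and including v=3: only doc3(v=3).
missing = [u for u in leader_log if 1 < u[2] <= max_version]
replica_index = apply_updates(replica_log + missing)

leader_versions = versions_in_fingerprint(leader_index, max_version)
replica_versions = versions_in_fingerprint(replica_index, max_version)
leader_fp = fingerprint(leader_index, max_version)
replica_fp = fingerprint(replica_index, max_version)

print("leader versions :", leader_versions)
print("replica versions:", replica_versions)
print("fingerprints match:", leader_fp == replica_fp)
```

The leader's current index no longer contains doc2(v=1) or doc3(v=3), so there is no way to recover the state "as of v=3" from the index alone, which is the core of the problem.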

There are many other cases which are very tricky to solve. Therefore I think the best thing to do now is Solution 2.

> Make PeerSync more robust
> -------------------------
>                 Key: SOLR-11216
>                 URL: https://issues.apache.org/jira/browse/SOLR-11216
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public)
>            Reporter: Cao Manh Dat
>            Priority: Major
>         Attachments: SOLR-11216.patch
> First of all, I will change the issue's title to a better name when I have one.
> While digging into SOLR-10126, I found a case that can make PeerSync fail.
> * leader and replica receive updates 1 to 4
> * replica stops
> * replica misses updates 5, 6
> * replica starts recovery
> ## replica buffers updates 7, 8
> ## replica requests versions from leader
> ## at the same time the leader receives update 9, so when the replica asks for recent versions the leader returns versions 1 to 9 (so the list will be 1,2,3,4,5,6,7,8,9)
> ## replica does PeerSync and requests updates 5, 6, 9 from leader
> ## replica applies updates 5, 6, 9. Its index does not have updates 7, 8 and maxVersionSpecified for the fingerprint is 9, therefore the fingerprint comparison will fail
> My idea here is: why does the replica request update 9 (step 6) while it knows that updates with lower versions (updates 7, 8) are in its buffering tlog? Should we request only updates that are lower than the lowest update in its buffering tlog (< 7)?
> Someone may ask: what if the replica never receives update 9? In that case, the leader will put the replica into LIR state, so the replica will run the recovery process again.
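
The request-selection idea in the description above can be sketched as follows. This is an illustration only: `versions_to_request` and its `cap_at_buffer` flag are hypothetical names, not part of PeerSync, and versions are modeled as plain integers.

```python
def versions_to_request(leader_versions, replica_versions, buffered_versions, cap_at_buffer):
    """Pick the versions a recovering replica would ask the leader for.

    cap_at_buffer=False models the behavior described in the issue;
    cap_at_buffer=True models the proposal: only request versions lower
    than the lowest version sitting in the buffering tlog.
    """
    known = set(replica_versions) | set(buffered_versions)
    missing = sorted(set(leader_versions) - known)
    if cap_at_buffer and buffered_versions:
        lowest_buffered = min(buffered_versions)
        missing = [v for v in missing if v < lowest_buffered]
    return missing

leader_versions = list(range(1, 10))   # leader reports versions 1..9
replica_versions = [1, 2, 3, 4]        # applied before the replica stopped
buffered_versions = [7, 8]             # buffered during recovery, not yet in the index

current = versions_to_request(leader_versions, replica_versions, buffered_versions, False)
proposed = versions_to_request(leader_versions, replica_versions, buffered_versions, True)

# Index contents after applying the requested updates (buffered ones are not applied yet).
index_current = sorted(replica_versions + current)
index_proposed = sorted(replica_versions + proposed)

print("current request :", current, "-> index", index_current)
print("proposed request:", proposed, "-> index", index_proposed)
```

With the current behavior the replica's index has a gap at 7, 8 below its highest version 9, which is what makes the fingerprint comparison fail. With the proposed cap the index is contiguous up to 6, and update 9 is left to arrive later or to trigger another recovery through LIR.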
