Manual failover vote is not limited by two times the node timeout #1305
Conversation
This limit should not restrict manual failover; otherwise, in some scenarios, manual failover will time out. For example, if some FAILOVER_AUTH_REQUESTs or FAILOVER_AUTH_ACKs are lost during a manual failover, the voter cannot vote in the second manual failover. Or, in a mixed scenario of plain failover and manual failover, it cannot vote for the subsequent manual failover. Signed-off-by: Binbin <[email protected]>
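For context, the throttle in question lives in the vote-granting path on the voter side. The sketch below is paraphrased from the valkey cluster code (clusterSendFailoverAuthIfNeeded); field names follow the source, but the exact code may differ:

```c
/* Paraphrased sketch of the pre-existing vote throttle, not the exact
 * source. A voter refuses to vote for any replica of a given primary
 * until two node timeouts have passed since its previous vote for a
 * replica of that same primary. A lost AUTH_REQUEST or AUTH_ACK during
 * a manual failover therefore locks the voter out of the retry. */
if (mstime() - node->replicaof->voted_time < server.cluster_node_timeout * 2) {
    serverLog(LL_WARNING,
              "Failover auth denied to %.40s: can't vote about this primary before %lld milliseconds",
              node->name,
              (long long)(server.cluster_node_timeout * 2 - (mstime() - node->replicaof->voted_time)));
    return;
}
```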
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff             @@
##           unstable    #1305      +/-   ##
============================================
- Coverage     70.69%    70.68%    -0.01%
============================================
  Files           115       115
  Lines         63153     63158        +5
============================================
+ Hits          44643     44646        +3
- Misses        18510     18512        +2
============================================
I'm not sure I agree with this. I think there should be some built-in timeout in the system and you should retry.
The problem with the manual failover retry is that the mf will pause clients for 5s on the primary side. So retrying every time a manual failover times out is a bad move.
> The problem with the manual failover retry is that the mf will pause clients for 5s on the primary side. So retrying every time a manual failover times out is a bad move.
Yes, pausing clients for a long time is a bad move.
We already had this comment:
/* We did not voted for a replica about this primary for two
* times the node timeout. This is not strictly needed for correctness
* of the algorithm but makes the base case more linear. */
Hm, not strictly needed for correctness means that it's OK to change it. It doesn't affect correctness. You added this to the same comment:
* This limitation does not restrict manual failover. If a user initiates
* a manual failover, we need to allow it to vote, otherwise the manual
* failover may time out. */
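Taken together, the fix presumably amounts to exempting manual-failover auth requests from that throttle. A rough sketch, assuming the force_ack flag that the cluster code derives from the manual-failover bit in the request header (CLUSTERMSG_FLAG0_FORCEACK); the actual diff may differ:

```c
/* Rough sketch of the exemption, not the exact diff. force_ack is
 * non-zero when the auth request carries the manual-failover flag,
 * i.e. a user explicitly initiated the failover. */
int force_ack = request->mflags[0] & CLUSTERMSG_FLAG0_FORCEACK;
if (!force_ack && mstime() - node->replicaof->voted_time < server.cluster_node_timeout * 2) {
    /* Automatic failover: still rate-limited to two node timeouts. */
    return;
}
/* Manual failover requests skip the throttle and can be voted on
 * immediately, so a retried or mixed-scenario manual failover does
 * not time out waiting for votes. */
```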
I think it's safe. I like the fix. The test cases look good too. Just some nits.
Co-authored-by: Viktor Söderqvist <[email protected]> Signed-off-by: Binbin <[email protected]>
Looks good. I added a comment about a log message, because you touched it. :) We don't have to fix it though.
Signed-off-by: Binbin <[email protected]>
Yeah, I suppose that makes sense. I guess the concern is that this increases the chances of two replicas getting promoted, which can result in some data loss. If replica A takes over after winning the election, it will start serving some traffic; if replica C is then independently sent a takeover, it will likely not have had time to reconfigure to A, and will take over and drop all of the writes that A received. I suppose I'm not that worried about the new behavior, since manual failovers can already cause data loss.
You can materially degrade the end user experience but still be "correct".
Can they? Can you elaborate? The documentation says "without any window for data loss". We need to fix the docs in that case. I was under the impression that it's safe to use manual failover. https://valkey.io/commands/cluster-failover/
Well, generally we can lose tail data because we do async replication. Force and takeover can also cause data loss, since they are just forcing their way into the primaries without being the leader. There is also the case basically outlined here: a node could have already won the leader election for a shard, but the rest of the shard doesn't know, and it starts accepting traffic. An operator could force a replica to take over from what it thinks is the primary, without knowing there is another primary. The bound was previously node timeout * 2; now we removed that guard. We could guard against this by having a failover request include which node it thinks owns the slots it is taking over.
So a plain manual failover can lose tail data? Don't we have the client pause and the offset wait? Even though we do async replication, the wait should ensure that all traffic is consumed. I know force and takeover can cause data loss; that is expected.
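For reference, the offset wait happens on the replica side before it starts the election. A paraphrased sketch of that gate (names follow the valkey source, e.g. clusterHandleManualFailover and mf_primary_offset, but may not match exactly):

```c
/* Paraphrased sketch of the replica-side manual failover gate. The primary
 * pauses clients and announces its replication offset; the replica starts
 * the election only once it has consumed everything up to that offset,
 * which is why a plain manual failover should not lose tail data. */
void clusterHandleManualFailover(void) {
    if (server.cluster->mf_end == 0) return;              /* No manual failover in progress. */
    if (server.cluster->mf_can_start) return;             /* Election already authorized. */
    if (server.cluster->mf_primary_offset == -1) return;  /* Primary offset not yet received. */
    if (server.cluster->mf_primary_offset == replicationGetReplicaOffset()) {
        /* We are in sync with the paused primary: start the election. */
        server.cluster->mf_can_start = 1;
    }
}
```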
OK, I get this point. There is a timing window where the admin doesn't know there is a winner. In that case, if the admin is doing a "force" or a "takeover", this data loss should be expected. It is a race, I guess.
Sorry for my bad English; can you elaborate on this a bit more? So you are saying that FAILOVER_AUTH_REQUESTs would carry the old primary info, and we would only vote if the old primary info matches my local view. And we would do the same thing with FAILOVER_AUTH_ACK: only accept it when the ack includes the old primary info and it matches my local view. I think this can solve some problems, but we still have a (rare) timing window where not all the nodes know the new winner.
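This guard was only proposed in the discussion, not implemented. Purely as an illustration, a hypothetical voter-side version might look like this, where claimed_primary and lookupNodeByName are invented names for the sketch:

```c
/* Hypothetical sketch of the proposed guard; nothing like this exists in
 * the codebase. The AUTH_REQUEST would carry the requester's view of the
 * current slot owner (claimed_primary, an invented field), and the voter
 * would deny the vote if that view disagrees with its own. */
clusterNode *owner = lookupNodeByName(request->claimed_primary); /* invented helper */
if (owner == NULL || owner != node->replicaof) {
    /* The requester wants to take over from a node we no longer consider
     * the primary of that shard; deny the vote to avoid a double takeover. */
    return;
}
/* As noted above, this still leaves a (rare) window where not every
 * voter has learned about the new winner yet. */
```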
I thought it was possible for another replica to start and win an election while the normal failover was ongoing, start serving traffic, and then have the manual failover complete and take over, which loses the data from the intermediary primary. I suppose we could say that as long as there are no node failures during the failover, it doesn't lose data.
So my main concern is an automatic failover followed by a manual failover, where the manual failover takes over from the old primary and not the one from the automatic failover. I suppose this case is unlikely anyway, since the primary would have to be both healthy (able to progress in the manual failover) and unhealthy in a way that allowed the automatic failover to happen. The new primary from the automatic failover might have started serving traffic, and those writes might be lost in the manual failover.
These are interesting cases. They seem very unlikely, but they're not theoretically impossible. As a thought experiment, how hard would it be to reproduce these scenarios in test cases?
I think my recommendation would be for admins/operators to ensure they have a clear view of the cluster state before performing a manual failover.
This is a follow-up of valkey-io#1305; we now decided to apply the same change to automatic failover as well, that is, move forward with removing the limit for both automatic and manual failovers.

Quote from Ping during the review: Note that we already debounce transient primary failures with node timeout, ensuring failover is only triggered after sustained outages. Election timing is naturally staggered by replica spacing, making the likelihood of simultaneous elections from replicas of the same shard very low. The one-vote-per-epoch rule further throttles retries and ensures orderly elections. On top of that, quorum-based primary failure confirmation, cluster-state convergence, and slot ownership validation are all built into the process.

Quote from Madelyn during the review: It's against the specific primary. It's to prevent double failovers. If a primary just took over, we don't want someone else to try to take over; we want to give the new primary some amount of time to take over. I have not seen this issue though; it might have been over-optimizing? The double failure mode, where a node fails and then another node fails within the node timeout, also doesn't seem that common.

So the conclusion is that we all agreed to remove it completely; it will make the code a lot simpler. And if there are other specific edge cases we are missing, we will fix them in another way. See discussion valkey-io#1305 for more information.

Signed-off-by: Binbin <[email protected]>
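In code terms, the follow-up presumably reduces to deleting the throttle block entirely; a paraphrased before/after sketch, not the exact diff:

```c
/* Before (#1305): only manual failovers were exempt from the throttle. */
if (!force_ack && mstime() - node->replicaof->voted_time < server.cluster_node_timeout * 2) return;

/* After (#1356): the block above is simply removed. The remaining rate
 * limiters are the one-vote-per-epoch rule and the quorum-based failure
 * confirmation mentioned in the quotes above. */
```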
So the conclusion is that we all agreed to remove it completely. The new PR is #1356