Manual failover vote is not limited by two times the node timeout #1305

Merged
merged 3 commits into valkey-io:unstable from enjoy-binbin:force_ack on Nov 19, 2024

Conversation

enjoy-binbin
Member

@enjoy-binbin enjoy-binbin commented Nov 14, 2024

This limit should not restrict manual failover, otherwise in some
scenarios, manual failover will time out.

For example, if some FAILOVER_AUTH_REQUESTs or some FAILOVER_AUTH_ACKs
are lost during a manual failover, it cannot vote in the second manual
failover. Or in a mixed scenario of plain failover and manual failover,
it cannot vote for the subsequent manual failover.

The problem with retrying the manual failover is that the manual failover
pauses clients for 5 seconds on the primary side, so retrying every time a
manual failover times out is a bad move.

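A minimal sketch of what the change amounts to in src/cluster_legacy.c, illustrative only and not the exact diff; the flag and field names (mflags, CLUSTERMSG_FLAG0_FORCEACK, replicaof, voted_time) are assumptions taken from the surrounding vote handler:

/* Sketch: inside clusterSendFailoverAuthIfNeeded(), the auth request is
 * flagged as a manual failover via FORCEACK, and that flag now bypasses
 * the "two times the node timeout" vote throttle. */
int force_ack = request->mflags[0] & CLUSTERMSG_FLAG0_FORCEACK;

/* We did not vote for a replica of this primary within two times the node
 * timeout. Not strictly needed for correctness, but it keeps the base case
 * more linear. This limitation does not restrict manual failover: if a user
 * initiates one, we need to allow the vote, otherwise it may time out. */
if (!force_ack &&
    mstime() - node->replicaof->voted_time < server.cluster_node_timeout * 2) {
    return; /* Voted too recently for a replica of this primary. */
}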

codecov bot commented Nov 14, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.68%. Comparing base (32f7541) to head (d0bd282).
Report is 14 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1305      +/-   ##
============================================
- Coverage     70.69%   70.68%   -0.01%     
============================================
  Files           115      115              
  Lines         63153    63158       +5     
============================================
+ Hits          44643    44646       +3     
- Misses        18510    18512       +2     
Files with missing lines Coverage Δ
src/cluster_legacy.c 86.39% <100.00%> (+0.20%) ⬆️

... and 13 files with indirect coverage changes


@madolson
Member

For example, if some FAILOVER_AUTH_REQUESTs or some FAILOVER_AUTH_ACKs
are lost during a manual failover, it cannot vote in the second manual
failover. Or in a mixed scenario of plain failover and manual failover,
it cannot vote for the subsequent manual failover.

I'm not sure I agree with this. I think there should be some built in timeout into the system and you should retry.

@enjoy-binbin
Member Author

The problem with retrying the manual failover is that the manual failover pauses clients for 5 seconds on the primary side, so retrying every time a manual failover times out is a bad move.

Contributor

@zuiderkwast zuiderkwast left a comment


The problem with retrying the manual failover is that the manual failover pauses clients for 5 seconds on the primary side, so retrying every time a manual failover times out is a bad move.

Yes, client pause for a long time is a bad move. ♟️ ❌

We already had this comment:

/* We did not voted for a replica about this primary for two
 * times the node timeout. This is not strictly needed for correctness
 * of the algorithm but makes the base case more linear. */

Hm, not strictly needed for correctness means that it's OK to change it. It doesn't affect correctness. You added this to the same comment:

 * This limitation does not restrict manual failover. If a user initiates
 * a manual failover, we need to allow it to vote, otherwise the manual
 * failover may time out. */

I think it's safe. I like the fix. The test cases look good too. Just some nits.

Co-authored-by: Viktor Söderqvist <[email protected]>
Signed-off-by: Binbin <[email protected]>
Contributor

@zuiderkwast zuiderkwast left a comment


Looks good. I added a comment about a log message, because you touched it. :) We don't have to fix it though.

@enjoy-binbin enjoy-binbin added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Nov 17, 2024
@hwware hwware merged commit ee386c9 into valkey-io:unstable Nov 19, 2024
48 checks passed
@enjoy-binbin enjoy-binbin deleted the force_ack branch November 19, 2024 16:21
@enjoy-binbin
Member Author

I was hoping that @PingXie and @madolson could take another look at this.

@madolson
Member

The problem with retrying the manual failover is that the manual failover pauses clients for 5 seconds on the primary side, so retrying every time a manual failover times out is a bad move.

Yeah, I suppose that makes sense. I guess the concern is that this increases the chances of two replicas getting promoted, which can result in some data loss. If replica A takes over after winning the election, it will start serving some traffic; if replica C was independently sent a takeover, it will likely not have had time to reconfigure to A, and will take over and drop all of the writes that A received.

I suppose I'm not that worried about the new behavior since manual failovers already can cause data loss.

Hm, not strictly needed for correctness means that it's OK to change it.

You can materially degrade the end user experience but still be "correct".

@zuiderkwast
Contributor

I suppose I'm not that worried about the new behavior since manual failovers already can cause data loss.

Can they? Can you elaborate? The documentation says "without any window for data loss". We need to fix the docs in that case. I was under the impression that it's safe to use manual failover. https://valkey.io/commands/cluster-failover/

A manual failover is a special kind of failover that is usually executed when there are no actual failures, but we wish to swap the current master with one of its replicas (which is the node we send the command to), in a safe way, without any window for data loss.

@madolson
Member

Can they? Can you elaborate?

Well, generally we can lose tail data because we do async replication. Force and takeover can also cause data loss, since they are just forcing their way into becoming primaries without being the leader.

There is also the case basically outlined here: a node could have already won the leader election for a shard and started accepting traffic, but the rest of the shard doesn't know yet. An operator could force a replica to take over from what it thinks is the primary, just not knowing there is another primary. The bound was previously nodetimeout * 2; now we removed that guard. We could guard against this by having a failover include who the sender thinks owns the slots it is taking over.

@enjoy-binbin
Member Author

Well, generally we can lose tail data because we do async replication. Force and takeover can also cause data loss, since they are just forcing their way into becoming primaries without being the leader.

So can a plain manual failover lose tail data? Don't we have the client pause and the offset wait? Even though we do async replication, the wait should ensure that all traffic is consumed. I know force and takeover can cause data loss; that is expected.

There is also the case basically outlined here: a node could have already won the leader election for a shard and started accepting traffic, but the rest of the shard doesn't know yet. An operator could force a replica to take over from what it thinks is the primary, just not knowing there is another primary.

OK, I got this point. There is a timing window where the admin doesn't know there is a winner; in this case, I think if the admin is doing a "force" or a "takeover", this data loss should be expected. It is a race, I guess.

We could guard against this by having a failover include who it thinks own the slots it is taking over.

Sorry, bad English, can you elaborate on this a bit more? So you are saying FAILOVER_AUTH_REQUESTs would carry the old primary info, and we only vote if the old primary info matches my local view. And we do the same thing for FAILOVER_AUTH_ACK: we only accept it when the ack includes the old primary info and it matches my local view. I think this can solve some problems, but we still have a (rare) timing window where not all the nodes know the new winner.
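A hypothetical sketch of the check being discussed, just to make the proposal concrete; nothing like this is implemented in this PR, the claimed_primary field on the request is invented for illustration, and other names are assumed from the cluster code:

/* Hypothetical: the auth request carries the sender's view of the old
 * primary, and the voter only grants the vote when that view matches
 * its own view of who currently owns the slots. */
clusterNode *claimed = clusterLookupNode(request->claimed_primary, CLUSTER_NAMELEN);
if (claimed == NULL || claimed != node->replicaof) {
    return; /* The requester's view of the old primary disagrees with ours. */
}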

@enjoy-binbin enjoy-binbin added the release-notes This issue should get a line item in the release notes label Nov 20, 2024
@madolson
Member

even though we do async replication, the wait should ensure that all traffic is consumed

I thought it was possible for another replica to start and win an election while the normal failover was ongoing, start serving traffic, and then have the manual failover complete and take over, which loses the data written in the meantime. I suppose we could say that as long as there are no node failures during the failover, it doesn't lose data.

OK, I got this point. There is a timing window where the admin doesn't know there is a winner; in this case, I think if the admin is doing a "force" or a "takeover", this data loss should be expected. It is a race, I guess.

So my main concern is an automatic failover followed by a manual failover, where the manual failover takes over from the old primary and not the one from the automatic failover. I suppose this case is unlikely anyways, since the primary is both healthy (able to progress in the failover) while still being unhealthy in a way that allowed the automatic failover to happen. The new primary from the automatic failover might have started serving traffic, which might get lost from the manual failover.

@zuiderkwast
Contributor

These are interesting cases. They seem very unlikely but they're not theoretically impossible.

As a thought experiment, how hard would it be to reproduce these scenarios in test cases?

@PingXie
Member

PingXie commented Nov 25, 2024

OK, I got this point. There is a timing window where the admin doesn't know there is a winner; in this case, I think if the admin is doing a "force" or a "takeover", this data loss should be expected. It is a race, I guess.

So my main concern is an automatic failover followed by a manual failover, where the manual failover takes over from the old primary and not the one from the automatic failover. I suppose this case is unlikely anyways, since the primary is both healthy (able to progress in the failover) while still being unhealthy in a way that allowed the automatic failover to happen. The new primary from the automatic failover might have started serving traffic, which might get lost from the manual failover.

I think takeover and force are in a completely different category from automatic failover and normal manual failover. These are essentially back doors designed for last-resort scenarios, where admins/operators must be fully aware of the risks involved, including potential data loss or inconsistencies, which btw is an existing risk.

My recommendation would be for admins/operators to ensure they have a clear view of the cluster state before performing takeover or force. These commands are meant for extraordinary situations and should be used with caution and full knowledge of their implications.

enjoy-binbin added a commit to enjoy-binbin/valkey that referenced this pull request Nov 26, 2024
This is a follow-up of valkey-io#1305; we have now decided to apply the
same change to automatic failover as well, that is, to move forward with
removing the limit for both automatic and manual failovers.

Quote from Ping during the review:
Note that we already debounce transient primary failures with node
timeout, ensuring failover is only triggered after sustained outages.
Election timing is naturally staggered by replica spacing, making the
likelihood of simultaneous elections from replicas of the same shard
very low. The one-vote-per-epoch rule further throttles retries and
ensures orderly elections. On top of that, quorum-based primary failure
confirmation, cluster-state convergence, and slot ownership validation
are all built into the process.

Quote from Madelyn during the review:
It's against the specific primary. It's to prevent double failovers:
if a primary just took over, we don't want someone else to try to
take over; we want to give the new primary some amount of time.
I have not seen this issue though, it might have been over-optimizing?
The double failure mode, where a node fails and then another node fails
within the node timeout, also doesn't seem that common either though.

So the conclusion is that we all agreed to remove it completely;
it will make the code a lot simpler. And if there are other specific
edge cases we are missing, we will fix them in another way.

See discussion valkey-io#1305 for more information.

Signed-off-by: Binbin <[email protected]>
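As context for the "election timing is naturally staggered" point in the quote above, here is a minimal sketch of how a replica delays its election request in the cluster failover logic: a fixed 500 ms base, up to 500 ms of random jitter, plus one second per rank position, where rank orders the replicas of a shard by replication offset. Variable names are illustrative, and a manual failover skips the delay entirely.

/* Staggered election start: replicas of the same shard are unlikely to
 * ask for votes at the same moment. */
mstime_t failover_auth_time = mstime() + 500 + rand() % 500;    /* base + jitter */
failover_auth_time += failover_auth_rank * 1000;                /* 1 s per rank */
if (manual_failover_in_progress) failover_auth_time = mstime(); /* MF: no delay */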
@enjoy-binbin
Member Author

So the conclusion is that we all agreed to remove it completely;
it will make the code a lot simpler. And if there are other specific
edge cases we are missing, we will fix them in another way. See #1305 (review)

The new PR is #1356

enjoy-binbin added a commit that referenced this pull request Dec 15, 2024
…1356)

kronwerk pushed a commit to kronwerk/valkey that referenced this pull request Jan 27, 2025
…alkey-io#1356)
