Manual failover vote is not limited by two times the node timeout #1305

Merged
merged 3 commits into valkey-io:unstable from enjoy-binbin:force_ack on Nov 19, 2024

Conversation

enjoy-binbin
Member

@enjoy-binbin enjoy-binbin commented Nov 14, 2024

This limit should not restrict manual failover, otherwise in some
scenarios, manual failover will time out.

For example, if some FAILOVER_AUTH_REQUESTs or some FAILOVER_AUTH_ACKs
are lost during a manual failover, it cannot vote in the second manual
failover. Or in a mixed scenario of plain failover and manual failover,
it cannot vote for the subsequent manual failover.

The problem with retrying the manual failover is that the manual failover
pauses clients for 5 seconds on the primary side, so retrying every time a
manual failover times out is a bad move.

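A minimal sketch of what the change amounts to in src/cluster_legacy.c, illustrative only and not the exact diff; the flag and field names (mflags, CLUSTERMSG_FLAG0_FORCEACK, replicaof, voted_time) are assumptions taken from the surrounding vote handler:

/* Sketch: inside clusterSendFailoverAuthIfNeeded(), the auth request is
 * flagged as a manual failover via FORCEACK, and that flag now bypasses
 * the "two times the node timeout" vote throttle. */
int force_ack = request->mflags[0] & CLUSTERMSG_FLAG0_FORCEACK;

/* We did not vote for a replica of this primary within two times the node
 * timeout. Not strictly needed for correctness, but it keeps the base case
 * more linear. This limitation does not restrict manual failover: if a user
 * initiates one, we need to allow the vote, otherwise it may time out. */
if (!force_ack &&
    mstime() - node->replicaof->voted_time < server.cluster_node_timeout * 2) {
    return; /* Voted too recently for a replica of this primary. */
}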

codecov bot commented Nov 14, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.68%. Comparing base (32f7541) to head (d0bd282).
Report is 14 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1305      +/-   ##
============================================
- Coverage     70.69%   70.68%   -0.01%     
============================================
  Files           115      115              
  Lines         63153    63158       +5     
============================================
+ Hits          44643    44646       +3     
- Misses        18510    18512       +2     
Files with missing lines Coverage Δ
src/cluster_legacy.c 86.39% <100.00%> (+0.20%) ⬆️

... and 13 files with indirect coverage changes


@madolson
Member

For example, if some FAILOVER_AUTH_REQUESTs or some FAILOVER_AUTH_ACKs
are lost during a manual failover, it cannot vote in the second manual
failover. Or in a mixed scenario of plain failover and manual failover,
it cannot vote for the subsequent manual failover.

I'm not sure I agree with this. I think there should be some built in timeout into the system and you should retry.

@enjoy-binbin
Member Author

The problem with retrying the manual failover is that the manual failover pauses clients for 5 seconds on the primary side, so retrying every time a manual failover times out is a bad move.

Contributor

@zuiderkwast zuiderkwast left a comment


The problem with retrying the manual failover is that the manual failover pauses clients for 5 seconds on the primary side, so retrying every time a manual failover times out is a bad move.

Yes, client pause for a long time is a bad move. ♟️ ❌

We already had this comment:

/* We did not voted for a replica about this primary for two
 * times the node timeout. This is not strictly needed for correctness
 * of the algorithm but makes the base case more linear. */

Hm, not strictly needed for correctness means that it's OK to change it. It doesn't affect correctness. You added this to the same comment:

 * This limitation does not restrict manual failover. If a user initiates
 * a manual failover, we need to allow it to vote, otherwise the manual
 * failover may time out. */

I think it's safe. I like the fix. The test cases look good too. Just some nits.

Co-authored-by: Viktor Söderqvist <[email protected]>
Signed-off-by: Binbin <[email protected]>
Contributor

@zuiderkwast zuiderkwast left a comment


Looks good. I added a comment about a log message, because you touched it. :) We don't have to fix it though.

@enjoy-binbin enjoy-binbin added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Nov 17, 2024
@hwware hwware merged commit ee386c9 into valkey-io:unstable Nov 19, 2024
48 checks passed
@enjoy-binbin enjoy-binbin deleted the force_ack branch November 19, 2024 16:21
@enjoy-binbin
Member Author

I was hoping that @PingXie and @madolson could take another look at this.

@madolson
Member

The problem with retrying the manual failover is that the manual failover pauses clients for 5 seconds on the primary side, so retrying every time a manual failover times out is a bad move.

Yeah, I suppose that makes sense. I guess the concern is that this increases the chances of two replicas getting promoted, which can result in some data loss. If replica A takes over after winning the election, it will start serving some traffic; if replica C was independently sent a takeover, it will likely not have had time to reconfigure to A, and will take over and drop all of the writes that A received.

I suppose I'm not that worried about the new behavior since manual failovers already can cause data loss.

Hm, not strictly needed for correctness means that it's OK to change it.

You can materially degrade the end user experience but still be "correct".

@zuiderkwast
Contributor

I suppose I'm not that worried about the new behavior since manual failovers already can cause data loss.

Can they? Can you elaborate? The documentation says "without any window for data loss". We need to fix the docs in that case. I was under the impression that it's safe to use manual failover. https://valkey.io/commands/cluster-failover/

A manual failover is a special kind of failover that is usually executed when there are no actual failures, but we wish to swap the current master with one of its replicas (which is the node we send the command to), in a safe way, without any window for data loss.

@madolson
Member

Can they? Can you elaborate?

Well, generally we can lose tail data because we do async replication. Force and takeover can also cause data loss, since they are just forcing their way into becoming primaries without being the leader.

There is also the case basically outlined here: a node could have already won the leader election for a shard and started accepting traffic, but the rest of the shard doesn't know yet. An operator could force a replica to take over from what it thinks is the primary, just not knowing there is another primary. The bound was previously nodetimeout * 2; now we removed that guard. We could guard against this by having a failover include who the sender thinks owns the slots it is taking over.

@enjoy-binbin
Member Author

Well, generally we can lose tail data because we do async replication. Force and takeover can also cause data loss, since they are just forcing their way into becoming primaries without being the leader.

So can a plain manual failover lose tail data? Don't we have the client pause and the offset wait? Even though we do async replication, the wait should ensure that all traffic is consumed. I know force and takeover can cause data loss; that is expected.

There is also the case basically outlined here: a node could have already won the leader election for a shard and started accepting traffic, but the rest of the shard doesn't know yet. An operator could force a replica to take over from what it thinks is the primary, just not knowing there is another primary.

OK, I got this point. There is a timing window where the admin doesn't know there is a winner; in this case, I think if the admin is doing a "force" or a "takeover", this data loss should be expected. It is a race, I guess.

We could guard against this by having a failover include who it thinks own the slots it is taking over.

Sorry, bad English, can you elaborate on this a bit more? So you are saying FAILOVER_AUTH_REQUESTs would carry the old primary info, and we only vote if the old primary info matches my local view. And we do the same thing for FAILOVER_AUTH_ACK: we only accept it when the ack includes the old primary info and it matches my local view. I think this can solve some problems, but we still have a (rare) timing window where not all the nodes know the new winner.
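A hypothetical sketch of the check being discussed, just to make the proposal concrete; nothing like this is implemented in this PR, the claimed_primary field on the request is invented for illustration, and other names are assumed from the cluster code:

/* Hypothetical: the auth request carries the sender's view of the old
 * primary, and the voter only grants the vote when that view matches
 * its own view of who currently owns the slots. */
clusterNode *claimed = clusterLookupNode(request->claimed_primary, CLUSTER_NAMELEN);
if (claimed == NULL || claimed != node->replicaof) {
    return; /* The requester's view of the old primary disagrees with ours. */
}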

@enjoy-binbin enjoy-binbin added the release-notes This issue should get a line item in the release notes label Nov 20, 2024
@madolson
Member

even though we do async replication, the wait should ensure that all traffic is consumed

I thought it was possible for another replica to start and win an election while the normal failover was ongoing, start serving traffic, and then have the manual failover complete and take over, which loses the data written in the meantime. I suppose we could say that as long as there are no node failures during the failover, it doesn't lose data.

OK, I got this point. There is a timing window where the admin doesn't know there is a winner; in this case, I think if the admin is doing a "force" or a "takeover", this data loss should be expected. It is a race, I guess.

So my main concern is an automatic failover followed by a manual failover, where the manual failover takes over from the old primary and not the one from the automatic failover. I suppose this case is unlikely anyways, since the primary is both healthy (able to progress in the failover) while still being unhealthy in a way that allowed the automatic failover to happen. The new primary from the automatic failover might have started serving traffic, which might get lost from the manual failover.

@zuiderkwast
Contributor

These are interesting cases. They seem very unlikely but they're not theoretically impossible.

As a thought experiment, how hard would it be to reproduce these scenarios in test cases?

@PingXie
Member

PingXie commented Nov 25, 2024

OK, I got this point. There is a timing window where the admin doesn't know there is a winner; in this case, I think if the admin is doing a "force" or a "takeover", this data loss should be expected. It is a race, I guess.

So my main concern is an automatic failover followed by a manual failover, where the manual failover takes over from the old primary and not the one from the automatic failover. I suppose this case is unlikely anyways, since the primary is both healthy (able to progress in the failover) while still being unhealthy in a way that allowed the automatic failover to happen. The new primary from the automatic failover might have started serving traffic, which might get lost from the manual failover.

I think takeover and force are in a completely different category from automatic failover and normal manual failover. These are essentially back doors designed for last-resort scenarios, where admins/operators must be fully aware of the risks involved, including potential data loss or inconsistencies, which btw is an existing risk.

My recommendation would be for admins/operators to ensure they have a clear view of the cluster state before performing takeover or force. These commands are meant for extraordinary situations and should be used with caution and full knowledge of their implications.

enjoy-binbin added a commit to enjoy-binbin/valkey that referenced this pull request Nov 26, 2024
This is a follow-up of valkey-io#1305; we have now decided to apply the
same change to automatic failover as well, that is, to move forward with
removing the limit for both automatic and manual failovers.

Quote from Ping during the review:
Note that we already debounce transient primary failures with node
timeout, ensuring failover is only triggered after sustained outages.
Election timing is naturally staggered by replica spacing, making the
likelihood of simultaneous elections from replicas of the same shard
very low. The one-vote-per-epoch rule further throttles retries and
ensures orderly elections. On top of that, quorum-based primary failure
confirmation, cluster-state convergence, and slot ownership validation
are all built into the process.

Quote from Madelyn during the review:
It's against the specific primary. It's to prevent double failovers:
if a primary just took over, we don't want someone else to try to
take over; we want to give the new primary some amount of time.
I have not seen this issue though, it might have been over-optimizing?
The double failure mode, where a node fails and then another node fails
within the node timeout, also doesn't seem that common either though.

So the conclusion is that we all agreed to remove it completely;
it will make the code a lot simpler. And if there are other specific
edge cases we are missing, we will fix them in another way.

See discussion valkey-io#1305 for more information.

Signed-off-by: Binbin <[email protected]>
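As context for the "election timing is naturally staggered" point in the quote above, here is a minimal sketch of how a replica delays its election request in the cluster failover logic: a fixed 500 ms base, up to 500 ms of random jitter, plus one second per rank position, where rank orders the replicas of a shard by replication offset. Variable names are illustrative, and a manual failover skips the delay entirely.

/* Staggered election start: replicas of the same shard are unlikely to
 * ask for votes at the same moment. */
mstime_t failover_auth_time = mstime() + 500 + rand() % 500;    /* base + jitter */
failover_auth_time += failover_auth_rank * 1000;                /* 1 s per rank */
if (manual_failover_in_progress) failover_auth_time = mstime(); /* MF: no delay */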
@enjoy-binbin
Member Author

So the conclusion is that we all agreed to remove it completely;
it will make the code a lot simpler. And if there are other specific
edge cases we are missing, we will fix them in another way. See #1305 (review)

The new PR is #1356

enjoy-binbin added a commit that referenced this pull request Dec 15, 2024
…1356)

kronwerk pushed a commit to kronwerk/valkey that referenced this pull request Jan 27, 2025
…alkey-io#1356)
