Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatic failover vote is not limited by two times the node timeout #1356

Merged
merged 1 commit into from
Dec 15, 2024

Conversation

enjoy-binbin
Copy link
Member

This is a follow of #1305, we now decided to apply the same change
to automatic failover as well, that is, move forward with removing
it for both automatic and manual failovers.

Quote from Ping during the review:
Note that we already debounce transient primary failures with node
timeout, ensuring failover is only triggered after sustained outages.
Election timing is naturally staggered by replica spacing, making the
likelihood of simultaneous elections from replicas of the same shard
very low. The one-vote-per-epoch rule further throttles retries and
ensures orderly elections. On top of that, quorum-based primary failure
confirmation, cluster-state convergence, and slot ownership validation
are all built into the process.

Quote from Madelyn during the review:
It against the specific primary. It's to prevent double failovers.
If a primary just took over we don't want someone else to try to
take over and give the new primary some amount of time to take over.
I have not seen this issue though, it might have been over optimizing?
The double failure mode, where a node fails and then another node fails
within the nodetimeout also doesn't seem that common either though.

So the conclusion is that we all agreed to remove it completely,
it will make the code a lot simpler. And if there is other specific
edge cases we are missing, we will fix it in other way.

See discussion #1305 for more information.

This is a follow of valkey-io#1305, we now decided to apply the same change
to automatic failover as well, that is, move forward with removing
it for both automatic and manual failovers.

Quote from Ping during the review:
Note that we already debounce transient primary failures with node
timeout, ensuring failover is only triggered after sustained outages.
Election timing is naturally staggered by replica spacing, making the
likelihood of simultaneous elections from replicas of the same shard
very low. The one-vote-per-epoch rule further throttles retries and
ensures orderly elections. On top of that, quorum-based primary failure
confirmation, cluster-state convergence, and slot ownership validation
are all built into the process.

Quote from Madelyn during the review:
It against the specific primary. It's to prevent double failovers.
If a primary just took over we don't want someone else to try to
take over and give the new primary some amount of time to take over.
I have not seen this issue though, it might have been over optimizing?
The double failure mode, where a node fails and then another node fails
within the nodetimeout also doesn't seem that common either though.

So the conclusion is that we all agreed to remove it completely,
it will make the code a lot simpler. And if there is other specific
edge cases we are missing, we will fix it in other way.

See discussion valkey-io#1305 for more information.

Signed-off-by: Binbin <[email protected]>
Copy link

codecov bot commented Nov 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.76%. Comparing base (469d41f) to head (2c1c9cf).
Report is 52 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1356      +/-   ##
============================================
+ Coverage     70.62%   70.76%   +0.13%     
============================================
  Files           117      117              
  Lines         63313    63308       -5     
============================================
+ Hits          44716    44801      +85     
+ Misses        18597    18507      -90     
Files with missing lines Coverage Δ
src/cluster_legacy.c 86.60% <ø> (+0.04%) ⬆️

... and 11 files with indirect coverage changes

Copy link
Member

@PingXie PingXie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @enjoy-binbin!

@zuiderkwast
Copy link
Contributor

@enjoy-binbin and @PingXie you seem convinced it shall be deleted. I didn't think about this enough to really understand it, but I trust you. The code looks good.

I hope @madolson will confirm.

@enjoy-binbin
Copy link
Member Author

for the record, Madelyn mention in here #1305 (comment) that she agree with removing this

I guess I agree with removing it. It seems like it will make the code a lot simpler.

@enjoy-binbin enjoy-binbin added the release-notes This issue should get a line item in the release notes label Dec 8, 2024
@enjoy-binbin enjoy-binbin merged commit ad24220 into valkey-io:unstable Dec 15, 2024
48 checks passed
@enjoy-binbin enjoy-binbin deleted the voted_time branch December 15, 2024 04:09
kronwerk pushed a commit to kronwerk/valkey that referenced this pull request Jan 27, 2025
…alkey-io#1356)

This is a follow of valkey-io#1305, we now decided to apply the same change
to automatic failover as well, that is, move forward with removing
it for both automatic and manual failovers.

Quote from Ping during the review:
Note that we already debounce transient primary failures with node
timeout, ensuring failover is only triggered after sustained outages.
Election timing is naturally staggered by replica spacing, making the
likelihood of simultaneous elections from replicas of the same shard
very low. The one-vote-per-epoch rule further throttles retries and
ensures orderly elections. On top of that, quorum-based primary failure
confirmation, cluster-state convergence, and slot ownership validation
are all built into the process.

Quote from Madelyn during the review:
It against the specific primary. It's to prevent double failovers.
If a primary just took over we don't want someone else to try to
take over and give the new primary some amount of time to take over.
I have not seen this issue though, it might have been over optimizing?
The double failure mode, where a node fails and then another node fails
within the nodetimeout also doesn't seem that common either though.

So the conclusion is that we all agreed to remove it completely,
it will make the code a lot simpler. And if there is other specific
edge cases we are missing, we will fix it in other way.

See discussion valkey-io#1305 for more information.

Signed-off-by: Binbin <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-notes This issue should get a line item in the release notes
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

3 participants