release-21.1: kvserver: consider suspect stores "live" for computing quorum #68552
Thanks for opening a backport. Please check the backport criteria before merging:
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
Add a brief release justification to the body of your PR to justify this backport. Some other things to consider:
Yeah, those do seem like they could be related. It should be fine to wait on merging this backport for a bit longer.
8c74677 to 5e17dbd
Backport 1/1 commits from #67714.
/cc @cockroachdb/release
Previously, when making the determination of whether a range could achieve
quorum, the allocator ignored "suspect" stores. In other words, a range with 3
replicas would be considered unavailable for rebalancing decisions if it had 2
or more replicas on stores that are marked suspect.
This meant that if a given cluster had multiple nodes missing their liveness
heartbeats intermittently, operations like node decommissioning would never
make progress past a certain point (the replicate queue would never decide to
move replicas away because it would think their ranges are unavailable, even
though they're really not).
This patch fixes this by slightly altering the state transitions for how stores
go in and out of "suspect" and by having the replica rebalancing code
specifically ask for suspect stores to be included in the set of "live"
replicas when it makes the determination of whether a given range can achieve
quorum.
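The change in the quorum check described above can be illustrated with a minimal sketch. This is not the actual kvserver code: the `storeStatus` type, its constants, and the `countLive` and `canAchieveQuorum` helpers are hypothetical names invented for this illustration, and the real store pool and allocator track considerably more state.

```go
package main

import "fmt"

// storeStatus is a hypothetical, simplified stand-in for the per-store
// state tracked by the store pool; the real kvserver types differ.
type storeStatus int

const (
	storeLive storeStatus = iota
	storeSuspect
	storeDead
)

// countLive returns how many of a range's replicas sit on stores that count
// as live. When includeSuspect is true, suspect stores are counted as well.
func countLive(replicas []storeStatus, includeSuspect bool) int {
	n := 0
	for _, s := range replicas {
		if s == storeLive || (includeSuspect && s == storeSuspect) {
			n++
		}
	}
	return n
}

// canAchieveQuorum reports whether a strict majority of replicas is live.
func canAchieveQuorum(replicas []storeStatus, includeSuspect bool) bool {
	quorum := len(replicas)/2 + 1
	return countLive(replicas, includeSuspect) >= quorum
}

func main() {
	// A range with 3 replicas, 2 of them on suspect stores.
	r := []storeStatus{storeLive, storeSuspect, storeSuspect}
	fmt.Println(canAchieveQuorum(r, false)) // old behavior: false
	fmt.Println(canAchieveQuorum(r, true))  // with suspect stores included: true
}
```

With suspect stores excluded, the 3-replica range above looks unavailable and would be skipped; with them included, it is treated as able to reach quorum, so replicas can still be moved off a decommissioning node.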
Release note (bug fix): A bug introduced in 21.1.5 that prevented nodes from
decommissioning in a cluster with multiple nodes intermittently missing their
liveness heartbeats has been fixed.