kvserver: consider suspect stores "live" for computing quorum #67714

Conversation

@aayushshah15 (Contributor) commented Jul 16, 2021

Previously, when making the determination of whether a range could achieve
quorum, the allocator ignored "suspect" stores. In other words, a range with 3
replicas would be considered unavailable for rebalancing decisions if it had 2
or more replicas on stores that are marked suspect.

This meant that if a given cluster had multiple nodes missing their liveness
heartbeats intermittently, operations like node decommissioning would never
make progress past a certain point (the replicate queue would never decide to
move replicas away because it would think their ranges are unavailable, even
though they're really not).

This patch fixes this by slightly altering the state transitions for how stores
go in and out of "suspect" and by having the replica rebalancing code
specifically ask for suspect stores to be included in the set of "live"
replicas when it makes the determination of whether a given range can achieve
quorum.
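As a rough, self-contained illustration of the behavior described above (this is only a sketch with made-up names, not the actual allocator code; the storeStatusAvailable/storeStatusSuspect statuses are the ones discussed in the review below):

package main

import "fmt"

type storeStatus int

const (
	storeStatusAvailable storeStatus = iota
	storeStatusSuspect
	storeStatusDead
)

// canAchieveQuorum reports whether a strict majority of a range's voters sit
// on stores that are available or merely suspect (i.e. recently missed a
// liveness heartbeat but are currently alive).
func canAchieveQuorum(voterStoreStatuses []storeStatus) bool {
	live := 0
	for _, s := range voterStoreStatuses {
		if s == storeStatusAvailable || s == storeStatusSuspect {
			live++
		}
	}
	return live >= len(voterStoreStatuses)/2+1
}

func main() {
	// A 3-replica range with 2 replicas on suspect stores: with this change it
	// is still considered able to achieve quorum, so rebalancing and
	// decommissioning can proceed.
	fmt.Println(canAchieveQuorum([]storeStatus{storeStatusAvailable, storeStatusSuspect, storeStatusSuspect}))
}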

Release note (bug fix): Fixed a bug, introduced in 21.1.5, that prevented nodes
from decommissioning in a cluster that had multiple nodes intermittently
missing their liveness heartbeats.

@lunevalex (Collaborator)

This LGTM, thank you for making the change. Would be curious to hear what @nvanbenschoten has to say.

@nvanbenschoten (Member) left a comment

This seems reasonable to me. It does make me wonder about other cases where this dependency between a range and the liveness of each of its replicas can go wrong. Are there cases where a replica being classified with a store status of storeStatusUnknown would cause unexpected behavior due to these quorum checks?

Also, could you link the support issue (if there is one) that this would address to this PR?

Reviewed 6 of 6 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15)


pkg/kv/kvserver/allocator.go, line 552 at r1 (raw file):

	// stores to be unavailable, just because their nodes have failed a liveness
	// heartbeat in the recent past. This means we won't move those replicas
	// elsewhere (for a regular rebalance or for decommissioning).

nit: create a const includeSuspectStores = true so that you can have the later call to liveAndDeadReplicas reference this comment as well.


pkg/kv/kvserver/replica_command.go, line 2142 at r1 (raw file):

			//
			// Note that the allocator will avoid rebalancing to stores that are
			// currently marked suspect.

Mind pointing at where this logic lives?


pkg/kv/kvserver/store_pool.go, line 247 at r1 (raw file):

	now time.Time, threshold time.Duration, nl NodeLivenessFunc, suspectDuration time.Duration,
) storeStatus {
	// During normal operation, we expect the state transitions for stores to look like the following:

This state diagram is a very good idea. Should we add in the other states as well?


pkg/kv/kvserver/store_pool.go, line 274 at r1 (raw file):

		// We don't want to suspect a node on startup or when it's first added to a
	// cluster, because we don't know its liveness yet. A node is only considered
		// suspect if it's been alive and fails to heartbeat liveness.

Is this a separate change? If so, could you help me understand the reason for it? Is it that we don't want to consider nodes with a liveness status of UNAVAILABLE as live for the purposes of rebalancing?

Does doing so have other effects? Or put another way, do we understand why this was written like this in the first place?


pkg/kv/kvserver/store_pool.go, line 665 at r1 (raw file):

// first for live replicas, and the second for dead replicas.
//
// - Replicas for  which liveness or deadness cannot be ascertained

nit: double space

@aayushshah15 force-pushed the 20210715_countSuspectStoresTowardsQuorum branch 2 times, most recently from a0a8923 to 20fd356 on July 18, 2021 20:49
@aayushshah15 (Contributor, Author) left a comment

could you link the support issue (if there is one) that this would address to this PR?

This isn't part of an explicit support issue. I was on a call with a customer who was having trouble decommissioning nodes out of their cluster (which had a bunch of nodes intermittently missing liveness heartbeats). We saw the following when we looked at the allocator logs for one of the ranges that hadn't been moved off a decommissioning node:
[screenshot of allocator logs]

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/kvserver/allocator.go, line 552 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: create a const includeSuspectStores = true so that you can have the later call to liveAndDeadReplicas reference this comment as well.

Done.


pkg/kv/kvserver/replica_command.go, line 2142 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Mind pointing at where this logic lives?

The rebalancing code filters stores by storeFilterThrottled here:

addTarget, removeTarget, details, ok := rq.allocator.RebalanceVoter(
	ctx,
	zone,
	repl.RaftStatus(),
	existingVoters,
	existingNonVoters,
	rangeUsageInfo,
	storeFilterThrottled,
)

storeFilterThrottled, in turn, filters out stores that are throttled as well as stores that are marked suspect here:

case storeStatusSuspect:
	aliveStoreCount++
	throttled = append(throttled, "throttled because the node is considered suspect")
	if filter != storeFilterThrottled && filter != storeFilterSuspect {
		storeDescriptors = append(storeDescriptors, *detail.desc)
	}
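To make the effect of that case concrete, here is a small self-contained sketch (the names mirror the snippet above; this is not the real store-pool code): a suspect store still counts as alive, but it is skipped as a rebalance target whenever the caller filters out throttled or suspect stores.

package main

import "fmt"

type storeFilter int

const (
	storeFilterNone storeFilter = iota
	storeFilterThrottled
	storeFilterSuspect
)

// classifySuspect mirrors the quoted case statement: a suspect store is
// counted toward the number of alive stores, but it is only offered as a
// candidate target when the caller did not ask to filter out
// throttled/suspect stores.
func classifySuspect(filter storeFilter) (countsAsAlive, isCandidateTarget bool) {
	countsAsAlive = true
	isCandidateTarget = filter != storeFilterThrottled && filter != storeFilterSuspect
	return countsAsAlive, isCandidateTarget
}

func main() {
	alive, target := classifySuspect(storeFilterThrottled)
	fmt.Println(alive, target) // true false: counted as alive, but not a rebalance target
}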


pkg/kv/kvserver/store_pool.go, line 247 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This is a very good idea for a state diagram. Should we add in the other states as well?

I tried that at first but I felt that it got a lot harder to read and didn't add as much value (for instance, all these states can directly transition to decommissioning -- which isn't surprising but makes the diagram uglier).

Let me know if you'd still like me to make this comprehensive.


pkg/kv/kvserver/store_pool.go, line 274 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is this is separate change? If so, could you help me understand the reason for it? Is it that we don't want to consider nodes with a liveness status of UNAVAILABLE as live for the purposes of rebalancing?

Does doing so have other effects? Or put another way, do we understand why this was written like this in the first place?

Before this patch, when a node failed a liveness heartbeat and had a UNAVAILABLE liveness record, it would directly go from storeStatusAvailable to storeStatusSuspect. In other words, storeStatusUnknown was only being used immediately after a store joined a cluster.

With this patch, when a node has a UNAVAILABLE liveness record, it will be in storeStatusUnknown until its first successful heartbeat. The first successful heartbeat takes a store in storeStatusUnknown to storeStatusSuspect.

Essentially, with this patch, we're disambiguating between stores that are currently dead (storeStatusUnknown) and stores that have failed a liveness heartbeat in the recent past (storeStatusSuspect).
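A self-contained sketch of those transitions (the function and parameter names below are assumptions for illustration, not the real storePool code):

package main

import (
	"fmt"
	"time"
)

type storeStatus int

const (
	storeStatusUnknown storeStatus = iota
	storeStatusSuspect
	storeStatusAvailable
)

// status sketches the lifecycle described above:
//
//	available --(liveness becomes UNAVAILABLE)----------------------> unknown
//	unknown   --(first successful heartbeat)------------------------> suspect
//	suspect   --(suspectDuration passes without a missed heartbeat)--> available
func status(now time.Time, livenessUnavailable bool, lastUnavailable time.Time, suspectDuration time.Duration) storeStatus {
	if livenessUnavailable {
		// Currently failing its liveness heartbeats: unknown, not yet suspect.
		return storeStatusUnknown
	}
	if now.Sub(lastUnavailable) < suspectDuration {
		// Heartbeating again, but it failed a heartbeat in the recent past.
		return storeStatusSuspect
	}
	return storeStatusAvailable
}

func main() {
	now := time.Now()
	// Recovered 10s ago with a 30s suspect window: still suspect.
	fmt.Println(status(now, false, now.Add(-10*time.Second), 30*time.Second) == storeStatusSuspect)
}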


pkg/kv/kvserver/store_pool.go, line 665 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: double space

Fixed.

@nvanbenschoten (Member) left a comment

:lgtm:

Reviewed 2 of 3 files at r2, 2 of 3 files at r3, 1 of 1 files at r4.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15)


pkg/kv/kvserver/store_pool.go, line 247 at r1 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

I tried that at first but I felt that it got a lot harder to read and didn't add as much value (for instance, all these states can directly transition to decommissioning -- which isn't surprising but makes the diagram uglier).

Let me know if you'd still like me to make this comprehensive.

👍


pkg/kv/kvserver/store_pool.go, line 274 at r1 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

Before this patch, when a node failed a liveness heartbeat and had a UNAVAILABLE liveness record, it would directly go from storeStatusAvailable to storeStatusSuspect. In other words, storeStatusUnknown was only being used immediately after a store joined a cluster.

With this patch, when a node has a UNAVAILABLE liveness record, it will be in storeStatusUnknown until its first successful heartbeat. The first successful heartbeat takes a store in storeStatusUnknown to storeStatusSuspect.

Essentially, with this patch, we're disambiguating between stores that are currently dead (storeStatusUnknown) and stores that have failed a liveness heartbeat in the recent past (storeStatusSuspect).

Got it, thanks. This was very helpful.

@aayushshah15 (Contributor, Author)

TFTRs!

bors r+

@craig (craig bot) commented Jul 28, 2021

Build succeeded:

@craig (craig bot) merged commit 8009ac6 into cockroachdb:master on Jul 28, 2021
@blathers-crl (bot) commented Jul 28, 2021

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 295338b to blathers/backport-release-21.1-67714: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Aug 12, 2021
Similar to cockroachdb#67714

Draining nodes were considered non-live by the allocator when it made the
determination of whether a range could achieve quorum. This meant that, for
instance, on a cluster with a replication factor of 5, if we had 3 or more
nodes marked draining, we (with a high likelihood) wouldn't be able to
decommission nodes from the cluster.

Furthermore, due to the same reason as above the system also would incorrectly
decide to not rebalance ranges that had more than a quorum of replicas on
draining nodes.

This patch fixes this problem by considering replicas on draining nodes as live
for the purposes of determining whether a range has quorum. This likely fixes a
considerable subset of "stuck decommissioning" issues we've seen in the wild.

Follows from https://github.com/cockroachlabs/support/issues/1105

Release note: None
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Aug 13, 2021
@andreimatei (Contributor) left a comment

Could we add any guidance for the new argument to liveAndDeadReplicas()? Asking because... I wouldn't know what to pass for it. The behavior of blocking a drain when other nodes are having trouble doesn't seem too bad to me - depending on how long a store is "suspect" following a missed heartbeat. Does a store stop being suspect as soon as it successfully heartbeats?

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

@aayushshah15 (Contributor, Author)

Could we add any guidance for the new argument to liveAndDeadReplicas()? Asking because... I wouldn't know what to pass for it.

Yeah, I'll make it clearer. I'll do it in #68652, which is a related patch (added you as a reviewer in case you want a look).

Does a store stop being suspect as soon as it successfully heartbeats?

The store actually enters the suspect state (from storeStatusUnknown -- which indicates a DEAD node liveness record) after its first successful heartbeat. Iff it doesn't miss any future liveness heartbeats for the time_after_store_suspect duration (default is 30s), it stops being suspect.

The behaviour of blocking a drain when other nodes are having trouble doesn't seem too bad to me

This bug meant that when a cluster (however large) was under duress (say 2 or more nodes struggling to heartbeat due to poor LSM), the user wouldn't be able to gracefully take those nodes out of the cluster at all.

@andreimatei (Contributor)

The behaviour of blocking a drain when other nodes are having trouble doesn't seem too bad to me

This bug meant that when a cluster (however large) was under duress (say 2 or more nodes struggling to heartbeat due to poor LSM), the user wouldn't be able to gracefully take those nodes out of the cluster at all.

Why is this exactly? What's the connection between removing replicas from a decommissioning node and the allocator logic for determining whether a particular range has quorum or not?

@aayushshah15 (Contributor, Author)

What's the connection between removing replicas from a decommissioning node and the allocator logic for determining whether a particular range has quorum or not?

So the way this is all supposed to work is that when a node is marked as decommissioning and all the stores learn about it via gossip, their allocators will all emit this action. Those stores' replicateQueues will then move replicas off of the decommissioning nodes.

If the allocator thinks that a range does not have quorum, it will emit this action instead, which leads to some of the "stuck" decommissioning issues some customers have seen.
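Roughly, as a self-contained sketch of the flow being described (the action names here are placeholders, not the real allocator constants behind the links above):

package main

import "fmt"

type allocatorAction string

const (
	// Placeholder names, not the actual allocator actions.
	actionReplaceDecommissioning allocatorAction = "move replica off decommissioning store"
	actionRangeUnavailable       allocatorAction = "range considered unavailable; do nothing"
	actionNoop                   allocatorAction = "no action"
)

// computeAction mirrors the decision described above: the replicate queue only
// gets an actionable result if a quorum of the range's voters is considered
// live; otherwise decommissioning appears "stuck".
func computeAction(liveVoters, totalVoters int, hasDecommissioningReplica bool) allocatorAction {
	if liveVoters < totalVoters/2+1 {
		return actionRangeUnavailable
	}
	if hasDecommissioningReplica {
		return actionReplaceDecommissioning
	}
	return actionNoop
}

func main() {
	// 3 voters, all counted live (suspect stores now included), one of them
	// on a decommissioning node.
	fmt.Println(computeAction(3, 3, true))
}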

@andreimatei (Contributor)

I see.
I'm still thinking about what the right thing to do is here. The comment on the no-quorum case is just:

// Do not take any replacement/removal action if we do not have a quorum of
// live voters.

Unfortunately the comment doesn't go into the motivation. I wonder if the point of that check was to protect against a range becoming even "more unavailable" if the allocator were to remove a voter, or if the check is just meant to remove noise from proposing changes that just block. Since atomic voter swaps were introduced, I think we no longer "replicate into unavailability", so I wonder if the check can simply go away altogether.

@aayushshah15 (Contributor, Author) commented Aug 16, 2021

if the check is just meant to remove noise from proposing changes that just block.

Yeah, this is my understanding. I don't see a case where we'd be at risk of further downreplication on an already unavailable range.

Since atomic voter swaps, I think we no longer "replicate into unavailability", so I wonder if the check can simply go away altogether.

We could probably remove that check and it will likely be copacetic -- those checks are meant just as generally defensive safeguards. However, we'll still have bugs as long as the slice of liveVoters is incorrect.

The motivation for this patch was that we know that suspect stores are currently live. They've just failed a liveness heartbeat at some point in the recent past. So, as long as we're not transferring new leases or replicas to such nodes, it doesn't make sense to consider them dead. This is roughly the rationale for the first commit from #68652 as well.

@andreimatei (Contributor)

The motivation for this patch was that we know that suspect stores are currently live. They've just failed a liveness heartbeat at some point in the recent past. So, as long as we're not transferring new leases or replicas to such nodes, it doesn't make sense to consider them dead. This is roughly the rationale for the first commit from #68652 as well.

Yes, I buy that. I'm generally unsure that the suspect state is a good idea.
So I don't think that suspect stores should hold up decommissioning. Even more, I wonder if "unknown" stores should hold it up. If the decommissioning store/node is itself "unknown", and there's another "unknown" replica, this patch leaves that situation to block the decommissioning. But should it? That's why I was thinking that perhaps it'd be better and also simpler if those checks would simply go away. Or is it perhaps impossible for the local store to be unknown?

aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Aug 21, 2021
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Aug 21, 2021
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Aug 22, 2021
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Aug 30, 2021
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Sep 2, 2021
craig bot pushed a commit that referenced this pull request Sep 2, 2021
68652: kvserver: count draining nodes as live when computing quorum r=aayushshah15 a=aayushshah15

Similar to #67714

Draining nodes were considered non-live by the allocator when it made the
determination of whether a range could achieve quorum. This meant that, for
instance, on a cluster with a replication factor of 5, if we had 3 or more
nodes marked draining, we (with a high likelihood) wouldn't be able to
decommission _any other_ nodes from the cluster.

Furthermore, due to the same reason as above the system also would incorrectly
decide to not rebalance ranges that had more than a quorum of replicas on
draining nodes.

This patch fixes this problem by considering replicas on draining nodes as live
for the purposes of determining whether a range has quorum. This likely fixes a
subset of "stuck decommissioning" issues we've seen in the wild.

Follows from cockroachlabs/support#1105

Release justification: bug fix

Release note (bug fix): Previously, draining a quorum of nodes (i.e. >=2 if
replication factor is 3, >=3 if replication factor is 5, etc) would block the
subsequent decommissioning of any other nodes in the cluster. This patch fixes
this bug. Now, if the lowest replication factor of some zone in the cluster is
RF, operators should be able to safely drain up to RF-1 nodes simultaneously.

69115: mt_start_sql: enable enterprise features for multitenant sql servers r=JeffSwenson a=JeffSwenson

Enterprise features are controlled by the enterprise.license setting.
Currently this setting applies only to the host tenant cluster. This
change enables enterprise features for all tenant clusters. Enabling
enterprise features for all tenants is reasonable, because multi-tenant
deployments are an enterprise feature.

Release note: None

69571: jobs: don't reset next_run when resuming active schedules r=pbardea a=pbardea

Consider an active schedule that was created with a specific first_run
time. The first_run would populate the next_run_time on the schedule.
The resumption of this schedule before it executed would re-evaluate the
next runtime based off the schedule's recurrence.

This commit changes the scheduling system to only recompute the next run
time on paused schedules.

Release justification: bug fix
Release note (bug fix): Fix a bug where resuming an active schedule
would always reset its next run time. This was sometimes undesirable
with schedules that had a first_run option specified.

69696: kvserver: stop transferring leases to replicas that may need snapshots r=aayushshah15 a=aayushshah15

This commit disallows the `replicateQueue` from initiating lease
transfers to replicas that may be in need of a raft snapshot. Note that
the `StoreRebalancer` already has a stronger form of this check since it
disallows lease transfers to replicas that are lagging behind the raft
leader (which includes the set of replicas that need a snapshot).

In cases where the raft leader is not the leaseholder, we disallow the
replicateQueue from any sort of lease transfer until leaseholdership and
leadership are collocated. We rely on calls to
`maybeTransferRaftLeadershipToLeaseholderLocked()` (called on every raft
tick) to make sure that such periods of leadership / leaseholdership
misalignment are ephemeral and rare.

Alternative to #63507

Release justification: bug fix

Resolves #61604

Release note (bug fix): Fixes a bug that can cause prolonged
unavailability due to lease transfer to a replica that may be in need of
a raft snapshot.

69727: kv: deflake TestPriorityRatchetOnAbortOrPush r=nvanbenschoten a=nvanbenschoten

Fixes #68584.

The test was flaky for the reasons described in #68584. There doesn't appear to
be an easy way to fix this behavior, and it's not clear how valuable doing so
even is given how little we rely on transaction priorities anymore, so the
commit just deflakes the test by rejecting them.

Release justification: deflaking a test.

69728: clusterversion,kv: remove old cluster versions corresponding to 20.2 and 21.1 r=nvanbenschoten a=nvanbenschoten

Fixes most of #66544.

This pull request removes (almost) all old cluster versions corresponding to 20.2 and 21.1 which fall under the KV group's responsibility.

The PR leaves one remaining "combo version" of `TruncatedAndRangeAppliedStateMigration` plus `PostTruncatedAndRangeAppliedStateMigration`, which we'll want to remove before closing #66544. I've left this for @irfansharif both because its removal is a little more involved than the others and because I figured that he deserves to reap the fruit of his labor and get to delete the code related to the replicated truncated state and missing RangeAppliedStateKeys. Making that possible was a hard-earned win.

Release justification: cluster version cleanup

Co-authored-by: Aayush Shah <[email protected]>
Co-authored-by: Jeff <[email protected]>
Co-authored-by: Paul Bardea <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Sep 9, 2021
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Sep 9, 2021