-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvserver: gracefully handle races during learner removals #79405
kvserver: gracefully handle races during learner removals #79405
Conversation
4d1607f
to
2559fcc
Compare
cc @tbg, feel free to ignore this, but I wanted to add you in case you see something obviously wrong with doing what we're doing in this patch. |
Looks good! Not formally signing off because I have late night brain mush but I did leave some suggestions and can give a real review when the next turnaround comes. |
394efdf
to
1efaf60
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did leave some suggestions
Hm, I'm not seeing these suggestions.
Reviewed 1 of 2 files at r1, 5 of 5 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15 and @tbg)
pkg/kv/kvserver/replica_command.go, line 2068 at r2 (raw file):
return nil, errors.AssertionFailedf("cannot transition from %s to VOTER_OUTGOING", prevTyp) } else { rDesc, _, _ = updatedDesc.SetReplicaType(chg.target.NodeID, chg.target.StoreID, roachpb.VOTER_OUTGOING)
Is this case still valid? Or does the check in execReplicationChangesForVoters
that switches from a internalChangeTypeRemoveLearner
to a internalChangeTypeDemoteVoterToLearner
make it dead code?
pkg/kv/kvserver/replica_command.go, line 2210 at r2 (raw file):
learnerAlreadyRemoved := true check := func(kvDesc *roachpb.RangeDescriptor) bool {
nit: could we name this return arg?
pkg/kv/kvserver/replica_command.go, line 2230 at r2 (raw file):
for _, repl := range kvDesc.Replicas().Descriptors() { if repl.StoreID == chgs[0].target.StoreID { learnerAlreadyRemoved = false
nit: break
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry hadn't hit the submit button
pkg/kv/kvserver/replica_command.go
Outdated
@@ -2202,6 +2232,13 @@ func execChangeReplicasTxn( | |||
returnDesc = desc | |||
return nil | |||
} | |||
if chgs.isSingleLearnerRemoval() && learnerAlreadyRemoved { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lol on line 2218 above.
pkg/kv/kvserver/replica_command.go
Outdated
@@ -2169,19 +2182,36 @@ func execChangeReplicasTxn( | |||
|
|||
descKey := keys.RangeDescriptorKey(referenceDesc.StartKey) | |||
|
|||
learnerAlreadyRemoved := true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems a bit iffy tbh. Relying on a closure that is invoked a few call stacks down and its ordering here is fairly intransparent. I spent around ten minutes trying to think of an elegant way to refactor these checks, but came up short. But I do feel sort of strongly that we should avoid the opaque mutation to learnerAlreadyRemoved
. Perhaps you can pull out a closure that we invoke in both places? Something like
isIdempotentLearnerRemoval := func(actual *roachpb.RangeDescriptor, storeID roachpb.StoreID) bool {
// ...
}
You can invoke that in check
, and again when short-circuiting the txn.
But maybe we can do one better, and return a skip bool
from check
and pass it through from conditionalGetDescValueFromDB
?
desc, dbDescValue, skip, err := conditionalGetDescValueFromDB(
ctx, txn, referenceDesc.StartKey, false /* forUpdate */, check)
if err != nil { ... }
if skip {
// The new descriptor already reflects what we needed to get done.
return nil
}
If this fits all existing uses, that would be a nice solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried the second option you suggested, how do you feel about it?
1efaf60
to
03d2b33
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/replica_command.go, line 2068 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Is this case still valid? Or does the check in
execReplicationChangesForVoters
that switches from ainternalChangeTypeRemoveLearner
to ainternalChangeTypeDemoteVoterToLearner
make it dead code?
Since VOTER_OUTGOING
s are not a thing anymore, and neither are non-atomic replication changes, this area as a whole could use a proper cleanup. I considered adding a clean-up commit at the end, but there are a few test flakes I'd need to iron out. Do you mind if we do this cleanup separately? I'd also like to backport this, so it's probably better to keep this patch small.
pkg/kv/kvserver/replica_command.go, line 2210 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: could we name this return arg?
Done.
pkg/kv/kvserver/replica_command.go, line 2230 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit:
break
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15 and @tbg)
pkg/kv/kvserver/replica_command.go, line 2068 at r2 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
Since
VOTER_OUTGOING
s are not a thing anymore, and neither are non-atomic replication changes, this area as a whole could use a proper cleanup. I considered adding a clean-up commit at the end, but there are a few test flakes I'd need to iron out. Do you mind if we do this cleanup separately? I'd also like to backport this, so it's probably better to keep this patch small.
Yes, let's absolutely keep this separate.
pkg/kv/kvserver/replica_command.go, line 2710 at r3 (raw file):
// The supplied `check` method verifies whether the descriptor fetched from kv // matches expectations and returns a "skip" hint that the caller can use to // optionally elide work that doesn't need to be done.
We should mention here that if a function returns skip = true
, matched
should also be true. Maybe even assert that below.
Or we could change this contract to require the opposite (skip = true => matches = false
). It feels somewhat strange that we're pretending something matched when it did not. If we could express the condition "did not match, but that's ok", we'd be saying exactly what we mean.
@tbg might have thoughts.
pkg/kv/kvserver/replica_command.go, line 2747 at r3 (raw file):
existingDescKV, err := get(ctx, descKey) if err != nil { return nil, nil, skip, errors.Wrap(err, "fetching current range descriptor value")
nit: return false
directly so that readers don't need to check whether skip
has already been set. Same thing a few lines down.
03d2b33
to
44052fd
Compare
Previously, at the end of a replication change, if the `ChangeReplicas` request found that the (demoted) learner replica it was trying to remove from the range had already been removed (presumably because it raced with the mergeQueue, the `StoreRebalancer`, or something else), it would error out. This was unfortunate, because, for all practical purposes, the replication change _had succeeded_. We can now gracefully handle this instead by no-oping if we detect that the replica we were trying to remove has already been removed. Release note: None
44052fd
to
8b34184
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/replica_command.go, line 2710 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
We should mention here that if a function returns
skip = true
,matched
should also be true. Maybe even assert that below.Or we could change this contract to require the opposite (
skip = true => matches = false
). It feels somewhat strange that we're pretending something matched when it did not. If we could express the condition "did not match, but that's ok", we'd be saying exactly what we mean.@tbg might have thoughts.
I went with the second option you proposed. How do you feel about it? Did you want me to also assert that when skip==true
that matched
must be false
?
pkg/kv/kvserver/replica_command.go, line 2747 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: return
false
directly so that readers don't need to check whetherskip
has already been set. Same thing a few lines down.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM to me, letting Nathan put the 👔 on it
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15 and @nvanbenschoten)
pkg/kv/kvserver/replica_command.go, line 2710 at r3 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
I went with the second option you proposed. How do you feel about it? Did you want me to also assert that when
skip==true
thatmatched
must befalse
?
Looks good to me, thanks for chewing through this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 2 files at r1, 4 of 5 files at r2, 1 of 1 files at r4, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15 and @nvanbenschoten)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15)
TFTRs bors r+ |
I think we could reasonably wait to backport this until the first or second patch release since the race this PR fixes isn't nearly as likely as #79379. |
Build succeeded: |
…locateRange` Previously, after cockroachdb#79405, `maybeLeaveAtomicChangeReplicasAndRemoveLearners()` could return a range descriptor that contained learner replicas. This was because of the idempotency introduced in cockroachdb#79405, which relaxed how/when we failed replication changes due to descriptor mismatch. The hazard was that, while we would ensure that the learner we were trying to remove was indeed gone, we wouldn't do enough to ensure that there weren't new learners that were added to the range due to other concurrent `AdminRelocateRange` calls. This was a violation of its contract and could sometimes fail `AdminRelocateRange` requests. This commit makes it such that when `maybeLeaveAtomicChangeReplicasAndRemoveLearners` returns a non-error response, it guarantees that _all_ the learner replicas have been removed according to the returned range descriptor. Release note: None
This commit is an alternative to cockroachdb#80205. In cockroachdb#79405, we relaxed the check that `AdminChangeReplicas` performs comparing the expected range descriptor state against the range descriptor in KV when we were performing single learner removals. This was because these learner removals are performed at the end of a replication change (i.e. after its already completed) and thus, could be made idempotent. The above change messed with a subtle part of the contract of `maybeLeaveAtomicChangeReplicasAndRemoveLearners()` since now it was no longer guaranteed to always successfully return with a range desc that did not contain learners (due to intervening replication changes that were now no longer detected because of the relaxation mentioned above). This commit fixes this regression by making `maybeLeaveAtomicChangeReplicasAndRemoveLearners` check at the end if we've raced with another replication change. If so, we return an error instead of incorrectly returning a successful response with a descriptor that could still have learners or be in a joint config. Release note: None
Previously, at the end of a replication change, if the
ChangeReplicas
requestfound that the (demoted) learner replica it was trying to remove from the range
had already been removed (presumably because it raced with the mergeQueue, the
StoreRebalancer
, or something else), it would error out. This wasunfortunate, because, for all practical purposes, the replication change had
succeeded.
We can now gracefully handle this instead by no-oping if we detect that the
replica we were trying to remove has already been removed.
Relates to #79118
Release note: None
Jira issue: CRDB-14809