-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: check leaseholder status in AdminSplit retry loop #23762
storage: check leaseholder status in AdminSplit retry loop #23762
Conversation
Well, the intent was to return the persistent error up into the log file or jobs table where it would be visible to the user. Granted, I botched that 🙈, but I'm still nervous about retry loops that never bail out. Is this concern misplaced? This edge case went unnoticed for a while; I'm worried there might be another. |
I think the limit on the retry loop should stay. Reviewed 3 of 3 files at r1, 2 of 2 files at r2. pkg/storage/replica_command.go, line 716 at r2 (raw file):
There are also entry points to AdminSplit that don't go through executeAdminBatch, IIRC. pkg/storage/store.go, line 669 at r2 (raw file):
Be more specific, since there's already EvalKnobs.TestingEvalFilter which runs at evaluation time. (the comment on ReplicaRequestFilter about when this runs belongs here, not on the function typedef) Comments from Reviewable |
7dda735
to
3caaeca
Compare
TFTRs! Added a new commit (now the second) to address #23310 (comment). Review status: 2 of 5 files reviewed at latest revision, 2 unresolved discussions. pkg/storage/replica_command.go, line 716 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
It doesn't look like there are anymore. Everything else goes through the Replica by sending an You might be thinking about the pkg/storage/store.go, line 669 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Expanded this comment here but left the other because all of these function typedefs seem to have similar descriptions. Comments from Reviewable |
3caaeca
to
a48edc7
Compare
This change adds another testing filter that can be used to influence the error returned from a request before it is evaluated. Notably, the filter is run before the request is added to the CommandQueue, so blocking in the filter will not block interfering requests. Release note: None
This loop used to return a `nil` error once the retry loop was exhausted, which made it look like the split had succeeded. It now returns an error. Release note: None
Fixes cockroachdb#23673. This change adds a call to `redirectOnOrAcquireLease` to the AdminSplit retry loop. This ensures that each attempt to perform a split is done by a leaseholder. Without this check, it was possible that a stale follower or a replica that had already been removed from the range could end up in an infinite retry loop. This was possible because these replicas could have stale knowledge about the state of the range descriptor. The effect of this was that the replicas' updates to the local range descriptor key could indefinitely hit `ConditionFailedErrors`, which are retried. The change also adds a regression test that creates this "replica removed while splitting" scenario. Without the leaseholder check introduced here, the test would fail. Now that the root cause of this infinite retry loop is fixed, we should consider reverting cockroachdb#23310. This loop should not retry indefinitely. If it does, exiting will just mask the issue. Release note: None
a48edc7
to
85475fe
Compare
In tpccbench/nodes=9/cpu=4/multi-region, I consistently see the RESTORE fail a split because the retry loop in executeAdminCommandWithDescriptor interleaves with replicate queue activity until the retries are exhausted. This is likely since due to the geo-replicated nature of the cluster, the many steps involved in replicating the range take much longer than we're used to; in the logs I can see the split attempt and an interleaving replication change taking turns until the split gives up. The solution is to just retry the split forever as long as we're only seeing errors we know should eventually resolve. The retry limit was introduced to fix cockroachdb#23310 but there is lots of FUD around it. It's very likely that the problem encountered there was really fixed in cockroachdb#23762, which added a lease check between the split attempts. Touches cockroachdb#41028. Release justification: fixes roachtest failure, is low risk Release note: None
In tpccbench/nodes=9/cpu=4/multi-region, I consistently see the RESTORE fail a split because the retry loop in executeAdminCommandWithDescriptor interleaves with replicate queue activity until the retries are exhausted. This is likely since due to the geo-replicated nature of the cluster, the many steps involved in replicating the range take much longer than we're used to; in the logs I can see the split attempt and an interleaving replication change taking turns until the split gives up. The solution is to just retry the split forever as long as we're only seeing errors we know should eventually resolve. The retry limit was introduced to fix cockroachdb#23310 but there is lots of FUD around it. It's very likely that the problem encountered there was really fixed in cockroachdb#23762, which added a lease check between the split attempts. Touches cockroachdb#41028. Release justification: fixes roachtest failure, is low risk Release note: None
Fixes #23673.
This change adds a call to
redirectOnOrAcquireLease
to the AdminSplitretry loop. This ensures that each attempt to perform a split is done
by a leaseholder. Without this check, it was possible that a stale
follower or a replica that had already been removed from the range
could end up in an infinite retry loop. This was possible because
these replicas could have stale knowledge about the state of the
range descriptor. The effect of this was that the replicas' updates
to the local range descriptor key could indefinitely hit
ConditionFailedErrors
, which are retried.The change also adds a regression test that creates this
"replica removed while splitting" scenario. Without the leaseholder
check introduced here, the test would fail.
Release note: None