Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: check leaseholder status in AdminSplit retry loop #23762

Merged

Conversation

nvanbenschoten
Copy link
Member

@nvanbenschoten nvanbenschoten commented Mar 12, 2018

Fixes #23673.

This change adds a call to redirectOnOrAcquireLease to the AdminSplit
retry loop. This ensures that each attempt to perform a split is done
by a leaseholder. Without this check, it was possible that a stale
follower or a replica that had already been removed from the range
could end up in an infinite retry loop. This was possible because
these replicas could have stale knowledge about the state of the
range descriptor. The effect of this was that the replicas' updates
to the local range descriptor key could indefinitely hit
ConditionFailedErrors, which are retried.

The change also adds a regression test that creates this
"replica removed while splitting" scenario. Without the leaseholder
check introduced here, the test would fail.

Release note: None

@nvanbenschoten nvanbenschoten requested review from bdarnell, benesch, petermattis and a team March 12, 2018 21:59
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@benesch
Copy link
Contributor

benesch commented Mar 12, 2018

Now that the root cause of this infinite retry loop is fixed, we should consider reverting #23310. This loop should not retry indefinitely. If it does, exiting will just mask a more serious issue.

Well, the intent was to return the persistent error up into the log file or jobs table where it would be visible to the user. Granted, I botched that 🙈, but I'm still nervous about retry loops that never bail out. Is this concern misplaced? This edge case went unnoticed for a while; I'm worried there might be another.

@bdarnell
Copy link
Contributor

:lgtm:

I think the limit on the retry loop should stay.


Reviewed 3 of 3 files at r1, 2 of 2 files at r2.
Review status: all files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.


pkg/storage/replica_command.go, line 716 at r2 (raw file):

	for retryable := retry.StartWithCtx(ctx, retryOpts); retryable.Next(); {
		// Admin commands always require the range lease to begin (see
		// executeAdminBatch), but we may have lost it while in this retry loop.

There are also entry points to AdminSplit that don't go through executeAdminBatch, IIRC.


pkg/storage/store.go, line 669 at r2 (raw file):

	EvalKnobs batcheval.TestingKnobs

	// TestingRequestFilter is called before evaluating each command.

Be more specific, since there's already EvalKnobs.TestingEvalFilter which runs at evaluation time. (the comment on ReplicaRequestFilter about when this runs belongs here, not on the function typedef)


Comments from Reviewable

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/splitWithLease branch from 7dda735 to 3caaeca Compare March 13, 2018 14:28
@nvanbenschoten
Copy link
Member Author

TFTRs! Added a new commit (now the second) to address #23310 (comment).


Review status: 2 of 5 files reviewed at latest revision, 2 unresolved discussions.


pkg/storage/replica_command.go, line 716 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

There are also entry points to AdminSplit that don't go through executeAdminBatch, IIRC.

It doesn't look like there are anymore. Everything else goes through the Replica by sending an AdminSplitRequest.

You might be thinking about the splitQueue, which calls adminSplitWithDescriptor directly. Because this bypasses AdminSplit, it is not subject to the retry loop. This looks like it was an intentional decision made in #10728 (see #10728 (comment)).


pkg/storage/store.go, line 669 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Be more specific, since there's already EvalKnobs.TestingEvalFilter which runs at evaluation time. (the comment on ReplicaRequestFilter about when this runs belongs here, not on the function typedef)

Expanded this comment here but left the other because all of these function typedefs seem to have similar descriptions.


Comments from Reviewable

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/splitWithLease branch from 3caaeca to a48edc7 Compare March 13, 2018 14:30
This change adds another testing filter that can be used to
influence the error returned from a request before it is evaluated.
Notably, the filter is run before the request is added to the
CommandQueue, so blocking in the filter will not block interfering
requests.

Release note: None
This loop used to return a `nil` error once the retry loop was
exhausted, which made it look like the split had succeeded. It
now returns an error.

Release note: None
Fixes cockroachdb#23673.

This change adds a call to `redirectOnOrAcquireLease` to the AdminSplit
retry loop. This ensures that each attempt to perform a split is done
by a leaseholder. Without this check, it was possible that a stale
follower or a replica that had already been removed from the range
could end up in an infinite retry loop. This was possible because
these replicas could have stale knowledge about the state of the
range descriptor. The effect of this was that the replicas' updates
to the local range descriptor key could indefinitely hit
`ConditionFailedErrors`, which are retried.

The change also adds a regression test that creates this
"replica removed while splitting" scenario. Without the leaseholder
check introduced here, the test would fail.

Now that the root cause of this infinite retry loop is fixed, we
should consider reverting cockroachdb#23310. This loop should not retry
indefinitely. If it does, exiting will just mask the issue.

Release note: None
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/splitWithLease branch from a48edc7 to 85475fe Compare March 13, 2018 15:52
@nvanbenschoten nvanbenschoten merged commit ae35649 into cockroachdb:master Mar 13, 2018
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/splitWithLease branch March 13, 2018 17:25
tbg added a commit to tbg/cockroach that referenced this pull request Oct 8, 2019
In tpccbench/nodes=9/cpu=4/multi-region, I consistently see the RESTORE
fail a split because the retry loop in executeAdminCommandWithDescriptor
interleaves with replicate queue activity until the retries are
exhausted. This is likely since due to the geo-replicated nature of the
cluster, the many steps involved in replicating the range take much
longer than we're used to; in the logs I can see the split attempt
and an interleaving replication change taking turns until the split
gives up. The solution is to just retry the split forever as long
as we're only seeing errors we know should eventually resolve.

The retry limit was introduced to fix cockroachdb#23310 but there is lots of
FUD around it. It's very likely that the problem encountered there
was really fixed in cockroachdb#23762, which added a lease check between the
split attempts.

Touches cockroachdb#41028.

Release justification: fixes roachtest failure, is low risk

Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Oct 16, 2019
In tpccbench/nodes=9/cpu=4/multi-region, I consistently see the RESTORE
fail a split because the retry loop in executeAdminCommandWithDescriptor
interleaves with replicate queue activity until the retries are
exhausted. This is likely since due to the geo-replicated nature of the
cluster, the many steps involved in replicating the range take much
longer than we're used to; in the logs I can see the split attempt
and an interleaving replication change taking turns until the split
gives up. The solution is to just retry the split forever as long
as we're only seeing errors we know should eventually resolve.

The retry limit was introduced to fix cockroachdb#23310 but there is lots of
FUD around it. It's very likely that the problem encountered there
was really fixed in cockroachdb#23762, which added a lease check between the
split attempts.

Touches cockroachdb#41028.

Release justification: fixes roachtest failure, is low risk

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants