Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cherrypick-2.0: storage: check leaseholder status in AdminSplit retry loop #23794

Merged

Conversation

nvanbenschoten
Copy link
Member

Cherrypick of #23762.

This change adds a call to redirectOnOrAcquireLease to the AdminSplit
retry loop. This ensures that each attempt to perform a split is done
by a leaseholder. Without this check, it was possible that a stale
follower or a replica that had already been removed from the range
could end up in an infinite retry loop. This was possible because
these replicas could have stale knowledge about the state of the
range descriptor. The effect of this was that the replicas' updates
to the local range descriptor key could indefinitely hit
ConditionFailedErrors, which are retried.

The change also adds a regression test that creates this
"replica removed while splitting" scenario. Without the leaseholder
check introduced here, the test would fail.

This change adds another testing filter that can be used to
influence the error returned from a request before it is evaluated.
Notably, the filter is run before the request is added to the
CommandQueue, so blocking in the filter will not block interfering
requests.

Release note: None
This loop used to return a `nil` error once the retry loop was
exhausted, which made it look like the split had succeeded. It
now returns an error.

Release note: None
Fixes cockroachdb#23673.

This change adds a call to `redirectOnOrAcquireLease` to the AdminSplit
retry loop. This ensures that each attempt to perform a split is done
by a leaseholder. Without this check, it was possible that a stale
follower or a replica that had already been removed from the range
could end up in an infinite retry loop. This was possible because
these replicas could have stale knowledge about the state of the
range descriptor. The effect of this was that the replicas' updates
to the local range descriptor key could indefinitely hit
`ConditionFailedErrors`, which are retried.

The change also adds a regression test that creates this
"replica removed while splitting" scenario. Without the leaseholder
check introduced here, the test would fail.

Now that the root cause of this infinite retry loop is fixed, we
should consider reverting cockroachdb#23310. This loop should not retry
indefinitely. If it does, exiting will just mask the issue.

Release note: None
@nvanbenschoten nvanbenschoten requested review from bdarnell and a team March 13, 2018 17:33
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@nvanbenschoten nvanbenschoten merged commit 41f9808 into cockroachdb:release-2.0 Mar 13, 2018
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/23762 branch March 13, 2018 18:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants