Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: promote expiration-based lease to epoch without sequence number change #117840

Merged
merged 3 commits into from
Jan 19, 2024

Conversation

nvanbenschoten
Copy link
Member

Fixes #117630.
Fixes #90656.
Fixes #98553.

Informs #61986.
Informs #115191.

This commit updates the post-lease transfer promotion of expiration-based leases to epoch-based leases to not change the sequence number of the lease. This avoids invalidating all requests proposed under the original expiration-based lease, which can lead to RETRY_ASYNC_WRITE_FAILURE errors.

The change accomplishes this by updating the Lease.Equivalent method to consider an expiration-based lease to be equivalent to an epoch-based lease that is held by the same replica and has the same start time. Doing so requires some care, because lease equivalency is checked below Raft and needs to remain deterministic across binary versions.

This change requires a cluster version check, so it cannot be backported.

Release note (bug fix): Improved an interaction during range lease transfers which could previously cause RETRY_ASYNC_WRITE_FAILURE errors to be returned to clients.

Test changes to just the leaseholder or just the start time, instead of
both at the same time. This ensures that either lead to a sequence
number change.

Epic: None
Release note: None
This commit is a small refactor so that we can eliminate the second
call to Lease.Equivalent in RequestLease. There is already one in
evalNewLease which is shared between RequestLease and TransferLease.

Epic: None
Release note: None
…hange

Fixes cockroachdb#117630.
Fixes cockroachdb#90656.
Fixes cockroachdb#98553.

Informs cockroachdb#61986.
Informs cockroachdb#115191.

This commit updates the post-lease transfer promotion of expiration-based leases
to epoch-based leases to not change the sequence number of the lease. This
avoids invalidating all requests proposed under the original expiration-based
lease, which can lead to `RETRY_ASYNC_WRITE_FAILURE` errors.

The change accomplishes this by updating the `Lease.Equivalent` method to
consider an expiration-based lease to be equivalent to an epoch-based lease that
is held by the same replica and has the same start time. Doing so requires some
care, because lease equivalency is checked below Raft and needs to remain
deterministic across binary versions.

This change requires a cluster version check, so it cannot be backported.

Release note (bug fix): Improved an interaction during range lease transfers
which could previously cause `RETRY_ASYNC_WRITE_FAILURE` errors to be returned
to clients.
@nvanbenschoten nvanbenschoten requested a review from a team as a code owner January 16, 2024 23:58
@nvanbenschoten nvanbenschoten requested a review from a team January 16, 2024 23:58
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change requires a cluster version check, so it cannot be backported.

That's too bad. We should document for support purposes that these can also be avoided by setting kv.transfer_expiration_leases_first.enabled = false, with caveats about the unavailability risk.

Reviewed 1 of 1 files at r1, 3 of 3 files at r2, 10 of 10 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/kvserver/kvserverpb/proposer_kv.proto line 140 at r3 (raw file):

  // - in v24.2, we can set this field to true unconditionally for lease request
  //   proposals. It must still be consulted during lease application, because
  //   the raft proposal may have been performed by an older node.

It's annoying that we need 4 releases to migrate this, but it seems prudent. I was thinking of ways to get it down to 3, but it's riskier and there's no real rush I guess.


pkg/roachpb/data.go line 1885 at r3 (raw file):

// NB: Lease.Equivalent is NOT symmetric. For expiration-based
// leases, a lease is equivalent to another with an equal or
// later expiration, but not an earlier expiration.

nit: should we spell out the expiration->epoch case here, which is asymmetric as well?


pkg/roachpb/data.go line 1939 at r3 (raw file):

			// For expiration-based leases, extensions are considered equivalent.
			// This is one case where Equivalent is not commutative and, as such,

nit: ditto this comment.

Copy link
Member Author

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change requires a cluster version check, so it cannot be backported.

That's too bad. We should document for support purposes that these can also be avoided by setting kv.transfer_expiration_leases_first.enabled = false, with caveats about the unavailability risk.

We put #116092 in place in earlier releases to eliminate the risk of these RETRY_ASYNC_WRITE_FAILURE retries for COPY. Beyond that specific use case, I'm not sure how much more prescriptive we can be towards support. kv.transfer_expiration_leases_first.enabled is an important protection and we don't want users switching that to false indefinitely, so I don't think we want to document that in a runbook as the "solution" to RETRY_ASYNC_WRITE_FAILURE errors.

Still, I think there are cases where this setting can temporarily help and it's better to keep support in the loop, so I wrote up https://cockroachlabs.atlassian.net/wiki/spaces/CKB/pages/3332505662/RETRY+ASYNC+WRITE+FAILURE+during+lease+transfers about this.

TFTR!

bors r=erikgrinaker

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/roachpb/data.go line 1885 at r3 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: should we spell out the expiration->epoch case here, which is asymmetric as well?

Done.


pkg/roachpb/data.go line 1939 at r3 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: ditto this comment.

Done.

@craig craig bot merged commit 2a98613 into cockroachdb:master Jan 19, 2024
9 checks passed
@craig
Copy link
Contributor

craig bot commented Jan 19, 2024

Build succeeded:

@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/expToEpochSeq branch January 19, 2024 19:39
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jan 19, 2024
I had missed a `git push` immediately before merging. This updates
two comments.

Epic: None
Release note: None
craig bot pushed a commit that referenced this pull request Jan 20, 2024
117969: batcheval: add `PushTxnResponse.AmbiguousAbort` r=erikgrinaker a=erikgrinaker

Extracted from #117612.

---

This indicates to the caller that the `ABORTED` status of the pushed transaction is ambiguous, and the transaction may in fact have been committed and GCed already. This information is also plumbed through the `IntentResolver` txn push APIs.

Touches #104309.
Epic: none
Release note: None

117992: roachpb: address review comments from #117840 r=nvanbenschoten a=nvanbenschoten

I had missed a `git push` immediately before merging #117840. This updates two comments.

Epic: None
Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants