Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: post-lease transfer promotion from expiration-based lease to epoch-based lease causes failed writes #117630

Closed
nvanbenschoten opened this issue Jan 10, 2024 · 1 comment · Fixed by #117840
Assignees
Labels
A-kv Anything in KV that doesn't belong in a more specific category. A-read-committed Related to the introduction of Read Committed C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs P-2 Issues/test failures with a fix SLA of 3 months T-kv KV Team

Comments

@nvanbenschoten
Copy link
Member

nvanbenschoten commented Jan 10, 2024

In fc2d180, we adjusted the lease transfer protocol to pass through an expiration-based lease before being upgraded to an expiration-based lease. This works, but it means that the lease sequence advances twice during a lease transfer. As a result, writes proposed under the expiration-based lease can be re-ordered with the lease promotion and will fail below raft.

This can lead to RETRY_ASYNC_WRITE_FAILURE errors and may be the underlying cause of #90656 and #98553.

We should fix this, ideally by promoting from an expiration-based lease to an epoch-based lease without advancing the lease sequence number.

Jira issue: CRDB-35282

@nvanbenschoten nvanbenschoten added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv Anything in KV that doesn't belong in a more specific category. T-kv KV Team labels Jan 10, 2024
@nvanbenschoten nvanbenschoten self-assigned this Jan 10, 2024
@nvanbenschoten nvanbenschoten added O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs P-2 Issues/test failures with a fix SLA of 3 months A-read-committed Related to the introduction of Read Committed labels Jan 16, 2024
@nvanbenschoten
Copy link
Member Author

This is related to #61986, because without this change, read summaries shipped during the lease transfer will be invalidated by the lease sequence increment during the expiration-based lease to epoch-based lease promotion.

craig bot pushed a commit that referenced this issue Jan 19, 2024
117840: kv: promote expiration-based lease to epoch without sequence number change r=erikgrinaker a=nvanbenschoten

Fixes #117630.
Fixes #90656.
Fixes #98553.

Informs #61986.
Informs #115191.

This commit updates the post-lease transfer promotion of expiration-based leases to epoch-based leases to not change the sequence number of the lease. This avoids invalidating all requests proposed under the original expiration-based lease, which can lead to `RETRY_ASYNC_WRITE_FAILURE` errors.

The change accomplishes this by updating the `Lease.Equivalent` method to consider an expiration-based lease to be equivalent to an epoch-based lease that is held by the same replica and has the same start time. Doing so requires some care, because lease equivalency is checked below Raft and needs to remain deterministic across binary versions.

This change requires a cluster version check, so it cannot be backported.

Release note (bug fix): Improved an interaction during range lease transfers which could previously cause `RETRY_ASYNC_WRITE_FAILURE` errors to be returned to clients.

117899: backupccl: skip `TestBackupRestoreAppend` under `deadlock` r=rail a=rickystewart

These tests are likely to time out.

Epic: CRDB-8308
Release note: None

117940: backupccl,sql: skip a couple more tests under duress r=rail a=rickystewart

These tests are all timing out. For the failures that seem suspect in some way, I have filed GitHub issues.

Epic: CRDB-8308
Release note: None

117950: copy: skip TestCopyFromRetries for now r=yuzefovich a=yuzefovich

We recently expanded this test and it became flaky. Skip it until we stabilize it.

Informs: #117912.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
@craig craig bot closed this as completed in 54ebbac Jan 19, 2024
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jan 19, 2024
Closes cockroachdb#115191.
Depends on cockroachdb#61986.

This commit switches the two nightly Read Committed variants of the TPC-C
roachtest to run without transaction retry loops, using the `--txn-retries` flag
introduced in cockroachdb#117096. With cockroachdb#117630 and cockroachdb#61986 resolved (the latter of which
is still in review and under development), these tests both pass.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jan 24, 2024
Closes cockroachdb#115191.
Depends on cockroachdb#61986.

This commit switches the two nightly Read Committed variants of the TPC-C
roachtest to run without transaction retry loops, using the `--txn-retries` flag
introduced in cockroachdb#117096. With cockroachdb#117630 and cockroachdb#61986 resolved (the latter of which
is still in review and under development), these tests both pass.

Release note: None
craig bot pushed a commit that referenced this issue Feb 2, 2024
118000: roachtest: run Read Committed variants of TPC-C without txn retry loops r=nvanbenschoten a=nvanbenschoten

Closes #115191.
Depends on #61986.

This commit switches the two nightly Read Committed variants of the TPC-C roachtest to run without transaction retry loops, using the `--txn-retries` flag introduced in #117096. With #117630 and #61986 resolved (the latter of which is still in review and under development), these tests both pass.

Release note: None

118600: gceworker bootstrap: fix checksum for go download r=rickystewart a=jlinder

On the last go version upgrade to cockroach, the wrong checksum was entered in the gceworker bootstrap script for the downloaded tar file. This fixes it to be the correct checksum.

Epic: none
Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: James H. Linder <[email protected]>
wenyihu6 pushed a commit to wenyihu6/cockroach that referenced this issue Feb 21, 2024
Closes cockroachdb#115191.
Depends on cockroachdb#61986.

This commit switches the two nightly Read Committed variants of the TPC-C
roachtest to run without transaction retry loops, using the `--txn-retries` flag
introduced in cockroachdb#117096. With cockroachdb#117630 and cockroachdb#61986 resolved (the latter of which
is still in review and under development), these tests both pass.

Release note: None
@github-project-automation github-project-automation bot moved this to Closed in KV Aug 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv Anything in KV that doesn't belong in a more specific category. A-read-committed Related to the introduction of Read Committed C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs P-2 Issues/test failures with a fix SLA of 3 months T-kv KV Team
Projects
No open projects
Status: Closed
Development

Successfully merging a pull request may close this issue.

1 participant