Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: use tscache summary to eliminate txn retries due to lease transfers and range merges #61986

Closed
nvanbenschoten opened this issue Mar 14, 2021 · 2 comments · Fixed by #118300
Assignees
Labels
A-kv-distribution Relating to rebalancing and leasing. A-kv-transactions Relating to MVCC and the transactional model. A-read-committed Related to the introduction of Read Committed C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs P-3 Issues/test failures with no fix SLA T-kv KV Team

Comments

@nvanbenschoten
Copy link
Member

nvanbenschoten commented Mar 14, 2021

Background / Motivation

In #60521, we began shipping a summary of an outgoing leaseholder's timestamp cache during lease transfers and range merges. The incoming / post-merge leaseholder then applies this summary after assuming control of the keyspace to ensure that no future writes are allowed to invalidate prior reads.

The structure, represented by a new ReadSummary proto, includes information like read on key "a" were served up to time 10. Because the timestamp cache is used to prevent transaction replays (see CanCreateTxnRecord), it also includes information like txn 1234 already committed.

This replaced an existing mechanism where the incoming leaseholder would conservatively bump its timestamp cache all the way up to the lease start time / merge freeze time across its entire keyspace. The higher fidelity summary allows the new leaseholder to avoid bumping its timestamp cache as aggressively. This can limit the impact of lease transfers on foreground traffic - reducing false read-write contention which causes transaction retries (e.g. see TestStoreLeaseTransferTimestampCacheRead) and reducing false positives for transaction replay detection which causes transaction aborts (e.g. see TestStoreLeaseTransferTimestampCacheTxnRecord).

However, currently, the ReadSummary structure still only maintains a low-resolution snapshot of the outgoing leaseholder's timestamp cache. So while #60521 introduced the mechanism to ship a timestamp cache summary (and fixed two bugs in the process), it didn't actually begin taking full advantage of this new mechanism.

Proposed Change

Now that the ReadSummary mechanism is in place, we can begin using it to limit transaction retries due to lease transfers and range merges.

First, we'll want to clean up the compatibility code added in #60521, now that we can be sure that all nodes in any cluster that touch our changes here will be aware of the ReadSummary and will be using the per-range closed timestamp system. This will allow us to remove code like this and this.

Once we do that, we'll want to augment the ReadSummary structure with the ability to carry high-resolution information, subject to some memory limits. We'll update Replica.GetCurrentReadSummary to collect this information from the leaseholder's timestamp cache. The details of this structure are TBD, as are the details of the memory limits. We'll likely want to prioritize the local keyspace segment over the global keyspace segment, as txn aborts are more disruptive than txn pushes, because they cannot be refreshed away.

Once we can represent and capture higher-resolution information in a ReadSummary, we'll want to introduce some form of compression of these summaries. This will allow us to achieve the desired scheme of shipping a high-res, high-mem summary on log entries but only persisting a low-res, low-mem summary indefinitely in a range's keyspace:

// When a ReadSummary is set in a ReplicatedEvalResult, there is always also a
// write to the RangePriorReadSummaryKey in the RaftCommand.WriteBatch. The
// persisted summary may be identical to the summary in this field, but it
// does not have to be. Notably, we intended for the summary included in the
// ReplicatedEvalResult to eventually be a much higher-resolution version of
// the ReadSummmary than the version persisted. This scheme of persisting a
// compressed ReadSummary indefinitely and including a higher-resolution
// ReadSummary on the RaftCommand allows us to optimize for the common case
// where the lease transfer is applied on the new leaseholder through Raft log
// application while ensuring correctness in the case where the lease transfer
// is applied on the new leaseholder through a Raft snapshot.

With these changes in place, we should see a stark drop in the disruption of lease transfers and range merges to foreground traffic.

Jira issue: CRDB-2701

gz#15057

gz#19312

Epic CRDB-34172

@nvanbenschoten nvanbenschoten added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-distribution Relating to rebalancing and leasing. A-kv-transactions Relating to MVCC and the transactional model. labels Mar 14, 2021
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 26, 2021
Fixes cockroachdb#60929.
Relates to cockroachdb#61986.
Relates to cockroachdb#61989.

This commit fixes a closed timestamp violation that could allow a
value/intent write at a timestamp below a range's closed timestamp. This
could allow for serializability violations if it allowed a follower read
to miss a write and could lead to a panic in the rangefeed processor if
a rangefeed was watching at the right time, as we saw in cockroachdb#60929.

In cockroachdb#60929, we found that this bug was caused by a range merge and a
lease transfer racing in such a way that the closed timestamp could
later be violated by a write to the subsumed portion of the joint range.
The root cause of this was an opportunistic optimization made in 7037b54
to consider a range's lease start time as an input to its closed
timestamp computation. This optimization did not account for the
possibility of serving writes to a newly subsumed keyspace below a
range's lease start time if that keyspace was merged into a range under
its current lease and with a freeze time below the current lease start
time. This bug is fixed by removing the optimization, which was on its
way out to allow for cockroachdb#61986 anyway.

Note that removing this optimization does not break
`TestClosedTimestampCanServeThroughoutLeaseTransfer`, because the v2
closed timestamp system does not allow for closed timestamp regressions,
even across leaseholders. This was one of the many benefits of the new
system.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 30, 2021
Fixes cockroachdb#60929.
Relates to cockroachdb#61986.
Relates to cockroachdb#61989.

This commit fixes a closed timestamp violation that could allow a
value/intent write at a timestamp below a range's closed timestamp. This
could allow for serializability violations if it allowed a follower read
to miss a write and could lead to a panic in the rangefeed processor if
a rangefeed was watching at the right time, as we saw in cockroachdb#60929.

In cockroachdb#60929, we found that this bug was caused by a range merge and a
lease transfer racing in such a way that the closed timestamp could
later be violated by a write to the subsumed portion of the joint range.
The root cause of this was an opportunistic optimization made in 7037b54
to consider a range's lease start time as an input to its closed
timestamp computation. This optimization did not account for the
possibility of serving writes to a newly subsumed keyspace below a
range's lease start time if that keyspace was merged into a range under
its current lease and with a freeze time below the current lease start
time. This bug is fixed by removing the optimization, which was on its
way out to allow for cockroachdb#61986 anyway.

Note that removing this optimization does not break
`TestClosedTimestampCanServeThroughoutLeaseTransfer`, because the v2
closed timestamp system does not allow for closed timestamp regressions,
even across leaseholders. This was one of the many benefits of the new
system.
craig bot pushed a commit that referenced this issue Mar 30, 2021
62570: kv: don't consider lease start time as closed timestamp r=nvanbenschoten a=nvanbenschoten

Fixes #60929.
Relates to #61986.
Relates to #61989.

This commit fixes a closed timestamp violation that could allow a
value/intent write at a timestamp below a range's closed timestamp. This
could allow for serializability violations if it allowed a follower read
to miss a write and could lead to a panic in the rangefeed processor if
a rangefeed was watching at the right time, as we saw in #60929.

In #60929, we found that this bug was caused by a range merge and a
lease transfer racing in such a way that the closed timestamp could
later be violated by a write to the subsumed portion of the joint range.
The root cause of this was an opportunistic optimization made in 7037b54
to consider a range's lease start time as an input to its closed
timestamp computation. This optimization did not account for the
possibility of serving writes to a newly subsumed keyspace below a
range's lease start time if that keyspace was merged into a range under
its current lease and with a freeze time below the current lease start
time. This bug is fixed by removing the optimization, which was on its
way out to allow for #61986 anyway.

Note that removing this optimization does not break
`TestClosedTimestampCanServeThroughoutLeaseTransfer`, because the v2
closed timestamp system does not allow for closed timestamp regressions,
even across leaseholders. This was one of the many benefits of the new
system.

Co-authored-by: Nathan VanBenschoten <[email protected]>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 30, 2021
Fixes cockroachdb#60929.
Relates to cockroachdb#61986.
Relates to cockroachdb#61989.

This commit fixes a closed timestamp violation that could allow a
value/intent write at a timestamp below a range's closed timestamp. This
could allow for serializability violations if it allowed a follower read
to miss a write and could lead to a panic in the rangefeed processor if
a rangefeed was watching at the right time, as we saw in cockroachdb#60929.

In cockroachdb#60929, we found that this bug was caused by a range merge and a
lease transfer racing in such a way that the closed timestamp could
later be violated by a write to the subsumed portion of the joint range.
The root cause of this was an opportunistic optimization made in 7037b54
to consider a range's lease start time as an input to its closed
timestamp computation. This optimization did not account for the
possibility of serving writes to a newly subsumed keyspace below a
range's lease start time if that keyspace was merged into a range under
its current lease and with a freeze time below the current lease start
time. This bug is fixed by removing the optimization, which was on its
way out to allow for cockroachdb#61986 anyway.

Note that removing this optimization does not break
`TestClosedTimestampCanServeThroughoutLeaseTransfer`, because the v2
closed timestamp system does not allow for closed timestamp regressions,
even across leaseholders. This was one of the many benefits of the new
system.
@jlinder jlinder added the T-kv KV Team label Jun 16, 2021
@ajwerner
Copy link
Contributor

ajwerner commented May 4, 2022

This will help with restart transaction: TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_NEW_LEASE_PREVENTS_TXN) right?

@nvanbenschoten
Copy link
Member Author

Yes, this should eliminate the majority of those errors after cooperative lease transfers.

@nvanbenschoten nvanbenschoten added the A-read-committed Related to the introduction of Read Committed label Jun 5, 2023
@andrewbaptist andrewbaptist added the O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs label Dec 18, 2023
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Dec 20, 2023
Informs cockroachdb#61986.

Now that the use of ReadSummaries has been fully phased in, we don't
need this logic. However, ReadSummaries are still low resolution, so
we'll bump the incoming leaseholder's timestamp cache to the maximum
timestamp in the outgoing leaseholder's timestamp cache.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jan 25, 2024
Informs cockroachdb#61986.

This commit adds a new `Serialize` function to `tscache.Cache` implementations.
This serialization uses the `readsummary/rspb.Segment` representation added in
0950a1e. Serialization of the production `sklImpl` uses the Segment merging
logic added in e4fc6f1 in order to merge together a partial serializations of
each individual `sklPage` in the data structure.

Timestamp cache serialization will be used to address cockroachdb#61986.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jan 29, 2024
… range merges

Fixes cockroachdb#61986.
Fixes cockroachdb#117486.
Unblocks cockroachdb#118000.

This commit uses the new `tscache.Cache.Serialize` method introduced in cockroachdb#118299 to
ship high-resolution summaries of the timestamp cache during lease transfers and
ranges merges. In doing so, it eliminates the loss of precision that occurs in an
incoming leaseholder's timestamp cache when it receives a lease transfer or range
merge.

This loss of precision was a source of transaction retries for three reasons:
1. txn tombstone marker keys would have their timestamp advanced, leading to
   TransactionAbortedError with the `ABORT_REASON_NEW_LEASE_PREVENTS_TXN`
   reason.
2. txn push marker keys would have their timestamp advanced, leading to
   transactions having their commit timestamp pushed, which could lead to
   TransactionRetryError with the `RETRY_SERIALIZABLE` reason.
3. global keys would have their timestamp advanced as if they had been read,
   also leading to transactions having their commit timestamp pushed if they
   wrote to those keys, which could also lead to TransactionRetryError with the
   `RETRY_SERIALIZABLE` reason.

The first issue here is the most severe, both because it can not be refreshed
away and because it affects transactions of all isolation levels.

This commit introduces two new cluster settings to control the maximum size of
these timestamp cache read summaries:
- `kv.lease_transfer_read_summary.local_budget`
- `kv.lease_transfer_read_summary.global_budget`

It configures the local keyspace budget to 4MB and the global keyspace budget to
0B. This default configuration should be sufficient to eliminate the first two
sources of retries described above. The third has not been observed as a serious
issue in practice, so we configure the global budget to 0 so that we can hit a
serialization fast-path.

Release note (ops change): Transaction replay protection state is now
passed between the outgoing and incoming leaseholder for a range during
a lease transfer. This avoids cases where lease transfers can cause
transactions to throw TransactionAbortedError(ABORT_REASON_NEW_LEASE_PREVENTS_TXN) errors.
craig bot pushed a commit that referenced this issue Jan 31, 2024
118299: kv/tscache: implement timestamp cache serialization r=nvanbenschoten a=nvanbenschoten

Informs #61986.

This commit adds a new `Serialize` function to `tscache.Cache` implementations. This serialization uses the `readsummary/rspb.Segment` representation added in 0950a1e. Serialization of the production `sklImpl` uses the Segment merging logic added in e4fc6f1 in order to merge together a partial serializations of each individual `sklPage` in the data structure.

Timestamp cache serialization will be used to address #61986.

Release note: None

118543: catalog/lease: detect if synchronous lease releases are successful r=fqazi a=fqazi

Previously, for unit testing, we added support for synchronously releasing leases. If the context was cancelled when releasing a lease synchronously, it was possible for the lease to be erased from memory and not from storage. As a result, reacquisition could hit an error when session-based leasing is enabled. To address this, this patch re-orders operations so that we clear storage first for synchronous lease release, followed by the in-memory copy.

Fixes: #118522, fixes #118523, fixes #118521, fixes #118550

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Faizan Qazi <[email protected]>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jan 31, 2024
… range merges

Fixes cockroachdb#61986.
Fixes cockroachdb#117486.
Unblocks cockroachdb#118000.

This commit uses the new `tscache.Cache.Serialize` method introduced in cockroachdb#118299 to
ship high-resolution summaries of the timestamp cache during lease transfers and
ranges merges. In doing so, it eliminates the loss of precision that occurs in an
incoming leaseholder's timestamp cache when it receives a lease transfer or range
merge.

This loss of precision was a source of transaction retries for three reasons:
1. txn tombstone marker keys would have their timestamp advanced, leading to
   TransactionAbortedError with the `ABORT_REASON_NEW_LEASE_PREVENTS_TXN`
   reason.
2. txn push marker keys would have their timestamp advanced, leading to
   transactions having their commit timestamp pushed, which could lead to
   TransactionRetryError with the `RETRY_SERIALIZABLE` reason.
3. global keys would have their timestamp advanced as if they had been read,
   also leading to transactions having their commit timestamp pushed if they
   wrote to those keys, which could also lead to TransactionRetryError with the
   `RETRY_SERIALIZABLE` reason.

The first issue here is the most severe, both because it can not be refreshed
away and because it affects transactions of all isolation levels.

This commit introduces two new cluster settings to control the maximum size of
these timestamp cache read summaries:
- `kv.lease_transfer_read_summary.local_budget`
- `kv.lease_transfer_read_summary.global_budget`

It configures the local keyspace budget to 4MB and the global keyspace budget to
0B. This default configuration should be sufficient to eliminate the first two
sources of retries described above. The third has not been observed as a serious
issue in practice, so we configure the global budget to 0 so that we can hit a
serialization fast-path.

Release note (ops change): Transaction replay protection state is now
passed between the outgoing and incoming leaseholder for a range during
a lease transfer. This avoids cases where lease transfers can cause
transactions to throw TransactionAbortedError(ABORT_REASON_NEW_LEASE_PREVENTS_TXN) errors.
craig bot pushed a commit that referenced this issue Feb 1, 2024
115746: kv: log on excessive latch hold duration r=lyang24 a=lyang24

This commit aims to help observability by logging request holding latches over
threshold. long_latch_hold_duration is a new cluster setting that is introduced
to set the latch holding time threshold, latches held over the threshold will
be logged at most every second. To achieve logging spanlatch.manager now
contains a pointer to cluster setting.

Fixes: #114609

Release note: None

118300: kv: ship high-resolution tscache summaries during lease transfers and range merges r=nvanbenschoten a=nvanbenschoten

Fixes #61986.
Fixes #117486.
Unblocks #118000.

This commit uses the new `tscache.Cache.Serialize` method introduced in #118299 to ship high-resolution summaries of the timestamp cache during lease transfers and ranges merges. In doing so, it eliminates the loss of precision that occurs in an incoming leaseholder's timestamp cache when it receives a lease transfer or range merge.

This loss of precision was a source of transaction retries for three reasons:
1. txn tombstone marker keys would have their timestamp advanced, leading to TransactionAbortedError with the `ABORT_REASON_NEW_LEASE_PREVENTS_TXN` reason.
2. txn push marker keys would have their timestamp advanced, leading to transactions having their commit timestamp pushed, which could lead to TransactionRetryError with the `RETRY_SERIALIZABLE` reason.
3. global keys would have their timestamp advanced as if they had been read, also leading to transactions having their commit timestamp pushed if they wrote to those keys, which could also lead to TransactionRetryError with the `RETRY_SERIALIZABLE` reason.

The first issue here is the most severe, both because it can not be refreshed away and because it affects transactions of all isolation levels.

This commit introduces two new cluster settings to control the maximum size of these timestamp cache read summaries:
- `kv.lease_transfer_read_summary.local_budget`
- `kv.lease_transfer_read_summary.global_budget`

It configures the local keyspace budget to 4MB and the global keyspace budget to 0B. This default configuration should be sufficient to eliminate the first two sources of retries described above. The third has not been observed as a serious issue in practice, so we configure the global budget to 0 so that we can hit a serialization fast-path.

Release note (ops change): Transaction replay protection state is now passed between the outgoing and incoming leaseholder for a range during a lease transfer. This avoids cases where lease transfers can cause transactions to throw TransactionAbortedError(ABORT_REASON_NEW_LEASE_PREVENTS_TXN) errors.

Co-authored-by: lyang24 <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig craig bot closed this as completed in a4c578c Feb 2, 2024
craig bot pushed a commit that referenced this issue Feb 2, 2024
118000: roachtest: run Read Committed variants of TPC-C without txn retry loops r=nvanbenschoten a=nvanbenschoten

Closes #115191.
Depends on #61986.

This commit switches the two nightly Read Committed variants of the TPC-C roachtest to run without transaction retry loops, using the `--txn-retries` flag introduced in #117096. With #117630 and #61986 resolved (the latter of which is still in review and under development), these tests both pass.

Release note: None

118600: gceworker bootstrap: fix checksum for go download r=rickystewart a=jlinder

On the last go version upgrade to cockroach, the wrong checksum was entered in the gceworker bootstrap script for the downloaded tar file. This fixes it to be the correct checksum.

Epic: none
Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: James H. Linder <[email protected]>
cockroach-dev-inf pushed a commit that referenced this issue Feb 15, 2024
Informs #115191.

This commit adds a `--txn-retries` flag to `tpcc`, allowing users of the
workload to disable transaction retry loops for 40001 errors.

For Serializable transactions (controllable with `--isolation-level`),
this quickly leads to errors being thrown by the workload. For Read
Committed transaction, the workload eventually hits an error due to
a lease transfer. Combined with a prototype fix for #61986, Read
Committed transactions survive for at least a few minutes without
error.

Release note: None
wenyihu6 pushed a commit to wenyihu6/cockroach that referenced this issue Feb 21, 2024
Informs cockroachdb#61986.

This commit adds a new `Serialize` function to `tscache.Cache` implementations.
This serialization uses the `readsummary/rspb.Segment` representation added in
0950a1e. Serialization of the production `sklImpl` uses the Segment merging
logic added in e4fc6f1 in order to merge together a partial serializations of
each individual `sklPage` in the data structure.

Timestamp cache serialization will be used to address cockroachdb#61986.

Release note: None
wenyihu6 pushed a commit to wenyihu6/cockroach that referenced this issue Feb 21, 2024
… range merges

Fixes cockroachdb#61986.
Fixes cockroachdb#117486.
Unblocks cockroachdb#118000.

This commit uses the new `tscache.Cache.Serialize` method introduced in cockroachdb#118299 to
ship high-resolution summaries of the timestamp cache during lease transfers and
ranges merges. In doing so, it eliminates the loss of precision that occurs in an
incoming leaseholder's timestamp cache when it receives a lease transfer or range
merge.

This loss of precision was a source of transaction retries for three reasons:
1. txn tombstone marker keys would have their timestamp advanced, leading to
   TransactionAbortedError with the `ABORT_REASON_NEW_LEASE_PREVENTS_TXN`
   reason.
2. txn push marker keys would have their timestamp advanced, leading to
   transactions having their commit timestamp pushed, which could lead to
   TransactionRetryError with the `RETRY_SERIALIZABLE` reason.
3. global keys would have their timestamp advanced as if they had been read,
   also leading to transactions having their commit timestamp pushed if they
   wrote to those keys, which could also lead to TransactionRetryError with the
   `RETRY_SERIALIZABLE` reason.

The first issue here is the most severe, both because it can not be refreshed
away and because it affects transactions of all isolation levels.

This commit introduces two new cluster settings to control the maximum size of
these timestamp cache read summaries:
- `kv.lease_transfer_read_summary.local_budget`
- `kv.lease_transfer_read_summary.global_budget`

It configures the local keyspace budget to 4MB and the global keyspace budget to
0B. This default configuration should be sufficient to eliminate the first two
sources of retries described above. The third has not been observed as a serious
issue in practice, so we configure the global budget to 0 so that we can hit a
serialization fast-path.

Release note (ops change): Transaction replay protection state is now
passed between the outgoing and incoming leaseholder for a range during
a lease transfer. This avoids cases where lease transfers can cause
transactions to throw TransactionAbortedError(ABORT_REASON_NEW_LEASE_PREVENTS_TXN) errors.
wenyihu6 pushed a commit to wenyihu6/cockroach that referenced this issue Feb 21, 2024
Closes cockroachdb#115191.
Depends on cockroachdb#61986.

This commit switches the two nightly Read Committed variants of the TPC-C
roachtest to run without transaction retry loops, using the `--txn-retries` flag
introduced in cockroachdb#117096. With cockroachdb#117630 and cockroachdb#61986 resolved (the latter of which
is still in review and under development), these tests both pass.

Release note: None
@github-project-automation github-project-automation bot moved this to Closed in KV Aug 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-distribution Relating to rebalancing and leasing. A-kv-transactions Relating to MVCC and the transactional model. A-read-committed Related to the introduction of Read Committed C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs P-3 Issues/test failures with no fix SLA T-kv KV Team
Projects
No open projects
Status: Closed
4 participants