kv: reduce contention to single consensus round via multiple intents #91848

sumeerbhola · 2022-11-14T17:11:46Z

The high-level idea was discussed in #41720 but that issue is very broad, and some details were lacking.
#66867 is a related issue which is limited to releasing latches before starting to evaluate intent resolution. It omits a specific design, though does hint at an approach that is further elaborated here.

To reduce contention to a single consensus round, i.e., the round to write the intent, we want to start doing conflicting writes and reads as soon as the txn transitions to STAGING and has verified all the conditions for commit, but before it actually does the transition to COMMIT (which involves another round of consensus). The details are somewhat tricky wrt maintaining cross data-structure invariants within a node, and wrt loss of in-memory state (due to memory pressure or node restart). These are sketched out below.

Simple Committed

We allow a key (with multiple versions) to have multiple intents, under the condition that at most one of the intents is uncommitted.

To aid this behavior we introduce a per-node in-memory commitMap, that maintains a logical set of TxnIDs of transactions that were "simple" in their behavior, and have committed, where simple is defined as all the following conditions:

No savepoint rollbacks: len(intent.IgnoredSeqNums)==0
Single epoch, i.e., TxnMeta.Epoch==0
Never pushed, i.e., TxnMeta.MinTimestamp==TxnMeta.WriteTimestamp. There is a relaxation of this discussed below where all the writes done by that transaction are at its commit timestamp, i.e., pushing happened before writing.

The first two conditions ensure that there are no stray intents written by that transaction that will not commit. That plus the last condition ensures that their provisional values can be considered committed with the current version and local timestamp, i.e., we need no additional information about the txn other than the TxnID.

Adding to commitMap

The earliest a txn can be added to the commitMap is when the transaction is in STAGING and has verified all the conditions for commit. That is, this can be done before the transition to COMMITTED is replicated. For the node that is the leaseholder of the range containing the txn record, this requires no external communication. For other nodes with intents for the txn, one would send a RPC (this is earlier than the RPC for async intent resolution). These RPCs should also remove locks from the non-persistent lock table data-structure so need to send information about the keys (like in a LockUpdate but we won't remove the durable lock state).

Removal from commitMap

The commitMap can be considered a cache of TxnIDs. It is helpful to have a txn in the cache until its intents have been resolved, but we don't want to guarantee that due to memory pressure or node failure.

Latchless intent resolution must pin a txn in the map before it releases latches and unpin when the intent resolution has been applied on the leaseholder. This pinning behavior is needed to ensure the correctness of the in-memory concurrency.lockTable, which must maintain the property that the replicated locks known to it are a subset of the persistent replicated locks (we are assuming here that there is no lockTable on followers). If it did not pin, and a txn was evicted from the commitMap during the intent resolution, a conflicting reader/writer could find the persistent lock and add it to the in-memory concurrency.lockTable even though the lock has been resolved.

Why "simple-committed"?

Goals:

We don't want to have to coordinate intent resolution of these multiple intents, by mandating that the resolution happen in any particular order.
We want to guarantee that even if the commitMap is cleared (since it is a cache), we can maintain the invariant that a caller iterating over a key sees at most one intent. As we will illustrate below, providing this guarantee requires us to limit the commitMap to only contain simple-committed txns.

Example:
Consider a key with timestamps t5, t4, t3, t2, t1 in decreasing order and intents for t5, t4, t2, with corresponding txns txn5, txn4, txn2. We consider the disposition of an intent to be either unknown or simple-committed. In this example, the disposition of the intent for t4 and t2 is simple-committed solely based on the fact that there is at least one version (provisional or otherwise) more recent that the timestamp of the intent. That is, at most one intent, the one for t5, has a disposition that needs to rely on knowledge that is not self-contained in the history. For t5, we must rely on the commitMap to decide whether is unknown or
simple-committed.

It is possible that some user of the intentInterleavingIter saw t5 as simple-committed and a later user sees it as unknown disposition, if the txn5 got removed from the commitMap. Such regression is harmless since the latter user will simply have to do intent resolution. Note that the intent for t5 could get resolved before those for t4 and t2, and that is also fine since the disposition of t4, t2 stays simple-committed. If txn5 is aborted and the intent for t5 removed, and txn4 is no longer in the commitMap, the disposition of t4 could change to unknown. This is also acceptable, since t5 was only serving as a "local promise" that t4 was committed, which is simply an optimization. There is still a less efficient globally available "promise" that t4 is committed, and intent resolution of t4 is how we will enact that promise.

Maintaining the above guarantees requires that historical versions must not be garbage collected without resolving intents. This is acceptable since GC is not latency sensitive.

Performance Considerations

commitMap per range or node: Scoping it to a range may be simpler, and result in smaller maps that have a simpler synchronization story (see next bullet). But the same TxnID can be in multiple maps, so there is the overhead of duplication.
Synchronization of commitMap: An ideal data-structure would offer very efficient O(1) lookup without the need to acquire a mutex, since map lookup happens whenever the intentInterleavingIter encounters an intent. A copy-on-write approach would be ideal for reads (which may be more viable with a per-range map). We also need to support a high rate of insertions. Removal can be batched, say by removing everything added > 50ms in the past, or based on a size constraint.
Write-write contention and lack of simple-committed writes: It is possible that the sequence of transactions that write to a key do not have monotonic timestamps (since with write-write contention there is a queue of txns waiting to write), so the successful write will be at a different timestamp than the txn's original timestamp. Which will defeat the simple-committed scheme. If we could tell that all the writes done by a txn to a range were at its commit timestamp, even if it was not the same as the txn's original timestamp, we could regain this optimization. Perhaps we can track this in the Transaction record in a loose manner (to bound the bytes needed for such tracking) -- it doesn't need to be persisted, so we won't be paying another consensus round.

Some of the intentInterleavingIter changes are prototyped in #81259

cc: @nvanbenschoten @AlexTalks

Jira issue: CRDB-21456

sumeerbhola added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-transactions Relating to MVCC and the transactional model. labels Nov 14, 2022

blathers-crl bot added the T-kv KV Team label Nov 17, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kv: reduce contention to single consensus round via multiple intents #91848

kv: reduce contention to single consensus round via multiple intents #91848

sumeerbhola commented Nov 14, 2022 •

edited by cockroach-jira-scripts

Loading

kv: reduce contention to single consensus round via multiple intents #91848

kv: reduce contention to single consensus round via multiple intents #91848

Comments

sumeerbhola commented Nov 14, 2022 • edited by cockroach-jira-scripts Loading

Simple Committed

Adding to commitMap

Removal from commitMap

Why "simple-committed"?

Performance Considerations

sumeerbhola commented Nov 14, 2022 •

edited by cockroach-jira-scripts

Loading