kv: reduce contention to single consensus round via multiple intents #91848
Labels
A-kv-transactions
Relating to MVCC and the transactional model.
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
T-kv
KV Team
The high-level idea was discussed in #41720 but that issue is very broad, and some details were lacking.
#66867 is a related issue which is limited to releasing latches before starting to evaluate intent resolution. It omits a specific design, though does hint at an approach that is further elaborated here.
To reduce contention to a single consensus round, i.e., the round to write the intent, we want to start doing conflicting writes and reads as soon as the txn transitions to STAGING and has verified all the conditions for commit, but before it actually does the transition to COMMIT (which involves another round of consensus). The details are somewhat tricky wrt maintaining cross data-structure invariants within a node, and wrt loss of in-memory state (due to memory pressure or node restart). These are sketched out below.
Simple Committed
We allow a key (with multiple versions) to have multiple intents, under the condition that at most one of the intents is uncommitted.
To aid this behavior we introduce a per-node in-memory commitMap, that maintains a logical set of TxnIDs of transactions that were "simple" in their behavior, and have committed, where simple is defined as all the following conditions:
The first two conditions ensure that there are no stray intents written by that transaction that will not commit. That plus the last condition ensures that their provisional values can be considered committed with the current version and local timestamp, i.e., we need no additional information about the txn other than the TxnID.
Adding to commitMap
The earliest a txn can be added to the commitMap is when the transaction is in STAGING and has verified all the conditions for commit. That is, this can be done before the transition to COMMITTED is replicated. For the node that is the leaseholder of the range containing the txn record, this requires no external communication. For other nodes with intents for the txn, one would send a RPC (this is earlier than the RPC for async intent resolution). These RPCs should also remove locks from the non-persistent lock table data-structure so need to send information about the keys (like in a LockUpdate but we won't remove the durable lock state).
Removal from commitMap
The commitMap can be considered a cache of TxnIDs. It is helpful to have a txn in the cache until its intents have been resolved, but we don't want to guarantee that due to memory pressure or node failure.
Latchless intent resolution must pin a txn in the map before it releases latches and unpin when the intent resolution has been applied on the leaseholder. This pinning behavior is needed to ensure the correctness of the in-memory concurrency.lockTable, which must maintain the property that the replicated locks known to it are a subset of the persistent replicated locks (we are assuming here that there is no lockTable on followers). If it did not pin, and a txn was evicted from the commitMap during the intent resolution, a conflicting reader/writer could find the persistent lock and add it to the in-memory concurrency.lockTable even though the lock has been resolved.
Why "simple-committed"?
Goals:
Example:
Consider a key with timestamps t5, t4, t3, t2, t1 in decreasing order and intents for t5, t4, t2, with corresponding txns txn5, txn4, txn2. We consider the disposition of an intent to be either unknown or simple-committed. In this example, the disposition of the intent for t4 and t2 is simple-committed solely based on the fact that there is at least one version (provisional or otherwise) more recent that the timestamp of the intent. That is, at most one intent, the one for t5, has a disposition that needs to rely on knowledge that is not self-contained in the history. For t5, we must rely on the commitMap to decide whether is unknown or
simple-committed.
It is possible that some user of the intentInterleavingIter saw t5 as simple-committed and a later user sees it as unknown disposition, if the txn5 got removed from the commitMap. Such regression is harmless since the latter user will simply have to do intent resolution. Note that the intent for t5 could get resolved before those for t4 and t2, and that is also fine since the disposition of t4, t2 stays simple-committed. If txn5 is aborted and the intent for t5 removed, and txn4 is no longer in the commitMap, the disposition of t4 could change to unknown. This is also acceptable, since t5 was only serving as a "local promise" that t4 was committed, which is simply an optimization. There is still a less efficient globally available "promise" that t4 is committed, and intent resolution of t4 is how we will enact that promise.
Maintaining the above guarantees requires that historical versions must not be garbage collected without resolving intents. This is acceptable since GC is not latency sensitive.
Performance Considerations
Some of the intentInterleavingIter changes are prototyped in #81259
cc: @nvanbenschoten @AlexTalks
Jira issue: CRDB-21456
The text was updated successfully, but these errors were encountered: