[WIP] storage: intentInterleavingIter support for multiple intents #81259

sumeerbhola · 2022-05-13T23:40:12Z

We allow a key (with multiple versions) to have multiple intents, under the
condition that at most one of the intents is uncommitted.

To aid this behavior we introduce a commitMap, that maintains a logical set
of TxnIDs of transactions that were "simple" in their behavior, and have
committed, where simple is defined as all the following conditions:

- No savepoint rollbacks: len(intent.IgnoredSeqNums)==0
- Single epoch, i.e., TxnMeta.Epoch==0
- Never pushed, i.e., TxnMeta.MinTimestamp==TxnMeta.WriteTimestamp

For such transactions, their provisional values can be considered committed
with the current version and local timestamp, i.e., we need no additional
information about the txn other than the TxnID.
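
As a rough illustration, the three conditions can be written as a single
predicate. This is only a sketch: IntentMeta, SeqRange, Timestamp and
isSimple are hypothetical stand-ins that carry just the fields named above,
not the real types in the tree.

```go
package main

// Timestamp is a simplified stand-in for an HLC timestamp (assumption:
// a single integer is enough for this illustration).
type Timestamp int64

// SeqRange is a hypothetical stand-in for one ignored sequence number range.
type SeqRange struct{ Start, End int32 }

// IntentMeta is a hypothetical struct holding only the fields that the
// "simple" conditions refer to.
type IntentMeta struct {
	Epoch          int32
	MinTimestamp   Timestamp
	WriteTimestamp Timestamp
	IgnoredSeqNums []SeqRange
}

// isSimple reports whether all three conditions hold for the txn that
// wrote the intent.
func isSimple(m IntentMeta) bool {
	noSavepointRollbacks := len(m.IgnoredSeqNums) == 0
	singleEpoch := m.Epoch == 0
	neverPushed := m.MinTimestamp == m.WriteTimestamp
	return noSavepointRollbacks && singleEpoch && neverPushed
}
```

A txn that fails any one of the checks (for example, one pushed to a later
WriteTimestamp) is not eligible for the commitMap in this sketch.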

Adding to commitMap:
The earliest a txn can be added to the commitMap is when the transaction is
in STAGING and has verified all the conditions for commit. That is, this
can be done before the transition to COMMITTED is replicated. For the node
that has the leaseholder of the range containing the txn record, this
requires no external communication. For other nodes with intents for the
txn, one could piggyback this information on the RPCs for async intent
resolution, and add to the commitMap before doing the intent resolution
-- this piggybacking would incur 2 consensus rounds of contention. If we
are willing to send the RPC earlier, it will be contention for 1 consensus
round only. Note that these RPCs should also remove locks from the
non-persistent lock table data-structure, so they should send information
about the keys (as in a LockUpdate, but without removing the durable lock
state).

Removal from commitMap:
The commitMap can be considered a cache of TxnIDs. It is helpful to have a
txn in the cache until its intents have been resolved. Additionally,
latchless intent resolution must pin a txn in the map before it releases
latches and unpin when the intent resolution has been applied on the
leaseholder. This pinning behavior is needed to ensure the correctness of
the in-memory concurrency.lockTable, which must maintain the property that
the replicated locks known to it are a subset of the persistent replicated
locks. We are assuming here that there is no lockTable on followers.
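
A minimal sketch of the cache-with-pinning behavior described above, assuming
a simple pin count per txn. The names (commitMap methods, TxnID as a string)
are illustrative only; the PR itself does not include a sophisticated
commitMap, and the real one may look quite different.

```go
package main

import "sync"

// TxnID is a stand-in for the real transaction ID type (assumption).
type TxnID string

// commitMap is a hypothetical cache of simple-committed TxnIDs. The map
// value is the number of outstanding pins for that txn.
type commitMap struct {
	mu   sync.Mutex
	pins map[TxnID]int
}

func newCommitMap() *commitMap {
	return &commitMap{pins: make(map[TxnID]int)}
}

// Add records that the txn is simple-committed.
func (c *commitMap) Add(id TxnID) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.pins[id]; !ok {
		c.pins[id] = 0
	}
}

// Contains reports whether the txn is currently known to be simple-committed.
func (c *commitMap) Contains(id TxnID) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	_, ok := c.pins[id]
	return ok
}

// Pin prevents eviction of the txn. It returns false if the txn is not in
// the map, in which case the caller cannot rely on it.
func (c *commitMap) Pin(id TxnID) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	n, ok := c.pins[id]
	if !ok {
		return false
	}
	c.pins[id] = n + 1
	return true
}

// Unpin releases one pin, e.g. once intent resolution has been applied.
func (c *commitMap) Unpin(id TxnID) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if n, ok := c.pins[id]; ok && n > 0 {
		c.pins[id] = n - 1
	}
}

// Evict removes the txn unless it is pinned, and reports whether it did.
func (c *commitMap) Evict(id TxnID) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	n, ok := c.pins[id]
	if !ok || n > 0 {
		return false
	}
	delete(c.pins, id)
	return true
}
```

In this sketch, latchless intent resolution would call Pin before releasing
latches and Unpin after the resolution has applied on the leaseholder; Evict
is a no-op in between.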

Why "simple-committed":

- We don't want to have to coordinate intent resolution of these multiple
  intents, by mandating that the resolution happen in any particular order.
- We want to guarantee that even if the commitMap is cleared (since it is a
  cache), we can maintain the invariant that a caller iterating over a key
  sees at most one intent. As we will illustrate below, providing this
  guarantee requires us to limit the commitMap to only contain
  simple-committed txns.

Consider a key with timestamps t5, t4, t3, t2, t1 in decreasing order and
intents for t5, t4, t2, with corresponding txns txn5, ... txn1. We consider
the disposition of an intent to be either unknown or simple-committed. In
this example, the disposition of the intent for t4 and t2 is
simple-committed solely based on the fact that there is at least one
version (provisional or otherwise) more recent than the timestamp of the
intent. That is, at most one intent, the one for t5, has a disposition that
needs to rely on knowledge that is not self-contained in the history. For
t5, we must rely on the commitMap to decide whether it is unknown or
simple-committed. It is possible that some user of the
intentInterleavingIter saw t5 as simple-committed and a later user sees it
as unknown, if txn5 got removed from the commitMap -- such a
regression is harmless since the later user will simply have to do intent
resolution. Note that the intent for t5 could get resolved before those for
t4 and t2, and that is also fine since the disposition of t4, t2 stays
simple-committed. If txn5 is aborted and the intent for t5 removed, and
txn4 is no longer in the commitMap, the disposition of t4 could change to
unknown. This is also acceptable, since t5 was only serving as a "local
promise" that t4 was committed, which is simply an optimization. There is
still a less efficient globally available "promise" that t4 is committed,
and intent resolution of t4 is how we will enact that promise.
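
The disposition rule in this example can be sketched as follows. This is an
illustration under simplifying assumptions, not the intentInterleavingIter
logic: timestamps are plain integers, versions are given newest first, and a
plain map of txn names stands in for the commitMap. All names are
hypothetical.

```go
package main

// Disposition of an intent as seen by a reader of the key.
type Disposition int

const (
	Unknown Disposition = iota
	SimpleCommitted
)

// Version is one version of a key. IntentTxn is empty when the version has
// no intent (it is already resolved).
type Version struct {
	TS        int64
	IntentTxn string
}

// dispositions returns the disposition of every intent, keyed by version
// timestamp. versions must be sorted newest first.
func dispositions(versions []Version, committed map[string]bool) map[int64]Disposition {
	out := make(map[int64]Disposition)
	for i, v := range versions {
		if v.IntentTxn == "" {
			continue
		}
		// Any intent with a more recent version above it is simple-committed
		// based on the history alone. Only the newest version needs the
		// commitMap lookup.
		if i > 0 || committed[v.IntentTxn] {
			out[v.TS] = SimpleCommitted
		} else {
			out[v.TS] = Unknown
		}
	}
	return out
}
```

With versions t5..t1 and intents on t5, t4, t2, an empty committed set gives
t5 unknown and t4, t2 simple-committed. If the t5 version is later removed
(txn5 aborted) and txn4 is not in the set, t4 becomes unknown while t2 stays
simple-committed.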

Maintaining the above guarantees requires that historical versions must not
be garbage collected without resolving intents. This is acceptable since GC
is not latency sensitive.

This PR only introduces the changes for the intentInterleavingIter.
It excludes:

- A sophisticated commitMap data-structure. There are code comments
  sketching out what properties are desirable.
- The iterForKeyVersions implementation used for intent resolution,
  now that we can have multiple intents. This will be straightforward.
- The changes for latchless intent resolution. These should be
  straightforward once we have a commitMap with pinning support and we
  extend #55461 (kv: batch evaluation should operate on a consistent
  storage snapshot) to use a consistent storage snapshot for intent
  resolution.
- The KV layer changes to send lists of intent keys for simple-committed
  txns to the various ranges, so that they can add to their commitMap and
  remove from the in-memory lockTable.

Informs #66867

Release note: None


sumeerbhola

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @AlexTalks, and @nvanbenschoten)

-- commits line 12 at r1:
This is fine for read-write contention, but there may be a problem in the write-write contention case, especially if throughput is bottlenecked by write-write contention (some txn is always waiting for a lock to be released). The sequence of transactions that write to a key probably does not have monotonic timestamps, so the successful write will be at a different timestamp than the txn's original timestamp, which will defeat the simple-committed scheme. If we could tell that all the writes done by a txn to a range were at its commit timestamp, even if that was not the same as the txn's original timestamp, we could regain this optimization. Perhaps we can track this in the Transaction record in a loose manner (to bound the bytes needed for such tracking) -- it doesn't need to be persisted, so we won't be paying for another consensus round.

sumeerbhola requested review from ajwerner, AlexTalks and nvanbenschoten May 13, 2022 23:40


nvanbenschoten removed their request for review September 27, 2022 19:40

sumeerbhola mentioned this pull request Nov 14, 2022

kv: reduce contention to single consensus round via multiple intents #91848

Open
