[WIP] storage: intentInterleavingIter support for multiple intents #81259

Draft
wants to merge 1 commit into
base: master
sumeerbhola

We allow a key (with multiple versions) to have multiple intents, under the
condition that at most one of the intents is uncommitted.

To aid this behavior we introduce a commitMap that maintains a logical set
of TxnIDs of transactions that were "simple" in their behavior and have
committed, where simple means that all of the following conditions hold:

  • No savepoint rollbacks: len(intent.IgnoredSeqNums)==0
  • Single epoch, i.e., TxnMeta.Epoch==0
  • Never pushed, i.e., TxnMeta.MinTimestamp==TxnMeta.WriteTimestamp

For such transactions, their provisional values can be considered committed
with the current version and local timestamp, i.e., we need no additional
information about the txn other than the TxnID.
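
A minimal sketch of the "simple" predicate, checking the three conditions listed above. The `Timestamp`, `TxnMeta`, and `Intent` types here are pared-down stand-ins written for this example (only the field names `IgnoredSeqNums`, `Epoch`, `MinTimestamp`, and `WriteTimestamp` come from the description); `isSimple` is a hypothetical helper, not a function in this PR.

```go
package main

import "fmt"

// Timestamp is a simplified stand-in for an HLC timestamp (assumption).
type Timestamp struct {
	WallTime int64
	Logical  int32
}

// IgnoredSeqNumRange is a stand-in for a rolled-back sequence number range.
type IgnoredSeqNumRange struct{ Start, End int32 }

// TxnMeta is a hypothetical, reduced version of the txn metadata that is
// stored with an intent.
type TxnMeta struct {
	ID             string
	Epoch          int32
	MinTimestamp   Timestamp
	WriteTimestamp Timestamp
}

// Intent is a hypothetical, reduced intent record.
type Intent struct {
	Txn            TxnMeta
	IgnoredSeqNums []IgnoredSeqNumRange
}

// isSimple reports whether the intent satisfies all three "simple"
// conditions: no savepoint rollbacks, a single epoch, and never pushed.
func isSimple(in Intent) bool {
	return len(in.IgnoredSeqNums) == 0 &&
		in.Txn.Epoch == 0 &&
		in.Txn.MinTimestamp == in.Txn.WriteTimestamp
}

func main() {
	ts := Timestamp{WallTime: 10}
	simple := Intent{Txn: TxnMeta{ID: "a", MinTimestamp: ts, WriteTimestamp: ts}}

	// A pushed txn has a write timestamp above its minimum timestamp.
	pushed := simple
	pushed.Txn.WriteTimestamp = Timestamp{WallTime: 12}

	fmt.Println(isSimple(simple), isSimple(pushed))
}
```

Only a txn that passes this check and has committed would be eligible for the commitMap.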

Adding to commitMap:
The earliest a txn can be added to the commitMap is when the transaction is
in STAGING and has verified all the conditions for commit. That is, this
can be done before the transition to COMMITTED is replicated. For the node
that has the leaseholder of the range containing the txn record, this
requires no external communication. For other nodes with intents for the
txn, one could piggyback this information on the RPCs for async intent
resolution, and add to the commitMap before doing the intent resolution
-- this piggybacking would incur 2 consensus rounds of contention. If we
are willing to send the RPC earlier, it will be contention for 1 consensus
round only. Note that these RPCs should also remove locks from the
non-persistent lock table data-structure, so they should carry information
about the keys (as a LockUpdate does, but without removing the durable lock
state).

Removal from commitMap:
The commitMap can be considered a cache of TxnIDs. It is helpful to have a
txn in the cache until its intents have been resolved. Additionally,
latchless intent resolution must pin a txn in the map before it releases
latches and unpin when the intent resolution has been applied on the
leaseholder. This pinning behavior is needed to ensure the correctness of
the in-memory concurrency.lockTable, which must maintain the property that
the replicated locks known to it are a subset of the persistent replicated
locks. We are assuming here that there is no lockTable on followers.
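
The cache-with-pinning behavior described above can be sketched as follows. This is an illustrative toy, not the data structure from this PR (which leaves a sophisticated commitMap out of scope): the `commitMap` type, its method names, and the pin-count representation are all assumptions made for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// TxnID is a stand-in for a transaction UUID (assumption).
type TxnID string

// commitMap is a hypothetical in-memory cache of simple-committed TxnIDs.
// A txn is present iff it has an entry; the value is its pin count.
type commitMap struct {
	mu   sync.Mutex
	pins map[TxnID]int
}

func newCommitMap() *commitMap { return &commitMap{pins: map[TxnID]int{}} }

// Add records a txn as simple-committed. It is idempotent.
func (m *commitMap) Add(id TxnID) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if _, ok := m.pins[id]; !ok {
		m.pins[id] = 0
	}
}

// Contains reports whether the txn is currently known to be simple-committed.
func (m *commitMap) Contains(id TxnID) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	_, ok := m.pins[id]
	return ok
}

// Pin prevents eviction of id, e.g. before latchless intent resolution
// releases its latches. It returns false if id is not in the map.
func (m *commitMap) Pin(id TxnID) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	if _, ok := m.pins[id]; !ok {
		return false
	}
	m.pins[id]++
	return true
}

// Unpin releases one pin, e.g. once the resolution has been applied.
func (m *commitMap) Unpin(id TxnID) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if n, ok := m.pins[id]; ok && n > 0 {
		m.pins[id] = n - 1
	}
}

// Evict removes id only if it is unpinned, and reports whether it did so.
func (m *commitMap) Evict(id TxnID) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	n, ok := m.pins[id]
	if !ok || n > 0 {
		return false
	}
	delete(m.pins, id)
	return true
}

func main() {
	m := newCommitMap()
	m.Add("txn-a")
	m.Pin("txn-a")
	fmt.Println(m.Evict("txn-a")) // pinned, so eviction is refused
	m.Unpin("txn-a")
	fmt.Println(m.Evict("txn-a")) // unpinned, so eviction succeeds
}
```

Because the map is only a cache, evicting an unpinned entry is always safe: a later reader simply sees the intent's disposition as unknown and falls back to intent resolution.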

Why "simple-committed":

  • We don't want to have to coordinate intent resolution of these multiple
    intents, by mandating that the resolution happen in any particular order.
  • We want to guarantee that even if the commitMap is cleared (since it is a
    cache), we can maintain the invariant that a caller iterating over a key
    sees at most one intent. As we will illustrate below, providing this
    guarantee requires us to limit the commitMap to only contain
    simple-committed txns.

Consider a key with timestamps t5, t4, t3, t2, t1 in decreasing order and
intents for t5, t4, t2, with corresponding txns txn5, ... txn1. We consider
the disposition of an intent to be either unknown or simple-committed. In
this example, the disposition of the intents for t4 and t2 is
simple-committed solely based on the fact that there is at least one
version (provisional or otherwise) more recent than the timestamp of the
intent. That is, at most one intent, the one for t5, has a disposition that
needs to rely on knowledge that is not self-contained in the history. For
t5, we must rely on the commitMap to decide whether it is unknown or
simple-committed. It is possible that some user of the
intentInterleavingIter saw t5 as simple-committed and a later user sees it
as unknown disposition, if txn5 was removed from the commitMap in between --
such a regression is harmless since the latter user will simply have to do
intent resolution. Note that the intent for t5 could get resolved before those for
t4 and t2, and that is also fine since the disposition of t4, t2 stays
simple-committed. If txn5 is aborted and the intent for t5 removed, and
txn4 is no longer in the commitMap, the disposition of t4 could change to
unknown. This is also acceptable, since t5 was only serving as a "local
promise" that t4 was committed, which is simply an optimization. There is
still a less efficient globally available "promise" that t4 is committed,
and intent resolution of t4 is how we will enact that promise.
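
The disposition rule from the example above can be sketched in a few lines. This is a toy model under stated assumptions, not the intentInterleavingIter logic: `version`, `disposition`, and `dispositions` are hypothetical names, timestamps are plain integers, and the commitMap is modeled as a set of txn IDs.

```go
package main

import "fmt"

// version is one MVCC version of a key. A non-empty txnID means the
// version still has an intent (it is provisional).
type version struct {
	ts    int // larger is newer
	txnID string
}

type disposition int

const (
	unknown disposition = iota
	simpleCommitted
)

// dispositions expects versions sorted newest first and returns the
// disposition of every intent, keyed by timestamp. An intent with any
// newer version above it is simple-committed based on the history alone;
// only the newest version needs the commitMap (modeled as a set).
func dispositions(versions []version, committed map[string]bool) map[int]disposition {
	out := map[int]disposition{}
	for i, v := range versions {
		if v.txnID == "" {
			continue // already resolved; no intent
		}
		if i > 0 || committed[v.txnID] {
			out[v.ts] = simpleCommitted
		} else {
			out[v.ts] = unknown
		}
	}
	return out
}

func main() {
	key := []version{{5, "txn5"}, {4, "txn4"}, {3, ""}, {2, "txn2"}, {1, ""}}

	// Empty commitMap: only t5 is unknown.
	d := dispositions(key, map[string]bool{})
	fmt.Println(d[5] == unknown, d[4] == simpleCommitted, d[2] == simpleCommitted)

	// txn5 aborted and its intent removed: t4 now depends on the commitMap.
	d = dispositions(key[1:], map[string]bool{})
	fmt.Println(d[4] == unknown, d[2] == simpleCommitted)
}
```

In both calls at most one intent has unknown disposition, which is the invariant the description relies on when the commitMap is cleared.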

Maintaining the above guarantees requires that historical versions must not
be garbage collected without resolving intents. This is acceptable since GC
is not latency sensitive.

This PR only introduces the changes for the intentInterleavingIter.
It excludes:

  • A sophisticated commitMap data-structure. There are code comments
    sketching out what properties are desirable.
  • The iterForKeyVersions implementation used for intent resolution,
    now that we can have multiple intents. This will be straightforward.
  • The changes for latchless intent resolution. These should be
    straightforward once we have a commitMap with pinning support and we
    extend #55461 (kv: batch evaluation should operate on a consistent
    storage snapshot) to use a consistent storage snapshot for intent
    resolution.
  • The KV layer changes to send lists of intent keys for simple-committed
    txns to the various ranges, so that they can add to their commitMap and
    remove from the in-memory lockTable.

Informs #66867

Release note: None

@sumeerbhola sumeerbhola left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @AlexTalks, and @nvanbenschoten)


-- commits line 12 at r1:
This is fine for read-write contention, but there may be a problem in the write-write contention case, especially if throughput is bottlenecked by the write-write contention (some txn is always waiting for a lock to be released). The sequence of transactions that write to a key probably does not have monotonic timestamps, so a successful write will be at a different timestamp than the txn's original timestamp, which defeats the simple-committed scheme. If we could tell that all the writes done by a txn to a range were at its commit timestamp, even if that differs from the txn's original timestamp, we could regain this optimization. Perhaps we can track this in the Transaction record in a loose manner (to bound the bytes needed for such tracking) -- it doesn't need to be persisted, so we won't be paying another consensus round.
