-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kv: perform PushTxn(PUSH_TIMESTAMP) without Raft consensus #95911
kv: perform PushTxn(PUSH_TIMESTAMP) without Raft consensus #95911
Conversation
ea417d0
to
9128ec7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 9 of 9 files at r1, 3 of 3 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)
-- commits
line 6 at r1:
nit: "back to"
pkg/kv/kvserver/replica_tscache.go
line 410 at r1 (raw file):
// PushTxn(ABORT). As a result of this mechanism, a pusher transaction never // needs to create the transaction record for contending transactions that // it forms write-write conflicts with.
The phrasing here leaves the handling of write-read conflicts ambiguous. Consider adding a reference to MinTxnCommitTS
here, so that readers know where to look if they're curious about Txn record creation semantics when there's a write-read conflict.
pkg/kv/kvserver/replica_tscache.go
line 449 at r1 (raw file):
// v | v EndTxn(!STAGING) v v | // +--------------------+ or PushTxn(ABORT) +--------------------+ // | | then: txn.ts -> v1 | |
txn.ts -> v1
should only happens when EndTxn(!Staging)
, right? It doesn't make sense in the PushTxn(ABORT)
case -- should we split these cases out now?
This commit moves the point when a transaction checks the timestamp cache for its minimum commit timestamp from transaction record creation time back to commit time. This allows us to use the timestamp cache to communicate successful PushTxn(TIMESTAMP) to a pushee with an existing record without re-writing its transaction record. See the changes to the state machine diagram attached to `Replica.CanCreateTxnRecord` for a visual depiction of this change.
This commit simplifies logic in PushTxnRequest that promoted a PUSH_TIMESTAMP to a PUSH_ABORT when it found a STAGING transaction record which it knew to be part of a failed parallel commit attempt. The logic was trying to be smart and minimize the cases where it needed to promote a PUSH_TIMESTAMP to a PUSH_ABORT. It was avoiding doing so if it had previously found an intent with a higher epoch. In practice, this optimization doesn't seem to matter. It was also making logic in a following commit harder to write because it was preserving cases where a PUSH_TIMESTAMP would succeed against a STAGING transaction record. We don't want to support such state transitions, so eliminate them.
Fixes cockroachdb#94728. With the previous commit, transactions will check the timestamp cache before committing to determine whether they have had their commit timestamp pushed. This commit exploits this to avoid ever rewriting a transaction's record on a timestamp push. Instead, the timestamp cache is used, regardless of whether the record already existed or not. Doing so avoids consensus.
9128ec7
to
3dbb321
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR!
bors r=arulajmani
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @arulajmani)
Previously, arulajmani (Arul Ajmani) wrote…
nit: "back to"
Done.
pkg/kv/kvserver/replica_tscache.go
line 410 at r1 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
The phrasing here leaves the handling of write-read conflicts ambiguous. Consider adding a reference to
MinTxnCommitTS
here, so that readers know where to look if they're curious about Txn record creation semantics when there's a write-read conflict.
Good idea, done.
pkg/kv/kvserver/replica_tscache.go
line 449 at r1 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
txn.ts -> v1
should only happens whenEndTxn(!Staging)
, right? It doesn't make sense in thePushTxn(ABORT)
case -- should we split these cases out now?
Yeah, you're correct about that, good catch. We could forward the record's WriteTimestamp
to MinTxnCommitTS()
when aborting the txn on a PushTxn(ABORT), but we don't because the write timestamp is meaningless once the txn is aborted. I'm going to leave this as is out of a desire not to make this diagram any more crowded and harder to read than it already is while acknowledging that it is not entirely accurate in this specific case.
Build failed (retrying...): |
Build failed (retrying...): |
Build succeeded: |
This PR contains a sequence of three commits that combine to resolve #94728.
check txn push marker on commit, not txn record creation
The first commit moves the point when a transaction checks the timestamp cache for its minimum commit timestamp from transaction record creation time back to commit time. This allows us to use the timestamp cache to communicate successful
PushTxn(TIMESTAMP)
to a pushee with an existing record without rewriting its transaction record.For details, see the changes to the state machine diagram attached to
Replica.CanCreateTxnRecord
for a visual depiction of this change.always promote PUSH_TIMESTAMP to PUSH_ABORT on failed staging record
The second commit simplifies logic in PushTxnRequest that promoted a
PUSH_TIMESTAMP
to aPUSH_ABORT
when itfound a STAGING transaction record that it knew to be part of a failed parallel commit attempt.
The logic tried to be smart and minimize the cases where it needed to promote a
PUSH_TIMESTAMP
to aPUSH_ABORT
. It was avoiding doing so if it had previously found an intent with a higher epoch. In practice, this optimization doesn't seem to matter. It was also making logic in a following commit harder to write because it was preserving cases where aPUSH_TIMESTAMP
would succeed against a STAGING transaction record. We don't want to support such state transitions, so eliminate them.don't rewrite txn record on PushTxn(TIMESTAMP)
With the previous two commits, transactions will check the timestamp cache before committing to determine whether they have had their commit timestamp pushed. The final commit exploits this to avoid ever rewriting a transaction's record on a timestamp push. Instead, the timestamp cache is used, regardless of whether the record already existed or not. Doing so avoids consensus.
Release note: None