
kv: pipeline replicated lock acquisition #117978

Closed
nvanbenschoten opened this issue Jan 19, 2024 · 1 comment · Fixed by #121088
Assignees
Labels
A-kv-transactions Relating to MVCC and the transactional model. A-read-committed Related to the introduction of Read Committed branch-release-24.1 Used to mark GA and release blockers, technical advisories, and bugs for 24.1 C-performance Perf of queries or internals. Solution not expected to change functional behavior. GA-blocker T-kv KV Team

Comments

@nvanbenschoten
Member

nvanbenschoten commented Jan 19, 2024

Replicated lock acquisition currently replicates through Raft synchronously. This can impose a high latency on read committed transactions which perform a large number of SELECT FOR UPDATE operations or a large number of FK lookups.

This was discussed in the Read Committed RFC under the Row-Level Locks > Reliability and Enforcement section. That section also outlined a collection of possible optimizations that we could make to avoid this latency penalty.

The most promising of the optimizations is to pipeline this lock acquisition using the AsyncConsensus infrastructure. This will require client-side tracking and a new protocol to validate pipelined lock acquisition. It is analogous to our existing write intent pipelining approach.

Jira issue: CRDB-35440

@nvanbenschoten nvanbenschoten added C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-kv-transactions Relating to MVCC and the transactional model. T-kv KV Team A-read-committed Related to the introduction of Read Committed labels Jan 19, 2024
@arulajmani
Collaborator

@nvanbenschoten, @miraradeva and I just spoke about this issue at length. I'll capture some of this discussion below.

We should be able to use most of our existing infrastructure for write pipelining here. One tricky point is how this issue interacts with parallel commits. A transaction qualifies for a parallel commit if all in-flight intents in the last batch (the one which includes the EndTxn request) successfully replicate at a timestamp at or below the transaction's commit timestamp. When we kick off recovery for a STAGING transaction, and we find an intent did not successfully evaluate (yet), we also bump the timestamp cache to ensure it can't land in the future.

We discussed two options here:

  1. We would need something similar for locks -- that is, a way to check whether any locking requests in the last batch were successfully able to acquire replicated locks at or below the transaction's commit timestamp. This should be doable, given we store the entire TxnMeta for the lock. Additionally, if we found that a lock hadn't successfully been acquired, we would need some way to ensure it doesn't happen later. This would probably involve bumping the timestamp cache for replicated locking requests. The room had mixed to negative feelings about this special case handling for replicated locks.

  2. Disallow parallel commits entirely if there are locking requests in the EndTxn batch. This should be fine given we never expect SQL to construct a batch which contains both a locking read and EndTxn in it. Crucially, it allows us to side-step the concerns about replicated lock acquisition attempts coming along after the transaction has been recovered. This is because we know they've already been evaluated and proposed to Raft -- they can't be in-flight concurrent to the EndTxn batch. We're only validating whether the Raft proposal succeeded or not.

We settled on approach 2 as the way to go here.
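Approach 2 can be sketched as a simple eligibility check. The request descriptor and flag names below are hypothetical stand-ins for the real per-request flags in the KV API; only the decision logic is being illustrated.

```go
package main

import "fmt"

// req is a hypothetical request descriptor; the real flags live on each
// request type in the KV layer.
type req struct {
	isIntentWrite bool // point intent write (Put, ConditionalPut, Delete, ...)
	isLockingRead bool // replicated-locking Get/Scan/ReverseScan
	isEndTxn      bool
}

// canParallelCommit sketches option 2 from the discussion above: the final
// (EndTxn-carrying) batch qualifies for a parallel commit only if, besides
// the EndTxn, it contains nothing but point intent writes. Any replicated
// locking read in that batch disqualifies it outright.
func canParallelCommit(batch []req) bool {
	hasEndTxn := false
	for _, r := range batch {
		switch {
		case r.isEndTxn:
			hasEndTxn = true
		case r.isLockingRead:
			return false // option 2: never parallel commit with locking reads
		case !r.isIntentWrite:
			return false
		}
	}
	return hasEndTxn
}

func main() {
	writes := []req{{isIntentWrite: true}, {isEndTxn: true}}
	locking := []req{{isLockingRead: true}, {isEndTxn: true}}
	fmt.Println(canParallelCommit(writes), canParallelCommit(locking)) // true false
}
```

Since SQL never constructs a batch containing both a locking read and an EndTxn, this check should fire only in unexpected cases, making it a cheap way to side-step the recovery concerns.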


We also discussed #64723, and how it is tangentially related to the issue at hand. The reason we don't pipeline DeleteRange requests is that we don't know all the keys that'll be deleted before evaluation[*]. This means we can't construct the parallel commit condition if the last batch also includes a DeleteRange request. However, we're being overly aggressive here by disallowing pipelining of DeleteRange requests entirely -- if we follow the logic from point 2 above, we could pipeline batches containing DeleteRange requests as long as we don't try to perform a parallel commit when the last batch contains a DeleteRange request.

[*] The same logic will apply to Scan and ReverseScan requests, where we won't know all the locks that will be acquired. This means we're not gaining too much by trying to follow the approach described in option 1 above.


The other thing we noted on the call is there might be some trickiness involved with lock re-acquisitions. If we've got a pipelined replicated lock, and it's being re-acquired, we want the lock re-acquisition to ensure the pipelined lock replication was successful. We do something similar for intents when a key is being written to multiple times.

We noted that replicated lock re-acquisition can sometimes be no-ops, so we'll have to be careful to think through some of those cases.
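The re-acquisition concern above can be modeled with a small sketch. Names and the strength ordering are hypothetical; in the real code the `txnPipeliner` would chain a QueryIntent-style verification rather than return a boolean.

```go
package main

import "fmt"

// rank orders lock strengths so we can tell when a re-acquisition would be
// a no-op on the server (new strength not stronger than what is held).
// The ordering is illustrative.
var rank = map[string]int{"Shared": 1, "Exclusive": 2, "Intent": 3}

// reacquire decides, for a key with a pipelined in-flight lock at
// heldStrength, whether the re-acquisition must first verify the earlier
// replication, and whether the re-acquisition itself would be a no-op.
func reacquire(heldStrength, newStrength string) (mustVerify, noop bool) {
	// Even a no-op re-acquisition must prove the pipelined lock replicated:
	// a server-side no-op tells us nothing about whether the earlier
	// proposal actually made it through Raft.
	mustVerify = true
	noop = rank[newStrength] <= rank[heldStrength]
	return mustVerify, noop
}

func main() {
	v, n := reacquire("Exclusive", "Shared")
	fmt.Println(v, n) // true true: verify even though the acquisition is a no-op
}
```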

@nvanbenschoten, @miraradeva, feel free to add more details I may have missed here!

arulajmani added a commit to arulajmani/cockroach that referenced this issue Mar 5, 2024
Previously, ranged requests could not be pipelined. However, there is no
good reason to not allow them to be pipelined -- we just have to take
extra care to correctly update in-flight writes tracking on the response
path. We do so now.

As part of this patch, we introduce two new flags -- canPipeline and
canParallelCommit. We use these flags to determine whether batches can
be pipelined or committed using parallel commits. This is in contrast to
before, where we derived this information from other flags
(isIntentWrite, !isRange). This wasn't strictly necessary for this
change, but helps clean up the concepts.

As a consequence of this change, we now have a distinction between
requests that can be pipelined and requests that can be part of a batch
that can be committed in parallel. Notably, this applies to
DeleteRangeRequests -- they can be pipelined, but not committed in
parallel. That's because we need to have the entire write set upfront
when performing a parallel commit, lest we need to perform recovery --
we don't have this for DeleteRange requests.

In the future, we'll extend the concept of canPipeline
(and !canParallelCommit) to other locking ranged requests as well. In
particular, (replicated) locking {,Reverse}ScanRequests that want to
pipeline their lock acquisitions.

Closes cockroachdb#64723
Informs cockroachdb#117978

Release note: None
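The canPipeline / canParallelCommit split this commit introduces can be tabulated for a few request types. The table below is an illustrative sketch, not the authoritative flag definitions in the KV request code.

```go
package main

import "fmt"

// flags models the two new per-request capabilities described above.
type flags struct {
	canPipeline       bool
	canParallelCommit bool
}

// requestFlags is a hypothetical flag table. The key property: DeleteRange
// can be pipelined, but it can't participate in a parallel commit because
// its full key set is unknown before evaluation.
var requestFlags = map[string]flags{
	"Put":         {canPipeline: true, canParallelCommit: true},
	"Delete":      {canPipeline: true, canParallelCommit: true},
	"DeleteRange": {canPipeline: true, canParallelCommit: false},
	"EndTxn":      {canPipeline: false, canParallelCommit: false},
}

func main() {
	f := requestFlags["DeleteRange"]
	fmt.Println(f.canPipeline, f.canParallelCommit) // true false
}
```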
@nvanbenschoten nvanbenschoten added GA-blocker branch-release-24.1 Used to mark GA and release blockers, technical advisories, and bugs for 24.1 labels Mar 6, 2024
@arulajmani arulajmani self-assigned this Mar 13, 2024
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 25, 2024
Informs cockroachdb#117978.

This commit simplifies the logic in txnPipeliner.chainToInFlightWrites
which ensures that we only add a single QueryIntent request to the
BatchRequest per overlapping in-flight write. In doing so, it eliminates
an unnecessary hash-map by piggy-backing tracking onto the existing btree.
It also avoids making an assumption that we only track a single in-flight
write per key.

Release note: None
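The deduplication this commit describes -- at most one QueryIntent per overlapping in-flight write, even when a key is tracked more than once -- can be sketched as follows. The real code walks a btree of in-flight writes; a sorted slice stands in for it here, and the function name is hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// queryIntentsFor returns the keys needing a QueryIntent before a request
// over [spanStart, spanEnd) may proceed, emitting each key at most once
// even if multiple in-flight writes (e.g. different seq nums) share it.
func queryIntentsFor(inFlightKeys []string, spanStart, spanEnd string) []string {
	sort.Strings(inFlightKeys)
	seen := make(map[string]bool)
	var out []string
	for _, k := range inFlightKeys {
		if k >= spanStart && k < spanEnd && !seen[k] {
			seen[k] = true
			out = append(out, k)
		}
	}
	return out
}

func main() {
	keys := []string{"b", "b", "c", "e"} // "b" tracked twice (two seq nums)
	fmt.Println(queryIntentsFor(keys, "a", "d")) // [b c]
}
```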
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 25, 2024
Informs cockroachdb#117978.

This commit simplifies the logic in `inFlightWriteSet` to track
in-flight writes with the same key but different sequence numbers
separately. This simplification is done to avoid confusion around
in-flight writes with different seq nums and/or different strengths
(which will be added shortly), and whether any of these in-flight
writes should imply that the others are no longer in-flight.

Release note: None
@nvanbenschoten nvanbenschoten self-assigned this Mar 25, 2024
arulajmani added a commit to arulajmani/cockroach that referenced this issue Mar 25, 2024
Previously, ranged requests could not be pipelined. However, there is no
good reason to not allow them to be pipelined -- we just have to take
extra care to correctly update in-flight writes tracking on the response
path. We do so now.

As part of this patch, we introduce two new flags -- canPipeline and
canParallelCommit. We use these flags to determine whether batches can
be pipelined or committed using parallel commits. This is in contrast to
before, where we derived this information from other flags
(isIntentWrite, !isRange). This wasn't strictly necessary for this
change, but helps clean up the concepts.

As a consequence of this change, we now have a distinction between
requests that can be pipelined and requests that can be part of a batch
that can be committed in parallel. Notably, this applies to
DeleteRangeRequests -- they can be pipelined, but not committed in
parallel. That's because we need to have the entire write set upfront
when performing a parallel commit, lest we need to perform recovery --
we don't have this for DeleteRange requests.

In the future, we'll extend the concept of canPipeline
(and !canParallelCommit) to other locking ranged requests as well. In
particular, (replicated) locking {,Reverse}ScanRequests that want to
pipeline their lock acquisitions.

Closes cockroachdb#64723
Informs cockroachdb#117978

Release note: None
craig bot pushed a commit that referenced this issue Mar 26, 2024
119975: kv: allow DeleteRangeRequests to be pipelined  r=nvanbenschoten a=arulajmani

Previously, ranged requests could not be pipelined. However, there is no
good reason to not allow them to be pipelined -- we just have to take
extra care to correctly update in-flight writes tracking on the response
path. We do so now.

As part of this patch, we introduce two new flags -- canPipeline and
canParallelCommit. We use these flags to determine whether batches can
be pipelined or committed using parallel commits. This is in contrast to
before, where we derived this information from other flags
(isIntentWrite, !isRange). This wasn't strictly necessary for this
change, but helps clean up the concepts.

As a consequence of this change, we now have a distinction between
requests that can be pipelined and requests that can be part of a batch
that can be committed in parallel. Notably, this applies to
DeleteRangeRequests -- they can be pipelined, but not committed in
parallel. That's because we need to have the entire write set upfront
when performing a parallel commit, lest we need to perform recovery --
we don't have this for DeleteRange requests.

In the future, we'll extend the concept of canPipeline
(and !canParallelCommit) to other locking ranged requests as well. In
particular, (replicated) locking {,Reverse}ScanRequests that want to
pipeline their lock acquisitions.

Closes #64723
Informs #117978

Release note: None

120812:  changefeedccl: deflake TestAlterChangefeedAddTargetsDuringBackfill r=rharding6373 a=andyyang890

This patch deflakes `TestAlterChangefeedAddTargetsDuringBackfill`
by increasing the max batch size used for changefeed initial scans.
Previously, if we were unlucky, the batch sizes could be too small
leading to a timeout while waiting for the initial scan to complete.

Fixes #120744

Release note: None

121023: roachtest: add disk-stalled/wal-failover/among-stores test r=sumeerbhola a=jbowens

Introduce a new roachtest that simulates disk stalls on one store of a 3-node cluster with two stores per node, and the --wal-failover=among-stores configuration set. The WAL failover configuration should ensure the workload continues uninterrupted until it becomes blocked on disk reads.

Informs #119418.
Informs cockroachdb/pebble#3230
Epic: CRDB-35401

121073: master: Update pkg/testutils/release/cockroach_releases.yaml r=rail a=github-actions[bot]

Update pkg/testutils/release/cockroach_releases.yaml with recent values.

Epic: None
Release note: None
Release justification: test-only updates

Co-authored-by: Arul Ajmani <[email protected]>
Co-authored-by: Andy Yang <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
Co-authored-by: CRL Release bot <[email protected]>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 26, 2024
Informs cockroachdb#117978.

This commit updates the `txnPipeliner` to use the response keys from
`Get`, `Scan`, `ReverseScan`, and `DeleteRange` requests to track
pipelined and non-pipelined lock acquisitions / intent writes, instead
of assuming that the requests could have left intents anywhere in their
request span. This more precise tracking avoids broad ranged intent
resolution when narrower intent resolution is possible. It will also
be used by cockroachdb#117978 to track in-flight replicated lock acquisition.

This is a WIP. Some tests need to be updated.

Release note: None
craig bot pushed a commit that referenced this issue Mar 26, 2024
120865: workload/schemachange: correct error code for referencing dropping enum r=rafiss a=annrpom

This patch changes the error code expected for referencing an enum that's in the process of being dropped to the proper one.

Epic: none

Release note: None

121062: orchestration: released CockroachDB version 23.2.3. Next version: 23.2.4 r=yecs1999 a=cockroach-teamcity

Release note: None
Epic: None
Release justification: non-production (release infra) change.


121065: kv: prep in-flight write tracking for replicated locks r=nvanbenschoten a=nvanbenschoten

Informs #117978.

This PR includes a pair of simplifications to the `txnPipeliner` to make fewer assumptions about in-flight writes, as a way to prepare for #117978.

#### kv: remove chainedKeys hash-map from chainToInFlightWrites, simplify

This commit simplifies the logic in txnPipeliner.chainToInFlightWrites which ensures that we only add a single QueryIntent request to the BatchRequest per overlapping in-flight write. In doing so, it eliminates an unnecessary hash-map by piggy-backing tracking onto the existing btree. It also avoids making an assumption that we only track a single in-flight write per key.

#### kv: track in-flight writes with same key and different seq nums

This commit simplifies the logic in `inFlightWriteSet` to track in-flight writes with the same key but different sequence numbers separately. This simplification is done to avoid confusion around in-flight writes with different seq nums and/or different strengths (which will be added shortly), and whether any of these in-flight writes should imply that the others are no longer in-flight.

Release note: None

Co-authored-by: Annie Pompa <[email protected]>
Co-authored-by: Justin Beaver <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
arulajmani added a commit to arulajmani/cockroach that referenced this issue Mar 27, 2024
Previously, QueryIntent requests were only used to verify whether an
intent was successfully evaluated and replicated. This patch extends
QueryIntent request to also be able to verify whether a pipelined
shared or exclusive lock was successfully replicated or not.

Informs cockroachdb#117978

Release note: None
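The extended QueryIntent check this patch describes can be sketched as: verification succeeds when a replicated lock by the same transaction is found at the expected strength or stronger. The strength ordering and names below are illustrative, not the actual KV API.

```go
package main

import "fmt"

// strengthRank orders lock strengths for comparison; illustrative only.
var strengthRank = map[string]int{"Shared": 1, "Exclusive": 2, "Intent": 3}

// foundLock is what replicas actually hold for a key.
type foundLock struct {
	txnID    string
	strength string
}

// verifyAcquisition reports whether a pipelined acquisition at
// expectedStrength by txnID has durably replicated, generalizing the old
// intent-only check to shared and exclusive locks.
func verifyAcquisition(held []foundLock, txnID, expectedStrength string) bool {
	for _, l := range held {
		if l.txnID == txnID &&
			strengthRank[l.strength] >= strengthRank[expectedStrength] {
			return true
		}
	}
	return false
}

func main() {
	held := []foundLock{{txnID: "t1", strength: "Exclusive"}}
	fmt.Println(
		verifyAcquisition(held, "t1", "Shared"), // exclusive satisfies shared
		verifyAcquisition(held, "t1", "Intent"), // but not an intent write
	) // true false
}
```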
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 27, 2024
Informs cockroachdb#117978.

This commit updates the `txnPipeliner` to use the response keys from
`Get`, `Scan`, `ReverseScan`, and `DeleteRange` requests to track
pipelined and non-pipelined lock acquisitions / intent writes, instead
of assuming that the requests could have left intents anywhere in their
request span. This more precise tracking avoids broad ranged intent
resolution when narrower intent resolution is possible. It will also
be used by cockroachdb#117978 to track in-flight replicated lock acquisition.

Release note: None
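The response-key tracking this commit describes can be sketched as follows: rather than assuming a ranged locking request may have locked its entire request span, record only the keys actually returned in the response. Names are hypothetical; only the narrowing logic is illustrated.

```go
package main

import "fmt"

// trackedSpan is what the pipeliner records; end == "" denotes a point key.
type trackedSpan struct {
	start, end string
}

// lockSpansToTrack returns what should be recorded for a ranged locking
// request, given its request span and its response keys. A nil response-key
// slice models the old behavior of tracking the whole span.
func lockSpansToTrack(reqStart, reqEnd string, responseKeys []string) []trackedSpan {
	if responseKeys == nil {
		// Without response keys we must fall back to the entire request
		// span, forcing broad ranged intent resolution later.
		return []trackedSpan{{start: reqStart, end: reqEnd}}
	}
	spans := make([]trackedSpan, 0, len(responseKeys))
	for _, k := range responseKeys {
		spans = append(spans, trackedSpan{start: k}) // precise point tracking
	}
	return spans
}

func main() {
	fmt.Println(len(lockSpansToTrack("a", "z", []string{"c", "m"}))) // 2
	fmt.Println(len(lockSpansToTrack("a", "z", nil)))                // 1
}
```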
craig bot pushed a commit that referenced this issue Mar 27, 2024
121086: kv: use response keys of ranged writes for lock tracking r=nvanbenschoten a=nvanbenschoten

Informs #117978.

This PR updates the `txnPipeliner` to use the response keys from `Get`, `Scan`, `ReverseScan`, and `DeleteRange` requests to track pipelined and non-pipelined lock acquisitions / intent writes, instead of assuming that the requests could have left intents anywhere in their request span. This more precise tracking avoids broad ranged intent resolution when narrower intent resolution is possible. It will also be used by #117978 to track in-flight replicated lock acquisition.

This is a WIP. Some tests need to be updated.

Release note: None

121152: roachtest: adjust WAL failover disk stall roachtest's logging config r=sumeerbhola a=jbowens

The previous logging config outputted all logs to stderr too, which is not buffered and could become blocked on the stalled disk.

Epic: none
Release note: none

121156: sqlstats: fix reset in-memory sql stats on flush r=xinhaoz a=xinhaoz

After flushing in-memory sql stats to disk, we reset and prep each app container for reuse by:
- Decrementing the per-node fingerprint counter by the size of the app container. This counter prevents us from writing more sql stats when we reach the maximum amount of fingerprints stored in memory.
- Clearing the container and reducing its capacity to 1/2.

When introducing atomic flushing, we swapped the 2 ops above in the reset step, resulting in the decrement step being a no-op. The counter never resets, which eventually results in each attempt to write sql stats being throttled, which then also signals the sql stats flush worker.

Epic: none
Fixes: #121134

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
Co-authored-by: Xin Hao Zhang <[email protected]>
craig bot pushed a commit that referenced this issue Mar 27, 2024
119933: kv: add ability to verify pipelined replicated shared/exclusive locks  r=nvanbenschoten a=arulajmani

Previously, QueryIntent requests were only used to verify whether an
intent was successfully evaluated and replicated. This patch extends
QueryIntent request to also be able to verify whether a pipelined
shared or exclusive lock was successfully replicated or not.

Informs #117978

Release note: None

120787: ui: use max aggregator for commit latency on changefeed dashboard r=rharding6373 a=rharding6373

Previously, the commit latency in the changefeed dashboard in the db console would be aggregated by sum across all nodes. This was confusing for users who might see unexpectedly high commit latency.

In this change, we use max aggregation for the commit latency so that users see the max commit latency from all the nodes instead of the sum. This provides more useful observability into changefeed behavior.

Fixes: #119246
Fixes: #112947
Epic: None

Release note (ui change): The "Commit Latency" chart in the changefeed dashboard now aggregates by max instead of by sum for multi-node changefeeds. This more accurately reflects the amount of time for events to be acknowledged by the downstream sink.

121217: backupccl: fix data race with admission pacer r=msbutler a=aadityasondhi

We now use one pacer per fileSSTSink.

Fixes #121199.
Fixes #121202.
Fixes #121201.
Fixes #121200.
Fixes #121198.
Fixes #121197.
Fixes #121196.
Fixes #121195.
Fixes #121194.
Fixes #121193.
Fixes #121192.
Fixes #121191.
Fixes #121190.
Fixes #121189.
Fixes #121188.
Fixes #121187.

Release note: None

121222: optbuilder: fix recently introduced nil pointer in error case r=yuzefovich a=yuzefovich

This commit fixes a recently introduced nil pointer internal error when attempting to CALL something that is not a procedure and is specified by something other than its name. `ResolvableFunctionReference` might not have `ReferenceByName`, so this commit switches to using `FunctionReference`, which is always set.

Fixes: #121095.

Release note: None

Co-authored-by: Arul Ajmani <[email protected]>
Co-authored-by: rharding6373 <[email protected]>
Co-authored-by: Aaditya Sondhi <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 28, 2024
Fixes cockroachdb#117978.

TODO: Needs testing.

This commit completes the client-side handling of replicated lock acquisition
pipelining. Replicated lock acquisition through Get, Scan, and ReverseScan
requests now qualifies to be pipelined. The `txnPipeliner` is updated to track
the strength associated with each in-flight write and pass that along to the
corresponding QueryIntentRequest.

Release note: None
craig bot pushed a commit that referenced this issue Apr 3, 2024
121088: kv: pipeline replicated lock acquisition r=nvanbenschoten a=nvanbenschoten

Fixes #117978.

Builds upon the foundation laid in [#119975](#119975), [#119933](#119933), [#121065](#121065), and [#121086](#121086).

This commit completes the client-side handling of replicated lock acquisition pipelining. Replicated lock acquisition through `Get`, `Scan`, and `ReverseScan` requests now qualifies to be pipelined. The `txnPipeliner` is updated to track the strength associated with each in-flight write and pass that along to the corresponding `QueryIntentRequest`.

See [benchmark with TPC-C results here](#121088 (comment)).

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig craig bot closed this as completed in af394f7 Apr 3, 2024
nvanbenschoten added a commit that referenced this issue Apr 4, 2024
Fixes #117978.

This commit completes the client-side handling of replicated lock acquisition
pipelining. Replicated lock acquisition through Get, Scan, and ReverseScan
requests now qualifies to be pipelined. The `txnPipeliner` is updated to track
the strength associated with each in-flight write and pass that along to the
corresponding QueryIntentRequest.

Release note: None
@github-project-automation github-project-automation bot moved this to Incoming in KV Aug 28, 2024