raftlog: introduce EntryEncoding{Standard,Sideloaded}WithAC #95748

Merged (1 commit) on Jan 26, 2023

Conversation

@irfansharif (Contributor) commented on Jan 24, 2023

Part of #95563. Predecessor to #95637.

This commit introduces two new encodings for raft log entries, EntryEncoding{Standard,Sideloaded}WithAC. Raft log entries have a prefix byte that informs decoding routines how to interpret the subsequent bytes. To date we've had two, EntryEncoding{Standard,Sideloaded}[^1], to indicate whether the entry came with sideloaded data[^2]. Our two additions here will be used to indicate whether the particular entry is subject to replication admission control. If so, right as we persist entries into the raft log storage, we'll "admit the work without blocking", which is further explained in #95637.
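
For concreteness, here's a minimal sketch of the prefix-byte scheme described above. The encoding names match the PR, but the byte values and helpers are illustrative, not the actual pkg/kv/kvserver/raftlog code:

```go
package raftlogsketch // illustrative only; not the real raftlog package

import "fmt"

// EntryEncoding mirrors the idea of a prefix byte that tells decoding
// routines how to interpret the rest of a raft entry. The byte values
// here are made up for the sketch.
type EntryEncoding byte

const (
	EntryEncodingStandardWithoutAC   EntryEncoding = 0x00
	EntryEncodingSideloadedWithoutAC EntryEncoding = 0x01
	EntryEncodingStandardWithAC      EntryEncoding = 0x02
	EntryEncodingSideloadedWithAC    EntryEncoding = 0x03
)

// EncodingOf peeks at the first byte of an encoded entry to decide how
// to interpret the subsequent bytes.
func EncodingOf(data []byte) (EntryEncoding, error) {
	if len(data) == 0 {
		return 0, fmt.Errorf("empty raft entry")
	}
	enc := EntryEncoding(data[0])
	if enc > EntryEncodingSideloadedWithAC {
		return 0, fmt.Errorf("unrecognized prefix byte %x", data[0])
	}
	return enc, nil
}

// UsesAdmissionControl reports whether the entry is subject to
// replication admission control.
func (e EntryEncoding) UsesAdmissionControl() bool {
	return e == EntryEncodingStandardWithAC || e == EntryEncodingSideloadedWithAC
}
```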

The decision to use replication admission control happens above raft, on a per-entry basis. If using replication admission control, AC-specific metadata will be plumbed down as part of the marshaled raft command. This too is explained in #95637, specifically in the 'RaftAdmissionMeta' section. This commit then adds an unused version gate (V23_1UseEncodingWithBelowRaftAdmissionData) for the use of replication admission control. Since we're using a different prefix byte for raft commands, one not recognized in earlier CRDB versions, we need explicit versioning. We add the gate now out of development convenience -- version gate additions are especially prone to merge conflicts. We expect to use it shortly, before the alpha/beta cuts.
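
For illustration, gating the new encoding might look like the sketch below. The gate name is the one added in this PR; the surrounding function is a hypothetical call site, though st.Version.IsActive is CRDB's standard version-gate check:

```go
import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/clusterversion"
	"github.com/cockroachdb/cockroach/pkg/settings/cluster"
)

// useRaftAdmissionMetaEncoding is a hypothetical call site: emit the
// new AC-aware prefix byte only once the whole cluster can decode it.
func useRaftAdmissionMetaEncoding(ctx context.Context, st *cluster.Settings) bool {
	return st.Version.IsActive(
		ctx, clusterversion.V23_1UseEncodingWithBelowRaftAdmissionData,
	)
}
```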

Release note: None

Footnotes

[^1]: Now renamed to EntryEncoding{Standard,Sideloaded}WithoutAC.
[^2]: These are typically AddSSTs, the storage for which is treated differently for performance reasons.

@irfansharif requested review from tbg, sumeerbhola and a team on January 24, 2023 14:50
@cockroach-teamcity (Member) commented: This change is Reviewable

@irfansharif force-pushed the 230124.raftlog-encodings branch from 10ec0cd to 4bcb8c7 on January 24, 2023 14:52
@irfansharif force-pushed the 230124.raftlog-encodings branch from 4bcb8c7 to 561b905 on January 25, 2023 06:01
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Jan 25, 2023
This test started failing for commits that introduced additional version
gates, like in cockroachdb#95748. It's because we were not overriding the binary
version the server started off with (it defaults to the last added
version). It failed with:

  Error Trace:	pkg/upgrade/upgrades/helpers_test.go:67
                pkg/upgrade/upgrades/helpers_test.go:50
                pkg/upgrade/upgrades/key_visualizer_migration_test.go:48
  Error:      	pq: versions cannot be downgraded (attempting to downgrade from 1000022.2-34 to 1000022.2-32)

Release note: None
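
The fix boils down to pinning the test server's binary version. A hedged sketch of that pattern follows; BinaryVersionOverride and DisableAutomaticVersionUpgrade are real server testing knobs, but their exact shapes have shifted across releases, so treat this as illustrative:

```go
import (
	"github.com/cockroachdb/cockroach/pkg/base"
	"github.com/cockroachdb/cockroach/pkg/clusterversion"
	"github.com/cockroachdb/cockroach/pkg/server"
)

// testServerArgsAtV22_2 pins the test server to the 22.2 version so
// that newly added version gates don't change what the server boots
// at. Knob shapes follow this era of the codebase.
func testServerArgsAtV22_2() base.TestServerArgs {
	return base.TestServerArgs{
		Knobs: base.TestingKnobs{
			Server: &server.TestingKnobs{
				// Don't let the server auto-upgrade past the override.
				DisableAutomaticVersionUpgrade: make(chan struct{}),
				BinaryVersionOverride:          clusterversion.ByKey(clusterversion.V22_2),
			},
		},
	}
}
```
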
@irfansharif force-pushed the 230124.raftlog-encodings branch from 561b905 to 7913c55 on January 25, 2023 13:02
@irfansharif requested a review from a team on January 25, 2023 13:02
@tbg (Member) left a comment:

LGTM mod the open comments.

Review threads (resolved) on:
- pkg/clusterversion/cockroach_versions.go
- pkg/kv/kvserver/raftlog/encoding.go
Part of cockroachdb#95563. Predecessor to cockroachdb#95637.

This commit introduces two new encodings for raft log entries,
EntryEncoding{Standard,Sideloaded}WithAC. Raft log entries have a
prefix byte that informs decoding routines how to interpret the
subsequent bytes. To date we've had two,
EntryEncoding{Standard,Sideloaded}[^1], to indicate whether the entry
came with sideloaded data[^2]. Our two additions here will be used to
indicate whether the particular entry is subject to replication
admission control. If so, right as we persist entries into the raft
log storage, we'll "admit the work without blocking", which is further
explained in cockroachdb#95637.

The decision to use replication admission control happens above raft,
on a per-entry basis. If using replication admission control,
AC-specific metadata will be plumbed down as part of the marshaled
raft command. This too is explained in cockroachdb#95637, specifically
in the 'RaftAdmissionMeta' section. When using these encodings in the
future, we'll need to tie them to a version gate since we're using a
new prefix byte for raft commands, one that's not recognized in
earlier CRDB versions.

[^1]: Now renamed to EntryEncoding{Standard,Sideloaded}WithoutAC.
[^2]: These are typically AddSSTs, the storage for which is treated
      differently for performance reasons.

Release note: None
@irfansharif force-pushed the 230124.raftlog-encodings branch from 7913c55 to 4df47f5 on January 26, 2023 17:41
@irfansharif (Contributor, Author) left a comment:

TFTR, removed the version gate + responded in line.

bors r+

@craig (bot) commented on Jan 26, 2023:

Build succeeded.

@craig (bot) merged commit 705d6a1 into cockroachdb:master on Jan 26, 2023
@irfansharif deleted the 230124.raftlog-encodings branch on January 26, 2023 19:03
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Feb 25, 2023
Part of cockroachdb#95563. For end-to-end flow control of replicated writes, we
want to enable below-raft admission control through the following API on
kvadmission.Controller:

  // AdmitRaftEntry informs admission control of a raft log entry being
  // written to storage (for the given tenant, the specific range, and
  // on the named store).
  AdmitRaftEntry(
    context.Context, roachpb.TenantID,
    roachpb.StoreID, roachpb.RangeID, raftpb.Entry,
  )

This serves as the integration point for log entries received below
raft, right as they're being written to stable storage. It's a
non-blocking interface since we're below-raft and in the raft.Ready()
loop. What it effectively does is enqueue a "virtual" work item in the
underlying StoreWorkQueue mediating all store IO. This virtual work
item is what later gets dequeued once the IO granter informs the work
queue of newly available IO tokens. When enqueueing the virtual work
item, we still update the StoreWorkQueue's physically-accounted-for
bytes since the actual write is not deferred; timely statistic updates
improve accuracy for the underlying linear models that map
accounted-for writes to observed L0 growth, which in turn inform IO
token generation rates.
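
A rough sketch of that non-blocking shape, with hypothetical local types standing in for the real StoreWorkQueue internals:

```go
// Hypothetical sketch: below-raft admission enqueues a "virtual" work
// item and returns immediately; it never blocks the raft.Ready() loop.
type virtualWork struct {
	tokens int64 // size of the write, known exactly below raft
	// identity fields (tenant, range, log position) elided
}

type storeQueueSketch struct {
	waiting       []virtualWork // a tenant/priority heap in the real queue
	physicalBytes int64         // feeds the linear models immediately
}

func (q *storeQueueSketch) admitNonBlocking(w virtualWork) {
	// The physical write isn't deferred, so account for its bytes now;
	// timely stats keep the write-to-L0-growth models accurate.
	q.physicalBytes += w.tokens
	q.waiting = append(q.waiting, w) // dequeued later, at grant time
}
```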

For each of the arguments above:
- The roachpb.TenantID is plumbed to find the right tenant heap to queue
  it under (for inter-tenant isolation).
- The roachpb.StoreID is used to find the right store work queue on
  multi-store nodes. We'll also use the StoreID when informing the
  origin node of log entries being admitted[^1].
- We pass in the roachpb.RangeID on behalf of which work is being
  admitted. This, alongside the raftpb.Entry.{Term,Index} for the
  replicated write, uniquely identifies where the write is to end up.
  We use these identifiers to:
  - Return flow tokens on the origin node[^1][^2].
  - Order work in the WorkQueue -- for replicated writes below-raft,
    we ignore CreateTime/epoch-LIFO and instead sort by priority and,
    within a priority, by log position (see the ordering sketch after
    this list).
- For standard work queue ordering, our work item needs to include the
  CreateTime and AdmissionPriority, details that are passed down using
  dedicated raft log entry encodings[^3][^4] as part of the raftpb.Entry
  parameter above.
  - Since the raftpb.Entry encodes within it its origin node[^4], it
    will be used post-admission to dispatch flow tokens to the right
    node. This integration is left to future PRs.
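
The ordering rule from the list above, sketched as an illustrative comparator (simplified types, not the actual WorkQueue code):

```go
// Illustrative below-raft ordering for replicated writes: sort by
// admission priority and, within a priority, by raft log position.
// CreateTime/epoch-LIFO is deliberately not consulted here.
type replicatedWork struct {
	pri   int8   // admissionpb.WorkPriority is an int8
	term  uint64 // raft log position is (term, index)
	index uint64
}

// admitBefore reports whether a should be admitted before b.
func admitBefore(a, b replicatedWork) bool {
	if a.pri != b.pri {
		return a.pri > b.pri // higher priority first
	}
	if a.term != b.term {
		return a.term < b.term // then in log order
	}
	return a.index < b.index
}
```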

We use the above to populate the following fields on a per-work basis
for replicated writes:

    // ReplicatedWorkInfo groups everything needed to admit replicated
    // writes, done so asynchronously below-raft as part of replication
    // admission control.
    type ReplicatedWorkInfo struct {
      RangeID roachpb.RangeID
      Origin roachpb.NodeID
      LogPosition LogPosition
      Ingested bool
    }
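
As a sketch, populating those fields from a raft entry might look like the following; decodeRaftAdmissionMeta is a hypothetical stand-in for the real decoding path, and LogPosition's fields are assumed from the struct above:

```go
// Hypothetical sketch: derive ReplicatedWorkInfo from an entry.
func makeReplicatedWorkInfo(
	rangeID roachpb.RangeID, entry raftpb.Entry, sideloaded bool,
) ReplicatedWorkInfo {
	meta := decodeRaftAdmissionMeta(entry.Data) // hypothetical helper
	return ReplicatedWorkInfo{
		RangeID:     rangeID,
		Origin:      roachpb.NodeID(meta.AdmissionOriginNode),
		LogPosition: LogPosition{Term: entry.Term, Index: entry.Index},
		// Sideloaded entries (AddSSTs) are ingested rather than written
		// through the memtable, which token accounting treats differently.
		Ingested: sideloaded,
	}
}
```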

Since admission is happening below-raft, where the size of the write is
known, we no longer need per-work estimates for upfront IO token
deductions. Since admission is asynchronous, we also don't use the
AdmittedWorkDone interface, which was used to make token adjustments
(without blocking) given the upfront estimates. We still want to
intercept the exact point when some write work gets admitted in order
to inform the origin node so it can release flow tokens. We do so
through the following interface, satisfied by the StoreWorkQueue:

  // onAdmittedReplicatedWork is used to intercept the
  // point-of-admission for replicated writes.
  type onAdmittedReplicatedWork interface {
    admittedReplicatedWork(
      tenantID roachpb.TenantID,
      pri admissionpb.WorkPriority,
      rwi ReplicatedWorkInfo,
      requestedTokens int64,
    )
  }
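
For illustration only, a minimal implementer of that interface might look like this (the real StoreWorkQueue satisfies the interface itself and invokes it right as a virtual work item is dequeued and its tokens deducted; CRDB's util/log and the types above are assumed):

```go
// loggingInterceptor is a hypothetical implementer, invoked at the
// exact point of admission for a replicated write.
type loggingInterceptor struct{}

func (loggingInterceptor) admittedReplicatedWork(
	tenantID roachpb.TenantID,
	pri admissionpb.WorkPriority,
	rwi ReplicatedWorkInfo,
	requestedTokens int64,
) {
	log.Infof(context.Background(),
		"t%v: admitted write for r%v (origin n%v) at %v pri=%v tokens=%d",
		tenantID, rwi.RangeID, rwi.Origin, rwi.LogPosition, pri, requestedTokens)
	// A real implementation would dispatch flow-token returns to
	// rwi.Origin here (see kvflowcontrolpb.AdmittedRaftLogEntries).
}
```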

[^1]: See kvflowcontrolpb.AdmittedRaftLogEntries introduced in cockroachdb#95637.
[^2]: See kvflowcontrol.Handle.{ReturnTokensUpto,DeductTokensFor}
      introduced in cockroachdb#95637. Token deductions and returns are tied to
      raft log positions.
[^3]: See raftlog.EntryEncoding{Standard,Sideloaded}WithAC introduced in
      cockroachdb#95748.
[^4]: See kvflowcontrolpb.RaftAdmissionMeta introduced in cockroachdb#95637.
        message RaftAdmissionMeta {
          int32 admission_priority = ...;
          int64 admission_create_time = ...;
          int32 admission_origin_node = ...;
        }

Release note: None
craig bot pushed a commit that referenced this pull request Mar 13, 2023
97599: admission: support non-blocking {Store,}WorkQueue.Admit() r=irfansharif a=irfansharif

Part of #95563. For end-to-end flow control of replicated writes, we want to enable below-raft admission control through the following API on kvadmission.Controller:
```go
  // AdmitRaftEntry informs admission control of a raft log entry being
  // written to storage (for the given tenant, the specific range, and
  // on the named store).
  AdmitRaftEntry(
    context.Context, roachpb.TenantID,
    roachpb.StoreID, roachpb.RangeID, raftpb.Entry,
  )
```
This serves as the integration point for log entries received below raft, right as they're being written to stable storage. It's a non-blocking interface since we're below-raft and in the raft.Ready() loop. What it effectively does is enqueue a "virtual" work item in the underlying StoreWorkQueue mediating all store IO. This virtual work item is what later gets dequeued once the IO granter informs the work queue of newly available IO tokens. When enqueueing the virtual work item, we still update the StoreWorkQueue's physically-accounted-for bytes since the actual write is not deferred; timely statistic updates improve accuracy for the underlying linear models that map accounted-for writes to observed L0 growth, which in turn inform IO token generation rates.

For each of the arguments above:
- The roachpb.TenantID is plumbed to find the right tenant heap to queue it under (for inter-tenant isolation).
- The roachpb.StoreID is used to find the right store work queue on multi-store nodes. We'll also use the StoreID when informing the origin node of log entries being admitted[^1].
- We pass in the roachpb.RangeID on behalf of which work is being admitted. This, alongside the raftpb.Entry.{Term,Index} for the replicated write, uniquely identifies where the write is to end up. We use these identifiers to return flow tokens on the origin node[^1][^2].
- For standard work queue ordering, our work item needs to include the CreateTime and AdmissionPriority, details that are passed down using dedicated raft log entry encodings[^3][^4] as part of the raftpb.Entry parameter above.
  - Since the raftpb.Entry encodes within it its origin node[^4], it will be used post-admission to dispatch flow tokens to the right node. This integration is left to future PRs.

We use the above to populate the following fields on a per-work basis for replicated writes:
```go
    // ReplicatedWorkInfo groups everything needed to admit replicated
    // writes, done so asynchronously below-raft as part of replication
    // admission control.
    type ReplicatedWorkInfo struct {
      RangeID roachpb.RangeID
      Origin roachpb.NodeID
      LogPosition LogPosition
      Ingested bool
    }
```
Since admission is happening below-raft, where the size of the write is known, we no longer need per-work estimates for upfront IO token deductions. Since admission is asynchronous, we also don't use the AdmittedWorkDone interface, which was used to make token adjustments (without blocking) given the upfront estimates. We still want to intercept the exact point when some write work gets admitted in order to inform the origin node so it can release flow tokens. We do so through the following interface, satisfied by the StoreWorkQueue:
```go
  // onAdmittedReplicatedWork is used to intercept the
  // point-of-admission for replicated writes.
  type onAdmittedReplicatedWork interface {
    admittedReplicatedWork(
      tenantID roachpb.TenantID,
      pri admissionpb.WorkPriority,
      rwi ReplicatedWorkInfo,
      requestedTokens int64,
    )
  }
```

[^1]: See kvflowcontrolpb.AdmittedRaftLogEntries introduced in #95637.
[^2]: See kvflowcontrol.Handle.{ReturnTokensUpto,DeductTokensFor} introduced in #95637. Token deductions and returns are tied to raft log positions.
[^3]: See raftlog.EntryEncoding{Standard,Sideloaded}WithAC introduced in #95748.
[^4]: See kvflowcontrolpb.RaftAdmissionMeta introduced in #95637.

Release note: None


98419: clusterversion: add a gate for new system privileges r=jayshrivastava a=rafiss

A 22.2/23.1 mixed version cluster cannot handle new system privileges well. This commit gates their usage and adds a test.

Without this gate, the included test would fail and users would not be able to log in to nodes running on the old binary.

Epic: None
Release note: None

98495: settingswatcher: version guard support for clusters bootstrapped at old versions r=JeffSwenson a=JeffSwenson

When a cluster is bootstrapping, the SQL server is initialized before the cluster version is populated in the DB. Previously, the version guard utility was unable to handle this state if the version was older than the maxVersion used to initialize the version guard. Now, the versionGuard handles this bootstrapping state by falling back on the in-memory cluster version.

Part of #94843

Release note: none

Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Jeff <[email protected]>