raftlog: introduce EntryEncoding{Standard,Sideloaded}WithAC #95748

Merged (1 commit) on Jan 26, 2023

Conversation

@irfansharif (Contributor) commented on Jan 24, 2023

Part of #95563. Predecessor to #95637.

This commit introduces two new encodings for raft log entries, EntryEncoding{Standard,Sideloaded}WithAC. Raft log entries have a prefix byte that informs decoding routines how to interpret the subsequent bytes. To date we've had two, EntryEncoding{Standard,Sideloaded}[^1], to indicate whether the entry came with sideloaded data[^2]. Our two additions here will be used to indicate whether the particular entry is subject to replication admission control. If so, right as we persist entries into the raft log storage, we'll "admit the work without blocking", which is further explained in #95637.
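
For concreteness, here's a minimal sketch of the prefix-byte scheme described above. The encoding names match the PR, but the byte values and helpers are illustrative, not the actual pkg/kv/kvserver/raftlog code:

```go
package raftlogsketch // illustrative only; not the real raftlog package

import "fmt"

// EntryEncoding mirrors the idea of a prefix byte that tells decoding
// routines how to interpret the rest of a raft entry. The byte values
// here are made up for the sketch.
type EntryEncoding byte

const (
	EntryEncodingStandardWithoutAC   EntryEncoding = 0x00
	EntryEncodingSideloadedWithoutAC EntryEncoding = 0x01
	EntryEncodingStandardWithAC      EntryEncoding = 0x02
	EntryEncodingSideloadedWithAC    EntryEncoding = 0x03
)

// EncodingOf peeks at the first byte of an encoded entry to decide how
// to interpret the subsequent bytes.
func EncodingOf(data []byte) (EntryEncoding, error) {
	if len(data) == 0 {
		return 0, fmt.Errorf("empty raft entry")
	}
	enc := EntryEncoding(data[0])
	if enc > EntryEncodingSideloadedWithAC {
		return 0, fmt.Errorf("unrecognized prefix byte %x", data[0])
	}
	return enc, nil
}

// UsesAdmissionControl reports whether the entry is subject to
// replication admission control.
func (e EntryEncoding) UsesAdmissionControl() bool {
	return e == EntryEncodingStandardWithAC || e == EntryEncodingSideloadedWithAC
}
```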

The decision to use replication admission control happens above raft, on a per-entry basis. If using replication admission control, AC-specific metadata will be plumbed down as part of the marshaled raft command. This too is explained in #95637, specifically in the 'RaftAdmissionMeta' section. This commit then adds an unused version gate (V23_1UseEncodingWithBelowRaftAdmissionData) for the use of replication admission control. Since we're using a different prefix byte for raft commands, one not recognized in earlier CRDB versions, we need explicit versioning. We add the gate now out of development convenience -- version gate additions are especially prone to merge conflicts. We expect to use it shortly, before the alpha/beta cuts.
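
For illustration, gating the new encoding might look like the sketch below. The gate name is the one added in this PR; the surrounding function is a hypothetical call site, though st.Version.IsActive is CRDB's standard version-gate check:

```go
import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/clusterversion"
	"github.com/cockroachdb/cockroach/pkg/settings/cluster"
)

// useRaftAdmissionMetaEncoding is a hypothetical call site: emit the
// new AC-aware prefix byte only once the whole cluster can decode it.
func useRaftAdmissionMetaEncoding(ctx context.Context, st *cluster.Settings) bool {
	return st.Version.IsActive(
		ctx, clusterversion.V23_1UseEncodingWithBelowRaftAdmissionData,
	)
}
```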

Release note: None

Footnotes

[^1]: Now renamed to EntryEncoding{Standard,Sideloaded}WithoutAC.
[^2]: These are typically AddSSTs, the storage for which is treated differently for performance reasons.

@irfansharif requested review from tbg, sumeerbhola and a team on January 24, 2023 14:50
@cockroach-teamcity (Member) commented: This change is Reviewable

@irfansharif force-pushed the 230124.raftlog-encodings branch from 10ec0cd to 4bcb8c7 on January 24, 2023 14:52
@irfansharif force-pushed the 230124.raftlog-encodings branch from 4bcb8c7 to 561b905 on January 25, 2023 06:01
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Jan 25, 2023
This test started failing for commits that introduced additional version
gates, like in cockroachdb#95748. It's because we were not overriding the binary
version the server started off with (it defaults to the last added
version). It failed with:

  Error Trace:	pkg/upgrade/upgrades/helpers_test.go:67
                pkg/upgrade/upgrades/helpers_test.go:50
                pkg/upgrade/upgrades/key_visualizer_migration_test.go:48
  Error:      	pq: versions cannot be downgraded (attempting to downgrade from 1000022.2-34 to 1000022.2-32)

Release note: None
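
The fix boils down to pinning the test server's binary version. A hedged sketch of that pattern follows; BinaryVersionOverride and DisableAutomaticVersionUpgrade are real server testing knobs, but their exact shapes have shifted across releases, so treat this as illustrative:

```go
import (
	"github.com/cockroachdb/cockroach/pkg/base"
	"github.com/cockroachdb/cockroach/pkg/clusterversion"
	"github.com/cockroachdb/cockroach/pkg/server"
)

// testServerArgsAtV22_2 pins the test server to the 22.2 version so
// that newly added version gates don't change what the server boots
// at. Knob shapes follow this era of the codebase.
func testServerArgsAtV22_2() base.TestServerArgs {
	return base.TestServerArgs{
		Knobs: base.TestingKnobs{
			Server: &server.TestingKnobs{
				// Don't let the server auto-upgrade past the override.
				DisableAutomaticVersionUpgrade: make(chan struct{}),
				BinaryVersionOverride:          clusterversion.ByKey(clusterversion.V22_2),
			},
		},
	}
}
```
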
@irfansharif force-pushed the 230124.raftlog-encodings branch from 561b905 to 7913c55 on January 25, 2023 13:02
@irfansharif requested a review from a team on January 25, 2023 13:02
@tbg (Member) left a comment:

LGTM mod the open comments.

Review threads (resolved) on:
- pkg/clusterversion/cockroach_versions.go
- pkg/kv/kvserver/raftlog/encoding.go
Part of cockroachdb#95563. Predecessor to cockroachdb#95637.

This commit introduces two new encodings for raft log entries,
EntryEncoding{Standard,Sideloaded}WithAC. Raft log entries have a
prefix byte that informs decoding routines how to interpret the
subsequent bytes. To date we've had two,
EntryEncoding{Standard,Sideloaded}[^1], to indicate whether the entry
came with sideloaded data[^2]. Our two additions here will be used to
indicate whether the particular entry is subject to replication
admission control. If so, right as we persist entries into the raft
log storage, we'll "admit the work without blocking", which is further
explained in cockroachdb#95637.

The decision to use replication admission control happens above raft,
on a per-entry basis. If using replication admission control,
AC-specific metadata will be plumbed down as part of the marshaled
raft command. This too is explained in cockroachdb#95637, specifically
in the 'RaftAdmissionMeta' section. When using these encodings in the
future, we'll need to tie them to a version gate since we're using a
new prefix byte for raft commands, one that's not recognized in
earlier CRDB versions.

[^1]: Now renamed to EntryEncoding{Standard,Sideloaded}WithoutAC.
[^2]: These are typically AddSSTs, the storage for which is treated
      differently for performance reasons.

Release note: None
@irfansharif force-pushed the 230124.raftlog-encodings branch from 7913c55 to 4df47f5 on January 26, 2023 17:41
@irfansharif (Contributor, Author) left a comment:

TFTR, removed the version gate + responded in line.

bors r+

@craig (bot) commented on Jan 26, 2023:

Build succeeded.

@craig (bot) merged commit 705d6a1 into cockroachdb:master on Jan 26, 2023
@irfansharif deleted the 230124.raftlog-encodings branch on January 26, 2023 19:03
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Feb 25, 2023
Part of cockroachdb#95563. For end-to-end flow control of replicated writes, we
want to enable below-raft admission control through the following API on
kvadmission.Controller:

  // AdmitRaftEntry informs admission control of a raft log entry being
  // written to storage (for the given tenant, the specific range, and
  // on the named store).
  AdmitRaftEntry(
    context.Context, roachpb.TenantID,
    roachpb.StoreID, roachpb.RangeID, raftpb.Entry,
  )

This serves as the integration point for log entries received below
raft, right as they're being written to stable storage. It's a
non-blocking interface since we're below-raft and in the raft.Ready()
loop. What it effectively does is enqueue a "virtual" work item in the
underlying StoreWorkQueue mediating all store IO. This virtual work
item is what later gets dequeued once the IO granter informs the work
queue of newly available IO tokens. When enqueueing the virtual work
item, we still update the StoreWorkQueue's physically-accounted-for
bytes since the actual write is not deferred; timely statistic updates
improve accuracy for the underlying linear models that map
accounted-for writes to observed L0 growth, which in turn inform IO
token generation rates.
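
A rough sketch of that non-blocking shape, with hypothetical local types standing in for the real StoreWorkQueue internals:

```go
// Hypothetical sketch: below-raft admission enqueues a "virtual" work
// item and returns immediately; it never blocks the raft.Ready() loop.
type virtualWork struct {
	tokens int64 // size of the write, known exactly below raft
	// identity fields (tenant, range, log position) elided
}

type storeQueueSketch struct {
	waiting       []virtualWork // a tenant/priority heap in the real queue
	physicalBytes int64         // feeds the linear models immediately
}

func (q *storeQueueSketch) admitNonBlocking(w virtualWork) {
	// The physical write isn't deferred, so account for its bytes now;
	// timely stats keep the write-to-L0-growth models accurate.
	q.physicalBytes += w.tokens
	q.waiting = append(q.waiting, w) // dequeued later, at grant time
}
```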

For each of the arguments above:
- The roachpb.TenantID is plumbed to find the right tenant heap to queue
  it under (for inter-tenant isolation).
- The roachpb.StoreID is used to find the right store work queue on
  multi-store nodes. We'll also use the StoreID when informing the
  origin node of log entries being admitted[^1].
- We pass in the roachpb.RangeID on behalf of which work is being
  admitted. This, alongside the raftpb.Entry.{Term,Index} for the
  replicated write, uniquely identifies where the write is to end up.
  We use these identifiers to:
  - Return flow tokens on the origin node[^1][^2].
  - Order work in the WorkQueue -- for replicated writes below-raft,
    we ignore CreateTime/epoch-LIFO and instead sort by priority and,
    within a priority, by log position (see the ordering sketch after
    this list).
- For standard work queue ordering, our work item needs to include the
  CreateTime and AdmissionPriority, details that are passed down using
  dedicated raft log entry encodings[^3][^4] as part of the raftpb.Entry
  parameter above.
  - Since the raftpb.Entry encodes within it its origin node[^4], it
    will be used post-admission to dispatch flow tokens to the right
    node. This integration is left to future PRs.
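
The ordering rule from the list above, sketched as an illustrative comparator (simplified types, not the actual WorkQueue code):

```go
// Illustrative below-raft ordering for replicated writes: sort by
// admission priority and, within a priority, by raft log position.
// CreateTime/epoch-LIFO is deliberately not consulted here.
type replicatedWork struct {
	pri   int8   // admissionpb.WorkPriority is an int8
	term  uint64 // raft log position is (term, index)
	index uint64
}

// admitBefore reports whether a should be admitted before b.
func admitBefore(a, b replicatedWork) bool {
	if a.pri != b.pri {
		return a.pri > b.pri // higher priority first
	}
	if a.term != b.term {
		return a.term < b.term // then in log order
	}
	return a.index < b.index
}
```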

We use the above to populate the following fields on a per-work basis
for replicated writes:

    // ReplicatedWorkInfo groups everything needed to admit replicated
    // writes, done so asynchronously below-raft as part of replication
    // admission control.
    type ReplicatedWorkInfo struct {
      RangeID roachpb.RangeID
      Origin roachpb.NodeID
      LogPosition LogPosition
      Ingested bool
    }
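
As a sketch, populating those fields from a raft entry might look like the following; decodeRaftAdmissionMeta is a hypothetical stand-in for the real decoding path, and LogPosition's fields are assumed from the struct above:

```go
// Hypothetical sketch: derive ReplicatedWorkInfo from an entry.
func makeReplicatedWorkInfo(
	rangeID roachpb.RangeID, entry raftpb.Entry, sideloaded bool,
) ReplicatedWorkInfo {
	meta := decodeRaftAdmissionMeta(entry.Data) // hypothetical helper
	return ReplicatedWorkInfo{
		RangeID:     rangeID,
		Origin:      roachpb.NodeID(meta.AdmissionOriginNode),
		LogPosition: LogPosition{Term: entry.Term, Index: entry.Index},
		// Sideloaded entries (AddSSTs) are ingested rather than written
		// through the memtable, which token accounting treats differently.
		Ingested: sideloaded,
	}
}
```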

Since admission is happening below-raft, where the size of the write is
known, we no longer need per-work estimates for upfront IO token
deductions. Since admission is asynchronous, we also don't use the
AdmittedWorkDone interface, which was used to make token adjustments
(without blocking) given the upfront estimates. We still want to
intercept the exact point when some write work gets admitted in order
to inform the origin node so it can release flow tokens. We do so
through the following interface, satisfied by the StoreWorkQueue:

  // onAdmittedReplicatedWork is used to intercept the
  // point-of-admission for replicated writes.
  type onAdmittedReplicatedWork interface {
    admittedReplicatedWork(
      tenantID roachpb.TenantID,
      pri admissionpb.WorkPriority,
      rwi ReplicatedWorkInfo,
      requestedTokens int64,
    )
  }
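
For illustration only, a minimal implementer of that interface might look like this (the real StoreWorkQueue satisfies the interface itself and invokes it right as a virtual work item is dequeued and its tokens deducted; CRDB's util/log and the types above are assumed):

```go
// loggingInterceptor is a hypothetical implementer, invoked at the
// exact point of admission for a replicated write.
type loggingInterceptor struct{}

func (loggingInterceptor) admittedReplicatedWork(
	tenantID roachpb.TenantID,
	pri admissionpb.WorkPriority,
	rwi ReplicatedWorkInfo,
	requestedTokens int64,
) {
	log.Infof(context.Background(),
		"t%v: admitted write for r%v (origin n%v) at %v pri=%v tokens=%d",
		tenantID, rwi.RangeID, rwi.Origin, rwi.LogPosition, pri, requestedTokens)
	// A real implementation would dispatch flow-token returns to
	// rwi.Origin here (see kvflowcontrolpb.AdmittedRaftLogEntries).
}
```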

[^1]: See kvflowcontrolpb.AdmittedRaftLogEntries introduced in cockroachdb#95637.
[^2]: See kvflowcontrol.Handle.{ReturnTokensUpto,DeductTokensFor}
      introduced in cockroachdb#95637. Token deductions and returns are tied to
      raft log positions.
[^3]: See raftlog.EntryEncoding{Standard,Sideloaded}WithAC introduced in
      cockroachdb#95748.
[^4]: See kvflowcontrolpb.RaftAdmissionMeta introduced in cockroachdb#95637.
        message RaftAdmissionMeta {
          int32 admission_priority = ...;
          int64 admission_create_time = ...;
          int32 admission_origin_node = ...;
        }

Release note: None
craig bot pushed a commit that referenced this pull request Mar 13, 2023
97599: admission: support non-blocking {Store,}WorkQueue.Admit() r=irfansharif a=irfansharif

Part of #95563. For end-to-end flow control of replicated writes, we want to enable below-raft admission control through the following API on kvadmission.Controller:
```go
  // AdmitRaftEntry informs admission control of a raft log entry being
  // written to storage (for the given tenant, the specific range, and
  // on the named store).
  AdmitRaftEntry(
    context.Context, roachpb.TenantID,
    roachpb.StoreID, roachpb.RangeID, raftpb.Entry,
  )
```
This serves as the integration point for log entries received below raft, right as they're being written to stable storage. It's a non-blocking interface since we're below-raft and in the raft.Ready() loop. What it effectively does is enqueue a "virtual" work item in the underlying StoreWorkQueue mediating all store IO. This virtual work item is what later gets dequeued once the IO granter informs the work queue of newly available IO tokens. When enqueueing the virtual work item, we still update the StoreWorkQueue's physically-accounted-for bytes since the actual write is not deferred; timely statistic updates improve accuracy for the underlying linear models that map accounted-for writes to observed L0 growth, which in turn inform IO token generation rates.

For each of the arguments above:
- The roachpb.TenantID is plumbed to find the right tenant heap to queue it under (for inter-tenant isolation).
- The roachpb.StoreID is used to find the right store work queue on multi-store nodes. We'll also use the StoreID when informing the origin node of log entries being admitted[^1].
- We pass in the roachpb.RangeID on behalf of which work is being admitted. This, alongside the raftpb.Entry.{Term,Index} for the replicated write, uniquely identifies where the write is to end up. We use these identifiers to return flow tokens on the origin node[^1][^2].
- For standard work queue ordering, our work item needs to include the CreateTime and AdmissionPriority, details that are passed down using dedicated raft log entry encodings[^3][^4] as part of the raftpb.Entry parameter above.
  - Since the raftpb.Entry encodes within it its origin node[^4], it will be used post-admission to dispatch flow tokens to the right node. This integration is left to future PRs.

We use the above to populate the following fields on a per-work basis for replicated writes:
```go
    // ReplicatedWorkInfo groups everything needed to admit replicated
    // writes, done so asynchronously below-raft as part of replication
    // admission control.
    type ReplicatedWorkInfo struct {
      RangeID roachpb.RangeID
      Origin roachpb.NodeID
      LogPosition LogPosition
      Ingested bool
    }
```
Since admission is happening below-raft, where the size of the write is known, we no longer need per-work estimates for upfront IO token deductions. Since admission is asynchronous, we also don't use the AdmittedWorkDone interface, which was used to make token adjustments (without blocking) given the upfront estimates. We still want to intercept the exact point when some write work gets admitted in order to inform the origin node so it can release flow tokens. We do so through the following interface, satisfied by the StoreWorkQueue:
```go
  // onAdmittedReplicatedWork is used to intercept the
  // point-of-admission for replicated writes.
  type onAdmittedReplicatedWork interface {
    admittedReplicatedWork(
      tenantID roachpb.TenantID,
      pri admissionpb.WorkPriority,
      rwi ReplicatedWorkInfo,
      requestedTokens int64,
    )
  }
```

[^1]: See kvflowcontrolpb.AdmittedRaftLogEntries introduced in #95637.
[^2]: See kvflowcontrol.Handle.{ReturnTokensUpto,DeductTokensFor} introduced in #95637. Token deductions and returns are tied to raft log positions.
[^3]: See raftlog.EntryEncoding{Standard,Sideloaded}WithAC introduced in #95748.
[^4]: See kvflowcontrolpb.RaftAdmissionMeta introduced in #95637.

Release note: None


98419: clusterversion: add a gate for new system privileges r=jayshrivastava a=rafiss

A 22.2/23.1 mixed version cluster cannot handle new system privileges well. This commit gates their usage and adds a test.

Without this gate, the included test would fail and users would not be able to log in to nodes running on the old binary.

Epic: None
Release note: None

98495: settingswatcher: version guard support for clusters bootstrapped at old versions r=JeffSwenson a=JeffSwenson

When a cluster is bootstrapping, the SQL server is initialized before the cluster version is populated in the DB. Previously, the version guard utility was unable to handle this state if the version was older than the maxVersion used to initialize the version guard. Now, the versionGuard handles this bootstrapping state by falling back on the in-memory cluster version.

Part of #94843

Release note: none

Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Jeff <[email protected]>