Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvflowcontrol,kvserver: fast follow-on work #104154

Closed
6 of 14 tasks
irfansharif opened this issue May 31, 2023 · 1 comment
Closed
6 of 14 tasks

kvflowcontrol,kvserver: fast follow-on work #104154

irfansharif opened this issue May 31, 2023 · 1 comment
Assignees
Labels
A-admission-control C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-admission-control Admission Control

Comments

@irfansharif
Copy link
Contributor

irfansharif commented May 31, 2023

Is your feature request related to a problem? Please describe.

This issue tracks deferred work from PRs that introduced replication admission control (#95563, #98308). This list is sourced partly using checked-in TODOs ($ rg "#95563" --before-context 4).

  • kvserver: remove above-raft throttling for AddSST #102683
  • Export metric to track total AddSST bandwidth, both proposed and applied (this was prototyped in #93102, addsstable.{proposal,application}_bytes)
    • The aggregated elastic flow tokens deducted metric works well as a substitute
  • kvflowcontroller: mutex contention #105508
    • TODO: Eliminate heap allocation in kvflowcontrol.ContextWithMeta
    • TODO: Write microbenchmark for allocation profile of WorkQueue.Admit, and optionally improve it.
  • TODO: Eliminate OOM possibility when kvadmission.flow_control.mode = apply_to_elastic.
  • TODO: Add a byte/count limit to how many flow token dispatches can be added to a raft message
  • Surface some kvflowcontrol metrics in the admin UI.
  • Address remaining performance cuts identified in kvflowcontrol: annotate/fix perf regressions #109833.
  • Dump vtable state in debug.zip
  • Don't use flow tokens (or AC really) for lease requests (see this comment).
  • Don't use flow tokens or AC when writing to system tables (especially around writes to cluster settings)
  • TODO: Fix over-admission due to AC token acquisition fast path.
  • TODO: Remove O(local stores) lookup for flow control handles.

Epic CRDB-25348

Jira issue: CRDB-28374

Epic CRDB-25348

@irfansharif irfansharif added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-admission-control T-kv KV Team labels May 31, 2023
irfansharif added a commit to irfansharif/cockroach that referenced this issue Jun 12, 2023
Enable kvadmission.flow_control.enabled by default. We didn't observe
noticeable performance regressions while it was disabled (single weekly
run, three nightly runs). There was some minimal fallout that was since
fixed (cockroachdb#104699). We expect performance regressions now that this commit
enables it by default, and expect more fallout. We'll handle these as
part of cockroachdb#104154.

Release note: None
craig bot pushed a commit that referenced this issue Jun 12, 2023
104741: kvflowcontrol: enable by default r=irfansharif a=irfansharif

Enable kvadmission.flow_control.enabled by default. We didn't observe noticeable performance regressions while it was disabled (single weekly run, three nightly runs). There was some minimal fallout that was since fixed (#104699). We expect performance regressions now that this commit enables it by default, and expect more fallout. We'll handle these as part of #104154.

Release note: None

Co-authored-by: irfan sharif <[email protected]>
irfansharif added a commit to irfansharif/cockroach that referenced this issue Jun 14, 2023
Fixes cockroachdb#102683. Part of cockroachdb#104154.

These were added way back in cockroachdb#36403 and cockroachdb#73904, pre-dating much of
IO admission control for leaseholder writes. With cockroachdb#95563, we now have IO
admission control for follower writes. Put together, have ample
LSM read-amp protection through AC alone. These concurrency limiters are
now redundant and oblivious to more sophisticated AC measures. We
recently removed the below-raft equivalents of these limiters (cockroachdb#98762),
and like mentioned there, these limiters can exacerbate memory pressure.
Separately, we're looking to work on speedier restores, and these
limiters are starting to get in the way.

While here, we also disable the pre-ingest delay mechanism in pebble,
which too pre-dates AC, introduced way back in cockroachdb#34258 for RocksDB and in
\cockroachdb#41839 for Pebble. IO AC is able to limit the number of L0 files, and
this pre-ingest delay with its maximum per-request delay time of 5s can
be less than effective. It's worth noting that the L0 file count
threshold at which this pre-ingest delay mechanism kicked in was 20,
while AC aims for 1000[^1].

This commit doesn't go as far as removing these limiters outright,
merely disabling them. This is just out of an overabundance of caution.
We can probably remove them once kvflowcontrol.enabled has had >1
release worth of baking time. Until then, it's nice to know we have
these old safety hatches. We have ample time in the release to assess
fallout from this commit, and also use this increased AddSST concurrency
to stress the kvflowcontrol machinery.

[^1]: The 1000 file limit exists to bound how long it takes to clear L0
      completely. Envelope math cribbed from elsewhere: With 2MiB files,
      1000 files is ~2GB, which at 40MB/s of compaction throughput (with
      a compaction slot consistently dedicated to L0) takes < 60s to
      clear the backlog. So the 'recovery' time is modest in that
      operators should not need to take manual action

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this issue Jul 4, 2023
Fixes cockroachdb#102683. Part of cockroachdb#104154.

These were added way back in cockroachdb#36403 and cockroachdb#73904, pre-dating much of
IO admission control for leaseholder writes. With cockroachdb#95563, we now have IO
admission control for follower writes. Put together, have ample
LSM read-amp protection through AC alone. These concurrency limiters are
now redundant and oblivious to more sophisticated AC measures. We
recently removed the below-raft equivalents of these limiters (cockroachdb#98762),
and like mentioned there, these limiters can exacerbate memory pressure.
Separately, we're looking to work on speedier restores, and these
limiters are starting to get in the way.

While here, we also disable the pre-ingest delay mechanism in pebble,
which too pre-dates AC, introduced way back in cockroachdb#34258 for RocksDB and in
\cockroachdb#41839 for Pebble. IO AC is able to limit the number of L0 files, and
this pre-ingest delay with its maximum per-request delay time of 5s can
be less than effective. It's worth noting that the L0 file count
threshold at which this pre-ingest delay mechanism kicked in was 20,
while AC aims for 1000[^1].

This commit doesn't go as far as removing these limiters outright,
merely disabling them. This is just out of an overabundance of caution.
We can probably remove them once kvflowcontrol.enabled has had >1
release worth of baking time. Until then, it's nice to know we have
these old safety hatches. We have ample time in the release to assess
fallout from this commit, and also use this increased AddSST concurrency
to stress the kvflowcontrol machinery.

[^1]: The 1000 file limit exists to bound how long it takes to clear L0
      completely. Envelope math cribbed from elsewhere: With 2MiB files,
      1000 files is ~2GB, which at 40MB/s of compaction throughput (with
      a compaction slot consistently dedicated to L0) takes < 60s to
      clear the backlog. So the 'recovery' time is modest in that
      operators should not need to take manual action

Release note: None
craig bot pushed a commit that referenced this issue Jul 4, 2023
104861: kvserver: disable pre-AC above-raft AddSST throttling r=irfansharif a=irfansharif

Fixes #102683. Part of #104154.

These were added way back in #36403 and #73904, pre-dating much of IO admission control for leaseholder writes. With #95563, we now have IO admission control for follower writes. Put together, have ample LSM read-amp protection through AC alone. These concurrency limiters are now redundant and oblivious to more sophisticated AC measures. We recently removed the below-raft equivalents of these limiters (#98762), and like mentioned there, these limiters can exacerbate memory pressure. Separately, we're looking to work on speedier restores, and these limiters are starting to get in the way.

While here, we also disable the pre-ingest delay mechanism in pebble, which too pre-dates AC, introduced way back in #34258 for RocksDB and in \#41839 for Pebble. IO AC is able to limit the number of L0 files, and this pre-ingest delay with its maximum per-request delay time of 5s can be less than effective. It's worth noting that the L0 file count threshold at which this pre-ingest delay mechanism kicked in was 20, while AC aims for 1000[^1].

This commit doesn't go as far as removing these limiters outright, merely disabling them. This is just out of an overabundance of caution. We can probably remove them once kvflowcontrol.enabled has had >1 release worth of baking time. Until then, it's nice to know we have these old safety hatches. We have ample time in the release to assess fallout from this commit, and also use this increased AddSST concurrency to stress the kvflowcontrol machinery.

[^1]: The 1000 file limit exists to bound how long it takes to clear L0 completely. Envelope math cribbed from elsewhere: With 2MiB files, 1000 files is ~2GB, which at 40MB/s of compaction throughput (with a compaction slot consistently dedicated to L0) takes < 60s to clear the backlog. So the 'recovery' time is modest in that operators should not need to take manual action.

Release note: None

Co-authored-by: irfan sharif <[email protected]>
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Aug 24, 2023
Previously, when we were skipping flow control for regular work, we were
still encoding raft messages with AC encoding. This would enqueue
essentially no-op AC work items below raft, increasing the possibility
of OOM.

This change skips encoding Raft messages with AC encoding for cases
where we skip flow control. Additionally, we still subject these work
items to above Raft AC if it is enabled.

informs cockroachdb#104154

Release note: None
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Aug 29, 2023
Previously, when we were skipping flow control for regular work, we were
still encoding raft messages with AC encoding. This would enqueue
essentially no-op AC work items below raft, increasing the possibility
of OOM. This is because we would not be pacing the rate of regular work
by below-raft admission rates but it would be consuming memory in
below-raft work queues.

This change skips encoding raft messages with AC encoding for cases
where we bypass flow control. Additionally, we still subject such work
to above-raft IO admission control, if enabled.

Informs cockroachdb#104154.

Release note: None
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Aug 30, 2023
Previously, when we were skipping flow control for regular work, we were
still encoding raft messages with AC encoding. This would enqueue
essentially no-op AC work items below raft, increasing the possibility
of OOM. This is because we would not be pacing the rate of regular work
by below-raft admission rates but it would be consuming memory in
below-raft work queues.

This change skips encoding raft messages with AC encoding for cases
where we bypass flow control. Additionally, we still subject such work
to above-raft IO admission control, if enabled.

Informs cockroachdb#104154.

Release note: None
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Aug 30, 2023
Previously, when we were skipping flow control for regular work, we were
still encoding raft messages with AC encoding. This would enqueue
essentially no-op AC work items below raft, increasing the possibility
of OOM. This is because we would not be pacing the rate of regular work
by below-raft admission rates but it would be consuming memory in
below-raft work queues.

This change skips encoding raft messages with AC encoding for cases
where we bypass flow control. Additionally, we still subject such work
to above-raft IO admission control, if enabled.

Informs cockroachdb#104154.

Release note: None
craig bot pushed a commit that referenced this issue Aug 30, 2023
109446: kvadmission: fix handling of non-elastic work with flow control r=irfansharif a=aadityasondhi

Previously, when we were skipping flow control for regular work, we were
still encoding raft messages with AC encoding. This would enqueue
essentially no-op AC work items below raft, increasing the possibility
of OOM. This is because we would not be pacing the rate of regular work
by below-raft admission rates but it would be consuming memory in
below-raft work queues.

This change skips encoding raft messages with AC encoding for cases
where we bypass flow control. Additionally, we still subject such work
to above-raft IO admission control, if enabled.

Informs #104154.

Release note: None

Co-authored-by: Aaditya Sondhi <[email protected]>
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Sep 19, 2023
This patch makes 2 main performance optimizations in the tracker
codepath:
1. Replace the trackedList with a struct that points to a list. This
   avoids map assignments each time anything is appended to the list.
2. We use a sync.Pool for tracked items to avoid frequent allocations
   and gc for the objects.

Informs cockroachdb#104154.

Release note: None
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Sep 21, 2023
This patch removes the deletion of empty `trackedList` to resuse it in
future runs. Based on benchmarks this helps reduce the heap allocations
in the package.

On `kv0/enc=false/nodes=3/cpu=96` heap profiles, this brings it down to
0.1% from the original 0.3% and much lower than the 0.77% of the
previous commit.

Informs cockroachdb#104154.

Release note: None
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Oct 2, 2023
In `kv0/enc=false/nodes=3/cpu=96`, we noticed mutex contention around
the `outbox` map. This patch tries to alleviate that by moving the mutex
down into each individual dispatch map (sharding by NodeID).

Informs: cockroachdb#104154.

Release note: None
@aadityasondhi aadityasondhi added T-admission-control Admission Control and removed T-kv KV Team labels Oct 3, 2023
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Oct 3, 2023
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Oct 3, 2023
In `kv0/enc=false/nodes=3/cpu=96`, we noticed mutex contention around
the `outbox` map. This patch tries to alleviate that by moving the mutex
down into each individual dispatch map (sharding by NodeID).

Informs: cockroachdb#104154.

Release note: None
@aadityasondhi
Copy link
Collaborator

Discussed offline with @sumeerbhola, closing this issue as the remaining tweaks here are those we don't want to invest time into.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-admission-control C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-admission-control Admission Control
Projects
No open projects
Archived in project
Development

No branches or pull requests

2 participants