Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

admission: enable disk bandwidth bottleneck resource #86857

Closed
4 tasks done
sumeerbhola opened this issue Aug 25, 2022 · 5 comments
Closed
4 tasks done

admission: enable disk bandwidth bottleneck resource #86857

sumeerbhola opened this issue Aug 25, 2022 · 5 comments
Assignees
Labels
A-admission-control C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-admission-control Admission Control

Comments

@sumeerbhola
Copy link
Collaborator

sumeerbhola commented Aug 25, 2022

Followup to #82898 which created the basic infrastructure, configuration scheme, and did experimentation with regular and elastic kv0 traffic.

cc: @irfansharif @andrewbaptist

Jira issue: CRDB-18968

Epic: CRDB-37479

@sumeerbhola sumeerbhola added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-admission-control labels Aug 25, 2022
irfansharif added a commit to irfansharif/cockroach that referenced this issue Aug 23, 2023
Part of cockroachdb#86857.

This commit eliminate the need to provide the disk-name common
environments e.g. linux with store on EBS or GCP PD. To make use of AC's
disk bandwidth tokens, users still need to specify the provisioned
bandwidth, for now. So in a sense this machinery is still "disabled by
default". Next steps:

- automatically measure provisioned bandwidth, using something like
  github.com/irfansharif/probe, gate behind envvars or cluster settings;
- add roachtests that make use of these disk bandwidth tokens;
- roll it out in managed environments;
- roll it out elsewhere.

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this issue Sep 2, 2023
Part of cockroachdb#86857.

This commit eliminate the need to provide the disk-name common
environments e.g. linux with store on EBS or GCP PD. To make use of AC's
disk bandwidth tokens, users still need to specify the provisioned
bandwidth, for now. So in a sense this machinery is still "disabled by
default". They can also do this through the
kvadmission.store.provisioned_bandwidth cluster setting.

Next steps:

- add roachtests that make use of these disk bandwidth tokens;
- automatically measure provisioned bandwidth, using something like
  github.com/irfansharif/probe, gate behind envvars or cluster settings;
- roll it out in managed environments;
- roll it out elsewhere.

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this issue Sep 2, 2023
Part of cockroachdb#86857.

This commit eliminate the need to provide the disk-name common
environments e.g. linux with store on EBS or GCP PD. To make use of AC's
disk bandwidth tokens, users still need to specify the provisioned
bandwidth, for now. So in a sense this machinery is still "disabled by
default". They can also do this through the
kvadmission.store.provisioned_bandwidth cluster setting.

Next steps:

- add roachtests that make use of these disk bandwidth tokens;
- automatically measure provisioned bandwidth, using something like
  github.com/irfansharif/probe, gate behind envvars or cluster settings;
- roll it out in managed environments;
- roll it out elsewhere.

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this issue Sep 2, 2023
Integration test for disk bandwidth tokens, copying over what we ran in
\cockroachdb#82813. Part of cockroachdb#86857

Release note: None
@nicktrav nicktrav added the T-admission-control Admission Control label Dec 21, 2023
@sumeerbhola
Copy link
Collaborator Author

We should remember to also extend the disk bandwidth control to encompass disk writes of sstables due to incoming range snapshots.

CheranMahalingam added a commit to CheranMahalingam/cockroach that referenced this issue Mar 25, 2024
This commit cleans up changes from cockroachdb#119885. There is no longer a need for
users to specify a disk name when specifying the provisioned bandwidth
since we can now automatically infer disk names from the `StoreSpec.Path`
and the underlying block device.

Informs: cockroachdb#86857.

Epic: None.
Release note (ops change): The provisioned-rate field, if specified,
should no longer accept a disk-name or an optional bandwidth field. To
use the disk bandwidth constraint the store-spec must contain
provisioned-rate=bandwidth=<bandwidth-bytes/s>, otherwise the cluster
setting kv.store.admission.provisioned_bandwidth will be used.
craig bot pushed a commit that referenced this issue Mar 26, 2024
120895: admission: remove `DiskName` from `StoreSpec.ProvisionedRateSpec` r=sumeerbhola a=CheranMahalingam

This commit cleans up changes from #119885. There is no longer a need for users to specify a disk name when specifying the provisioned bandwidth since we can now automatically infer disk names from the `StoreSpec.Path` and the underlying block device.

Informs: #86857.

Epic: None.
Release note (ops change): The provisioned-rate field, if specified, should no longer accept a disk-name or an optional bandwidth field. To use the disk bandwidth constraint the store-spec must contain provisioned-rate=bandwidth=<bandwidth-bytes/s>, otherwise the cluster setting kv.store.admission.provisioned_bandwidth will be used.

Co-authored-by: Cheran Mahalingam <[email protected]>
@aadityasondhi aadityasondhi self-assigned this Apr 2, 2024
@jbowens
Copy link
Collaborator

jbowens commented Apr 3, 2024

As a part of online restore, we'll need admission control to control downloading of external sstables so that we can download as fast as we can without affecting foreground workload latencies.

aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Sep 11, 2024
Previously, we would calculate elastic disk bandwidth tokens using
arbitrary load thresholds and an estimate on incoming bytes into the LSM
through flushes and ingestions. This calculation lacked accounting for
write amplification in the LSM.

This patch simplifies the disk bandwidth limiter to remove the disk load
watcher and simply adjust tokens using the known provisioned disk
bandwidth. For token deducation, we create a write-amp model that is a
relationship between incoming LSM bytes to actual disk writes.

The token granting semantics are as follows:
- elastic writes: deduct tokens, and wait for positive count in bucket.
- regular writes: deduct tokens, but proceed even with no tokens
  available.

The requested write bytes are "inflated" using the estimated write
amplification to account for async compactions in the LSM.

This patch also lays the framework for future integrations where we can
account for range snapshot ingestions separately as those don't incur
the same write amplification as flushed LSM bytes do.

Informs: cockroachdb#86857

Release note: None
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Sep 17, 2024
Previously, we would calculate elastic disk bandwidth tokens using
arbitrary load thresholds and an estimate on incoming bytes into the LSM
through flushes and ingestions. This calculation lacked accounting for
write amplification in the LSM.

This patch simplifies the disk bandwidth limiter to remove the disk load
watcher and simply adjust tokens using the known provisioned disk
bandwidth. For token deducation, we create a write-amp model that is a
relationship between incoming LSM bytes to actual disk writes.

The token granting semantics are as follows:
- elastic writes: deduct tokens, and wait for positive count in bucket.
- regular writes: deduct tokens, but proceed even with no tokens
  available.

The requested write bytes are "inflated" using the estimated write
amplification to account for async compactions in the LSM.

This patch also lays the framework for future integrations where we can
account for range snapshot ingestions separately as those don't incur
the same write amplification as flushed LSM bytes do.

Informs: cockroachdb#86857

Release note: None
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Sep 17, 2024
Previously, we would calculate elastic disk bandwidth tokens using
arbitrary load thresholds and an estimate on incoming bytes into the LSM
through flushes and ingestions. This calculation lacked accounting for
write amplification in the LSM.

This patch simplifies the disk bandwidth limiter to remove the disk load
watcher and simply adjust tokens using the known provisioned disk
bandwidth. For token deducation, we create a write-amp model that is a
relationship between incoming LSM bytes to actual disk writes.

The token granting semantics are as follows:
- elastic writes: deduct tokens, and wait for positive count in bucket.
- regular writes: deduct tokens, but proceed even with no tokens
  available.

The requested write bytes are "inflated" using the estimated write
amplification to account for async compactions in the LSM.

This patch also lays the framework for future integrations where we can
account for range snapshot ingestions separately as those don't incur
the same write amplification as flushed LSM bytes do.

Informs: cockroachdb#86857

Release note: None
craig bot pushed a commit that referenced this issue Sep 17, 2024
129005: admission: account for write-amp in disk bandwidth limiter r=sumeerbhola a=aadityasondhi

Previously, we would calculate elastic disk bandwidth tokens using
arbitrary load thresholds and an estimate on incoming bytes into the LSM
through flushes and ingestions. This calculation lacked accounting for
write amplification in the LSM.

This patch simplifies the disk bandwidth limiter to remove the disk load
watcher and simply adjust tokens using the known provisioned disk
bandwidth. For token deducation, we create a write-amp model that is a
relationship between incoming LSM bytes to actual disk writes.

The token granting semantics are as follows:
- elastic writes: deduct tokens, and wait for positive count in bucket.
- regular writes: deduct tokens, but proceed even with no tokens
  available.

The requested write bytes are "inflated" using the estimated write
amplification to account for async compactions in the LSM.

This patch also lays the framework for future integrations where we can
account for range snapshot ingestions separately as those don't incur
the same write amplification as flushed LSM bytes do.

Informs: #86857

Release note: None

Co-authored-by: Aaditya Sondhi <[email protected]>
@dshjoshi
Copy link

@aadityasondhi @nicktrav Is this in progress or done or still in backlog? If it is in progress or done, can we update the status?

@aadityasondhi
Copy link
Collaborator

The functionality should be complete as of #129005.

The remaining work would be to:

  • Enable on internal test clusters (DRT)
  • Enable on cloud clusters
  • Write public docs for users

The enable part means setting the cluster setting or store config flags for the provisioned bandwidth.

aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Nov 12, 2024
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Nov 12, 2024
This patch contains some small improvements to better test the bandwidth
subtest of the snapshot ingest roachtest.

Informs cockroachdb#86857

Release note: None
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Nov 12, 2024
This patch contains some small improvements to better test the bandwidth
subtest of the snapshot ingest roachtest.

Informs cockroachdb#86857

Release note: None
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Nov 13, 2024
This patch contains some small improvements to better test the bandwidth
subtest of the snapshot ingest roachtest.

Informs cockroachdb#86857

Release note: None
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Nov 13, 2024
This patch contains some small improvements to better test the bandwidth
subtest of the snapshot ingest roachtest.

Informs cockroachdb#86857

Release note: None
craig bot pushed a commit that referenced this issue Nov 13, 2024
135019: roachtest: use disk stall utility to limit bandwidth in AC tests r=itsbilal a=aadityasondhi

Informs #86857

Release note: None

Co-authored-by: Aaditya Sondhi <[email protected]>
tbg pushed a commit to tbg/cockroach that referenced this issue Nov 15, 2024
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Nov 26, 2024
This patch contains some small improvements to better test the bandwidth
subtest of the snapshot ingest roachtest.

Informs cockroachdb#86857

Release note: None
craig bot pushed a commit that referenced this issue Nov 26, 2024
135022: roachtest: snapshot ingest roachtest improvements  r=sumeerbhola a=aadityasondhi

This patch contains some small improvements to better test the bandwidth subtest of the snapshot ingest roachtest.

Informs #86857

Release note: None

Co-authored-by: Aaditya Sondhi <[email protected]>
@aadityasondhi
Copy link
Collaborator

At this point, this is complete. The remaining work to enable in cloud is tracked internally in Jira.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-admission-control C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-admission-control Admission Control
Projects
None yet
Development

No branches or pull requests

5 participants