roachtest: admission-control/disk-bandwidth-limiter failed #129534

Closed
cockroach-teamcity opened this issue Aug 23, 2024 · 11 comments · Fixed by #131384
Assignees
Labels:

  • A-admission-control
  • A-storage: Relating to our storage engine (Pebble) on-disk storage.
  • branch-master: Failures and bugs on the master branch.
  • C-test-failure: Broken test (automatically or manually discovered).
  • O-roachtest
  • O-robot: Originated from a bot.
  • P-2: Issues/test failures with a fix SLA of 3 months
  • T-admission-control: Admission Control
  • T-storage: Storage Team

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Aug 23, 2024

roachtest.admission-control/disk-bandwidth-limiter failed with artifacts on master @ c57e04527fbe285402bcadb7f73ce559e85d0c27:

(admission_control_disk_bandwidth_overload.go:219).4: write bandwidth 90.330391 over last exceeded threshold
(cluster.go:2431).Run: context canceled
(cluster.go:2431).Run: context canceled
(monitor.go:154).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/admission-control/disk-bandwidth-limiter/cpu_arch=arm64/run_1

Parameters:

  • ROACHTEST_arch=arm64
  • ROACHTEST_cloud=aws
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=8
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for aws clusters

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

Jira issue: CRDB-41577

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-storage Storage Team labels Aug 23, 2024
@blathers-crl blathers-crl bot added the A-storage Relating to our storage engine (Pebble) on-disk storage. label Aug 23, 2024
@cockroach-teamcity
Member Author

roachtest.admission-control/disk-bandwidth-limiter failed with artifacts on master @ a8f64695f5d025a3347c27ed03ee0e36fbdaacd3:

(admission_control_disk_bandwidth_overload.go:219).4: write bandwidth 82.658516 over last exceeded threshold
(cluster.go:2436).Run: context canceled
(cluster.go:2436).Run: context canceled
(monitor.go:154).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/admission-control/disk-bandwidth-limiter/cpu_arch=arm64/run_1

Parameters:

  • ROACHTEST_arch=arm64
  • ROACHTEST_cloud=aws
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=8
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for aws clusters

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.admission-control/disk-bandwidth-limiter failed with artifacts on master @ 9e48858c6c8b22af4ec1159bcff6e233e7bfddff:

(admission_control_disk_bandwidth_overload.go:219).4: write bandwidth 81.696953 over last exceeded threshold
(cluster.go:2436).Run: context canceled
(cluster.go:2436).Run: context canceled
(monitor.go:154).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/admission-control/disk-bandwidth-limiter/cpu_arch=arm64/run_1

Parameters:

  • ROACHTEST_arch=arm64
  • ROACHTEST_cloud=aws
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=8
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for aws clusters

This test on roachdash | Improve this report!

@aadityasondhi aadityasondhi self-assigned this Aug 26, 2024
@aadityasondhi aadityasondhi added A-admission-control T-admission-control Admission Control and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Aug 26, 2024
@aadityasondhi
Collaborator

Not a release blocker. It seems we may need to be more forgiving in the assertions until the new bandwidth limiter (#129005) lands.

@aadityasondhi aadityasondhi added the P-2 Issues/test failures with a fix SLA of 3 months label Sep 3, 2024
@cockroach-teamcity
Member Author

roachtest.admission-control/disk-bandwidth-limiter failed with artifacts on master @ 7571549d8772a087ab4577d1e770e02582ba5877:

(admission_control_disk_bandwidth_overload.go:183).3: write bandwidth 81.217344 over last exceeded threshold
(cluster.go:2451).Run: context canceled
(cluster.go:2451).Run: context canceled
(monitor.go:154).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/admission-control/disk-bandwidth-limiter/cpu_arch=arm64/run_1

Parameters:

  • ROACHTEST_arch=arm64
  • ROACHTEST_cloud=aws
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=8
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for aws clusters

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.admission-control/disk-bandwidth-limiter failed with artifacts on master @ 128fcab4c07413513a05aea1d1494943f4bc3092:

(admission_control_disk_bandwidth_overload.go:183).3: write bandwidth 78.817266 over last exceeded threshold
(cluster.go:2451).Run: context canceled
(cluster.go:2451).Run: context canceled
(monitor.go:154).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/admission-control/disk-bandwidth-limiter/cpu_arch=arm64/run_1

Parameters:

  • ROACHTEST_arch=arm64
  • ROACHTEST_cloud=aws
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=8
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for aws clusters

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.admission-control/disk-bandwidth-limiter failed with artifacts on master @ cba4e7f1fb6a3963c302cc82a58c42da67adc613:

(admission_control_disk_bandwidth_overload.go:183).3: write bandwidth 79.510469 over last exceeded threshold
(cluster.go:2473).Run: context canceled
(cluster.go:2473).Run: context canceled
(monitor.go:154).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/admission-control/disk-bandwidth-limiter/cpu_arch=arm64/run_1

Parameters:

  • ROACHTEST_arch=arm64
  • ROACHTEST_cloud=aws
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=8
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for aws clusters

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.admission-control/disk-bandwidth-limiter failed with artifacts on master @ c6122f6e6f0d35249a5eef8cab22db49dc43a626:

(admission_control_disk_bandwidth_overload.go:183).3: write bandwidth 80.936406 over last exceeded threshold
(cluster.go:2473).Run: context canceled
(cluster.go:2473).Run: context canceled
(monitor.go:154).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/admission-control/disk-bandwidth-limiter/cpu_arch=arm64/run_1

Parameters:

  • ROACHTEST_arch=arm64
  • ROACHTEST_cloud=aws
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=8
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for aws clusters

This test on roachdash | Improve this report!

@aadityasondhi
Collaborator

This test might need some tweaks: the logs suggest that the foreground workload is exhausting all the tokens and still going over the provisioned value. Below are the last few io_load_listener logs before the cluster was killed. Note that elastic token usage is 0.

teamcity-16965666-1726811391-05-n2cpu8-0001> I240920 06:10:08.336601 471 util/admission/io_load_listener.go:670 ⋮ [T1,Vsystem,n1,s1] 869 IO overload: compaction score 0.000 (0 ssts, 0 sub-levels), L0 growth 57 MiB (write 57 MiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 17193 (0 bypassed) with 64 MiB acc-write (0 B bypassed) + 0 B acc-ingest (0 B bypassed) + 57 MiB adjusted-LSM-writes + 1.1 GiB adjusted-disk-writes + write-model 0.89x+1 B (smoothed 1.02x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + write-amp-model 20.15x+1 B (smoothed 17.48x+1 B) + at-admission-tokens 3.9 KiB, compacted 61 MiB [≈68 MiB], flushed 1.9 GiB [≈498 MiB] (mult 1.00); admitting all; write stalls 0; diskBandwidthLimiter (tokenUtilization 1.21, tokensUsed (elastic 0 B, regular 1.1 GiB) tokens (write 900 MiB (prev 900 MiB)), writeBW 76 MiB/s, readBW 0 B/s, provisioned 75 MiB/s)

teamcity-16965666-1726811391-05-n2cpu8-0001> I240920 06:10:23.336815 471 util/admission/io_load_listener.go:670 ⋮ [T1,Vsystem,n1,s1] 872 IO overload: compaction score 0.000 (0 ssts, 0 sub-levels), L0 growth 58 MiB (write 58 MiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 17841 (0 bypassed) with 67 MiB acc-write (0 B bypassed) + 0 B acc-ingest (0 B bypassed) + 58 MiB adjusted-LSM-writes + 1.1 GiB adjusted-disk-writes + write-model 0.86x+1 B (smoothed 0.94x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + write-amp-model 19.26x+1 B (smoothed 18.37x+1 B) + at-admission-tokens 3.6 KiB, compacted 58 MiB [≈63 MiB], flushed 2.7 GiB [≈498 MiB] (mult 1.00); admitting all; write stalls 0; diskBandwidthLimiter (tokenUtilization 1.33, tokensUsed (elastic 0 B, regular 1.2 GiB) tokens (write 900 MiB (prev 900 MiB)), writeBW 74 MiB/s, readBW 0 B/s, provisioned 75 MiB/s)

teamcity-16965666-1726811391-05-n2cpu8-0001> I240920 06:10:38.337065 471 util/admission/io_load_listener.go:670 ⋮ [T1,Vsystem,n1,s1] 873 IO overload: compaction score 0.050 (13 ssts, 1 sub-levels), L0 growth 85 MiB (write 85 MiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 16829 (0 bypassed) with 63 MiB acc-write (0 B bypassed) + 0 B acc-ingest (0 B bypassed) + 85 MiB adjusted-LSM-writes + 1.2 GiB adjusted-disk-writes + write-model 1.36x+1 B (smoothed 1.15x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + write-amp-model 13.93x+1 B (smoothed 16.15x+1 B) + at-admission-tokens 4.4 KiB, compacted 57 MiB [≈60 MiB], flushed 2.3 GiB [≈498 MiB] (mult 1.00); admitting all; write stalls 0; diskBandwidthLimiter (tokenUtilization 1.20, tokensUsed (elastic 0 B, regular 1.1 GiB) tokens (write 900 MiB (prev 900 MiB)), writeBW 79 MiB/s, readBW 0 B/s, provisioned 75 MiB/s)

teamcity-16965666-1726811391-05-n2cpu8-0001> I240920 06:10:53.337254 471 util/admission/io_load_listener.go:670 ⋮ [T1,Vsystem,n1,s1] 876 IO overload: compaction score 0.050 (5 ssts, 1 sub-levels), L0 growth 57 MiB (write 57 MiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 15439 (0 bypassed) with 58 MiB acc-write (0 B bypassed) + 0 B acc-ingest (0 B bypassed) + 57 MiB adjusted-LSM-writes + 1.2 GiB adjusted-disk-writes + write-model 0.99x+1 B (smoothed 1.07x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + write-amp-model 22.31x+1 B (smoothed 19.23x+1 B) + at-admission-tokens 4.1 KiB, compacted 79 MiB [≈69 MiB], flushed 2.1 GiB [≈498 MiB] (mult 1.00); admitting all; write stalls 0; diskBandwidthLimiter (tokenUtilization 1.20, tokensUsed (elastic 0 B, regular 1.1 GiB) tokens (write 900 MiB (prev 900 MiB)), writeBW 85 MiB/s, readBW 0 B/s, provisioned 75 MiB/s)

teamcity-16965666-1726811391-05-n2cpu8-0001> I240920 06:11:08.337483 471 util/admission/io_load_listener.go:670 ⋮ [T1,Vsystem,n1,s1] 881 IO overload: compaction score 0.000 (0 ssts, 0 sub-levels), L0 growth 57 MiB (write 57 MiB (ignored 0 B) ingest 0 B (ignored 0 B)): requests 15044 (0 bypassed) with 56 MiB acc-write (0 B bypassed) + 0 B acc-ingest (0 B bypassed) + 57 MiB adjusted-LSM-writes + 1.3 GiB adjusted-disk-writes + write-model 1.01x+1 B (smoothed 1.04x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + write-amp-model 22.72x+1 B (smoothed 20.97x+1 B) + at-admission-tokens 4.0 KiB, compacted 63 MiB [≈66 MiB], flushed 2.0 GiB [≈498 MiB] (mult 1.00); admitting all; write stalls 0; diskBandwidthLimiter (tokenUtilization 1.28, tokensUsed (elastic 0 B, regular 1.1 GiB) tokens (write 900 MiB (prev 900 MiB)), writeBW 86 MiB/s, readBW 0 B/s, provisioned 75 MiB/s)
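
To make the figures in these lines easier to compare, here is a minimal Python sketch that pulls the `diskBandwidthLimiter` fragment out of one log line. It is not part of the roachtest. The regular expression is written against the lines quoted above only, and `summarize`, `to_mib`, and the 15-second `interval_s` default are assumptions for illustration. The interval is inferred from the spacing of the quoted timestamps.

```python
import re

# Byte multipliers for the size suffixes that appear in the log lines.
UNITS = {"B": 1, "KiB": 1 << 10, "MiB": 1 << 20, "GiB": 1 << 30}

# Matches the diskBandwidthLimiter fragment at the end of an
# io_load_listener log line, in the form quoted in this thread.
LIMITER_RE = re.compile(
    r"tokenUtilization (?P<util>[\d.]+), "
    r"tokensUsed \(elastic (?P<elastic>[\d.]+ \w+), "
    r"regular (?P<regular>[\d.]+ \w+)\) "
    r"tokens \(write (?P<tokens>[\d.]+ \w+) .*?\), "
    r"writeBW (?P<write_bw>[\d.]+) MiB/s, "
    r"readBW [\d.]+ \w+/s, "
    r"provisioned (?P<provisioned>[\d.]+) MiB/s"
)


def to_mib(size):
    """Convert a string such as '1.1 GiB' to a float number of MiB."""
    value, unit = size.split()
    return float(value) * UNITS[unit] / UNITS["MiB"]


def summarize(line, interval_s=15.0):
    """Extract limiter figures from one log line.

    interval_s is an assumption inferred from the 15-second spacing of the
    quoted log timestamps; it is not stated in the log line itself.
    """
    m = LIMITER_RE.search(line)
    if m is None:
        raise ValueError("no diskBandwidthLimiter fragment found")
    tokens = to_mib(m["tokens"])
    used = to_mib(m["elastic"]) + to_mib(m["regular"])
    return {
        "logged_utilization": float(m["util"]),
        # Recomputed from rounded display values, so it only approximates
        # the logged figure.
        "recomputed_utilization": used / tokens,
        "elastic_mib": to_mib(m["elastic"]),
        "token_rate_mibps": tokens / interval_s,
        "write_over_provisioned": float(m["write_bw"]) / float(m["provisioned"]),
    }
```

For the first quoted line (tokens 900 MiB, regular 1.1 GiB, writeBW 76 MiB/s), this gives a token rate of 60 MiB/s under the 15-second assumption, which is 80% of the provisioned 75 MiB/s. The recomputed utilization is roughly 1.25 against the logged 1.21; the gap is presumably due to the size being displayed rounded to 1.1 GiB.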

@cockroach-teamcity
Member Author

roachtest.admission-control/disk-bandwidth-limiter failed with artifacts on master @ 83589fb87caa92fb42e83994f1691978f37e4cbb:

(admission_control_disk_bandwidth_overload.go:183).3: write bandwidth 81.189029 over last exceeded threshold
(cluster.go:2473).Run: context canceled
(cluster.go:2473).Run: context canceled
(monitor.go:154).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/admission-control/disk-bandwidth-limiter/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=8
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.admission-control/disk-bandwidth-limiter failed with artifacts on master @ 41084720464c4144f64d9ddcb46508b4d762c4e8:

(admission_control_disk_bandwidth_overload.go:183).3: write bandwidth 78.990625 over last exceeded threshold
(cluster.go:2473).Run: context canceled
(cluster.go:2473).Run: context canceled
(monitor.go:154).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/admission-control/disk-bandwidth-limiter/cpu_arch=arm64/run_1

Parameters:

  • ROACHTEST_arch=arm64
  • ROACHTEST_cloud=aws
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=8
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for aws clusters

This test on roachdash | Improve this report!

aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Sep 25, 2024
This patch fixes a few things in this test:
- Runs the first step longer so that the LSM is fuller, which induces
  block and page cache misses and therefore some disk reads.
- Reduces the throughput of the foreground workload, since it was
  causing saturation on its own.
- Asserts on total bandwidth, since the disk bandwidth limiter should
  account for reads when determining tokens.

Fixes cockroachdb#129534.

Release note: None
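
The third change in the patch above, asserting on total rather than write-only bandwidth, can be illustrated with a small Python sketch. This is not the roachtest code, which is written in Go and is not shown in this thread. The function name, its parameters, and the 10% `tolerance` default are all hypothetical.

```python
def check_total_bandwidth(read_mibps, write_mibps, limit_mibps, tolerance=0.1):
    """Check combined read and write bandwidth against a limit.

    All names and the default tolerance are hypothetical. Returns a tuple
    (ok, total), where ok is True when read + write stays within
    limit_mibps scaled by (1 + tolerance).
    """
    total = read_mibps + write_mibps
    return total <= limit_mibps * (1 + tolerance), total
```

With the provisioned 75 MiB/s from the logs above and this hypothetical 10% tolerance, a sample at 76 MiB/s of writes and no reads would pass, while one at 86 MiB/s would fail.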
@cockroach-teamcity
Member Author

roachtest.admission-control/disk-bandwidth-limiter failed with artifacts on master @ 8e6e4090457565a41bc3bd8ea954e437030d1c49:

(admission_control_disk_bandwidth_overload.go:183).3: write bandwidth 85.611094 over last exceeded threshold
(cluster.go:2473).Run: context canceled
(cluster.go:2473).Run: context canceled
(monitor.go:154).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/admission-control/disk-bandwidth-limiter/cpu_arch=arm64/run_1

Parameters:

  • ROACHTEST_arch=arm64
  • ROACHTEST_cloud=aws
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=8
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for aws clusters

This test on roachdash | Improve this report!

craig bot pushed a commit that referenced this issue Sep 26, 2024
131093: storage: disable checkUncertainty on failOnMoreRecent in scanner r=tbg a=tbg

It was possible for reads with failOnMoreRecent to hit a
ReadWithinUncertaintyIntervalError instead of the desired
WriteTooOldError. This commit disables uncertainty checks when
failOnMoreRecent is active, as the latter is a stronger check anyway.

Fixes #119681.
Fixes #131005.

Epic: none
Release note: None

131384: roachtest: admission-control/disk-bandwidth-limiter test improvements r=sumeerbhola a=aadityasondhi

This patch fixes a few things in this test:
- Runs the first step longer so that the LSM is fuller, which induces block and page cache misses and therefore some disk reads.
- Reduces the throughput of the foreground workload, since it was causing saturation on its own.
- Asserts on total bandwidth, since the disk bandwidth limiter should account for reads when determining tokens.

Fixes #129534.

Release note: None

131395: crosscluster/producer: modify lastEmitWait and lastProduceWait computation r=dt a=msbutler

This patch modifies the lastEmitWait and lastProduceWait in the crdb_internal.cluster_replication_node streams vtable to be either the current wait or previous wait, if the event stream is currently waiting on that given state.

Epic: none

Release note: none

Co-authored-by: Tobias Grieger
Co-authored-by: Aaditya Sondhi
Co-authored-by: Michael Butler
@craig craig bot closed this as completed in #131384 Sep 26, 2024
@craig craig bot closed this as completed in 58e396c Sep 26, 2024
@github-project-automation github-project-automation bot moved this from Tests (failures, skipped, flakes) to Done in [Deprecated] Storage Sep 26, 2024
cthumuluru-crdb pushed a commit to cthumuluru-crdb/cockroach that referenced this issue Oct 1, 2024