roachtest: perturbation/metamorphic/decommission failed #135241
This was caused by CPU admission control on n10, with the system waiting on the … Here is a link to the stats at the time. The CPU was reasonable (~50%), the goroutine count was low, and the goroutine scheduler latency was also low. Attached is a profile of a slow operation: 2024-11-15T07_15_40Z-UPSERT-5796.155ms.zip. I think the reason for this slowdown was that the system was so "overprovisioned" for this workload. The config is:
I am not familiar with the test, so I am providing a purely AC overload analysis below. Hopefully this helps in investigating the issue.
What evidence did we have for CPU AC kicking in?
I can see that in the metrics as well. The store queue is for stores (i.e., IO overload). This correlates well with IO overload:
- Store work queue has waiting requests.
- Substantial spike in L0 sublevels.
- Compaction queue building up.
Now looking into the logs. IO load listener:
My take here is that we are seeing IO overload due to the growth of L0 (see metrics above). The reason for the high growth seems to be replication writes that are bypassing AC (from logs). Example:
I am not familiar enough with the test to know whether we expect so many bypassed writes. But if these writes are bypassing AC, it is either an issue with the integration of RAC or the workload is overloading the system. One side thing to rule out would be bandwidth saturation, but I doubt that is the case given this is a GCP cluster (high provisioned bandwidth by default) and we have evidence of high bypassed writes. @andrewbaptist Let me know if this is helpful. It seems to be the classic case of replicated writes overloading the store.
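For anyone repeating this kind of triage, here is a minimal sketch of pulling the store-level signals discussed above (L0 sublevels, admission store work-queue length) out of a running cluster over SQL. The virtual table `crdb_internal.node_metrics` and the exact metric names are assumptions from memory and may differ across versions; this is not the tooling used in this investigation.

```go
// Hedged sketch: poll the store-level metrics the analysis above points at.
// Table and metric names are assumptions and may need adjusting per version.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // Postgres-wire driver used to talk to CockroachDB.
)

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Metric names assumed from the discussion: L0 sublevel count and the
	// admission-control store work-queue length.
	rows, err := db.Query(`
		SELECT store_id, name, value
		FROM crdb_internal.node_metrics
		WHERE name IN ('storage.l0-sublevels', 'admission.wait_queue_length.kv-stores')`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var storeID sql.NullInt64
		var name string
		var value float64
		if err := rows.Scan(&storeID, &name, &value); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("store=%d %s=%v\n", storeID.Int64, name, value)
	}
}
```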
Thanks for the clarification on the AC metric; I had misread it. You are correct that it is due to IO overload. To clarify what this test does: it determines what "50% usage" of the cluster is and runs a constant workload, expecting consistent throughput and latency while it makes a change. The rate for this cluster was 134,832 requests/sec. It then runs a decommission of the node. The decommission runs from … However, here we see throughput drop quite a bit, by about a factor of 4, because some requests experience significant AC delay. What I don't understand is why this test is causing IO overload at all; I expected the snapshots to go straight to L6. The expectation, if the system can't handle the incoming rate, would be to either:
I will take a bit more of a look this afternoon to see why the LSM is getting so inverted.
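For readers unfamiliar with the test, here is a minimal sketch of its shape as described above: measure the maximum sustainable rate, run a constant workload at roughly half of it, decommission a node, and expect throughput and latency to stay near the baseline. Every type and helper below (`latencyStats`, `measureMaxThroughput`, `runWorkloadAt`, `decommissionNode`, `collectLatency`) is a hypothetical stand-in rather than the real roachtest code, and the 3x latency threshold is purely illustrative.

```go
// Hedged sketch of the perturbation test's overall shape; not the real code.
package perturbation

import (
	"context"
	"fmt"
	"time"
)

type latencyStats struct{ p99 time.Duration }

// Hypothetical hooks into the cluster under test.
var (
	measureMaxThroughput func(context.Context) float64
	runWorkloadAt        func(context.Context, float64) (stop func())
	decommissionNode     func(context.Context, int) error
	collectLatency       func(context.Context, time.Duration) latencyStats
)

func runDecommissionPerturbation(ctx context.Context) error {
	// 1. Find the maximum sustainable rate, then target ~50% of it
	//    (the failing run settled on 134,832 requests/sec).
	target := measureMaxThroughput(ctx) / 2

	// 2. Drive a constant-rate workload for the remainder of the test.
	stop := runWorkloadAt(ctx, target)
	defer stop()
	baseline := collectLatency(ctx, 5*time.Minute)

	// 3. Perturb the cluster: decommission one node while the workload runs.
	if err := decommissionNode(ctx, 10 /* arbitrary node ID for the sketch */); err != nil {
		return err
	}

	// 4. Expect throughput and latency to stay close to the baseline; the
	//    failure above saw throughput drop ~4x because requests queued in
	//    admission control.
	perturbed := collectLatency(ctx, 5*time.Minute)
	if perturbed.p99 > 3*baseline.p99 {
		return fmt.Errorf("latency degraded: baseline p99=%s, perturbed p99=%s",
			baseline.p99, perturbed.p99)
	}
	return nil
}
```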
roachtest.perturbation/metamorphic/decommission failed with artifacts on master @ e83bc46aa42f2476b4b11b9703b8038c660dc980:
Parameters:
Following our Slack discussion, I want to wrap up the open thread here. The TL;DR is that we saw L0 sublevel growth simply because bytes into L0 > compacted bytes out of L0. We see from metrics and logs that there were no snapshots being ingested into L0, so pacing snapshots would do nothing to help here. A quick mitigation here is to increase the compaction concurrency for this test using the env variable. Some more details about the above:
Ideally, we will have replication AC for regular traffic. In the meantime, my recommendation is to increase the compaction concurrency for this test, since we have ample CPU. I am going to remove the AC assignment from this issue; there is nothing actionable on AC here, other than having RACv2 for regular traffic, which is tracked separately.
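To make the "bytes into L0 > compacted bytes out of L0" point concrete, here is a toy model (not CockroachDB/Pebble code) of how the L0 backlog evolves and why adding compaction slots is the lever being recommended; all names and numbers are made up for illustration.

```go
// Hedged toy model of the imbalance described above: if bytes flushed or
// ingested into L0 exceed what compactions move out, the L0 backlog (and with
// it the sublevel count and read amplification) keeps growing.
package lsmmodel

// simulateL0 returns the approximate L0 backlog (in bytes) after `steps`
// intervals, given per-interval inflow and a per-compaction-slot drain rate.
func simulateL0(inflowPerSec, drainPerSlotPerSec float64, slots, steps int) float64 {
	backlog := 0.0
	for i := 0; i < steps; i++ {
		backlog += inflowPerSec                        // replication + foreground writes landing in L0
		backlog -= drainPerSlotPerSec * float64(slots) // compactions out of L0
		if backlog < 0 {
			backlog = 0
		}
	}
	return backlog
}

// With inflow > drain the backlog grows without bound, which is what the L0
// sublevel metric showed; raising `slots` (compaction concurrency) is the
// mitigation suggested above, provided there is spare CPU to run them.
```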
Marking as c-bug/a-testing based on the above comment, which suggests increasing the compaction concurrency (configuration related).
Adding a link to #74697, since I don't see a more general story for auto-tuning compaction concurrency. Also a link to the current guidance: https://www.cockroachlabs.com/docs/stable/architecture/storage-layer#compaction. This would be hard to do metamorphically unless there were clearer guidance on how this should be tuned. And if we had that guidance, why wouldn't we encode it in the system rather than in the test framework? I'll watch for additional failures on this test and try to get a set of workarounds for different hardware configurations.
It's like manually tuning any rate: if there is ample room, the guidance is to keep increasing until the desired effect is reached and you don't over-utilize the CPU. We keep the base low to avoid over-utilizing the CPU.
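Here is that tuning rule written down as code, roughly the policy an auto-tuner like the one tracked in #74697 might apply. The function, its inputs, and the thresholds are assumptions for illustration, not CockroachDB's actual behavior.

```go
// Hedged sketch of the manual tuning loop described above, expressed as a
// feedback rule: raise compaction concurrency while L0 sublevels are elevated
// and CPU still has headroom, back off otherwise. Thresholds are assumed.
package tuner

func nextConcurrency(current, max, l0Sublevels int, cpuUtil float64) int {
	const (
		l0Healthy   = 5    // sublevel count considered healthy (assumed)
		cpuHeadroom = 0.80 // don't push CPU utilization past ~80% (assumed)
	)
	switch {
	case l0Sublevels > l0Healthy && cpuUtil < cpuHeadroom && current < max:
		return current + 1 // more compaction slots: more bytes out of L0
	case cpuUtil >= cpuHeadroom && current > 1:
		return current - 1 // protect foreground work from compaction CPU
	default:
		return current // keep the base low when neither pressure applies
	}
}
```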
Thanks! We have actively started working on a possible solution to the problem. @itsbilal, you might find this interesting, as you are starting to look at a design and prototype for such a case.
I think this is a more current version of it: cockroachdb/pebble#1329.
roachtest.perturbation/metamorphic/decommission failed with artifacts on master @ 844d7637f3b4dc1275e8aa05c1cf3bbb1f59f8eb:
Parameters:
roachtest.perturbation/metamorphic/decommission failed with artifacts on master @ d18eb683b2759fd8814dacf0baa913f596074a17:
Parameters:
roachtest.perturbation/metamorphic/decommission failed with artifacts on master @ e8ee6f574ddf1fce1a4cb53f392c5a9baf633b76:
Parameters:
roachtest.perturbation/metamorphic/decommission failed with artifacts on master @ b63a736c85cfc1a968b74863b7f99c89ddebc1d3:
Parameters:
roachtest.perturbation/metamorphic/decommission failed with artifacts on master @ 58422b11aa3ceb11dd5a382a2236d985517e1506:
Parameters:
roachtest.perturbation/metamorphic/decommission failed with artifacts on master @ 47699f3887ad5d1b8c7c5905eb5c49628aa59bbe. A Side-Eye cluster snapshot was captured on timeout: https://app.side-eye.io/#/snapshots/496.
Parameters:
roachtest.perturbation/metamorphic/decommission failed with artifacts on master @ 47699f3887ad5d1b8c7c5905eb5c49628aa59bbe. A Side-Eye cluster snapshot was captured on timeout: https://app.side-eye.io/#/snapshots/499.
Parameters:
roachtest.perturbation/metamorphic/decommission failed with artifacts on master @ efacd11db5f357a69f8b8fd0b10148028d87ed36:
Parameters:
roachtest.perturbation/metamorphic/decommission failed with artifacts on master @ 6610d705724a21c836f3521f75972e65d9e9e2d4:
Parameters:
arch=amd64
cloud=gce
coverageBuild=false
cpu=32
encrypted=false
fs=ext4
localSSD=true
runtimeAssertionsBuild=false
ssd=2
Help
See: roachtest README
See: How To Investigate (internal)
See: Grafana
This test on roachdash | Improve this report!
Jira issue: CRDB-44411