roachtest: perturbation/full/backfill failed #137093
@andrewbaptist I took a stab at this based on the debug guide you wrote. This failure is of the type where the latency during the perturbation period has exceeded the test passing criteria (significantly so), e.g.:
From the logs, there are clear signs that
From Grafana, the only thing that stands out to me is elevated pmax latency on the non-perturbed nodes during the perturbation interval, up to 10s.
From the statement bundles, there are plenty of requests that took over 40ms, but only 2 were in the multiple-seconds range. Both of them seem to involve
From some of the other bundles, there are other slow parts, also involving
But this is expected given the perturbation, right? What am I missing?
@miraradeva Thanks for digging into this. The goal of this test is to demonstrate that backfills do not cause overload on the cluster, so in this case it is correctly failing. The thing to track down is why those 2s requests took so long. I took a look at one of them and I see this:
Which is a very long time to be waiting between those two steps, since they are in-memory. Potentially this request was waiting on a latch, or the scheduler was overloaded. Looking at the graph around the time of the failure (14:38:40), it seems like only one node, n12, is slow. It also has >2s log commit latency. Since this is running with the default AC mode (apply to elastic), we expected it to prevent this. I checked the CPU and goroutine load on this node and didn't see anything unusually high, which made me suspect a lock was held for 2s on this replica (r6268). I also checked if there was anything unusual on this node:
This suggests that n12 was the correct leaseholder and there were not any recent range operations (lease transfers or split/merges). Looking at this time in the n12 logs, we see a lot of messages related to the node being overloaded from 14:38:02 until 14:38:40, such as these:
I looked at the disk metrics and see very high IOPS and write bandwidth on this node. I'm going to try to dig into what happens between those lines that might touch disk (specifically either a read or a syncing write). I don't think there are any writes.
The system has relatively low CPU load (30%) and low P99.9 Go scheduler latency (~1ms). The disk IO is high for both reads and writes, but that normally shouldn't impact raft scheduler latency. The part that is confusing to me is why the P99.9 raft scheduler latency (raft_scheduler_latency) would be 2s while the P99 is only 3-4ms. If the scheduler was overloaded and had a queue, then I'd expect the P99 and P99.9 to move more in sync and be closer to each other (a small illustration of this gap is sketched at the end of this comment). It makes me think one of two things is happening:
Looking through the code, I don't see any replica-level locks, and I think we have 128 shards for a 16 vCPU system, which implies that it is most likely one shard that is slow. However, looking at the slow range IDs, they don't correspond cleanly with a multiple of 16 (or 15), which is what I'd expect if it was a slow shard (a rough way to check this is sketched below). I'll keep this open to see if this recurs.
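As a rough way to sanity-check the single-slow-shard hypothesis, here is a small standalone Go sketch. It assumes the scheduler assigns a range to a shard via `rangeID % numShards`, which is an assumption about the implementation rather than something confirmed here, and the slow range IDs below are hypothetical placeholders:

```go
// shard_check.go: a rough sketch (assuming shard = rangeID % numShards, which
// is an assumption about the scheduler, not verified here) for checking
// whether a set of slow ranges would all land on the same scheduler shard.
package main

import "fmt"

func main() {
	const numShards = 128 // per the comment above: 128 shards on a 16 vCPU node

	// Hypothetical slow range IDs; replace with the ones pulled from the logs.
	slowRanges := []int{6268, 6412, 7031, 9156}

	byShard := make(map[int][]int)
	for _, r := range slowRanges {
		byShard[r%numShards] = append(byShard[r%numShards], r)
	}
	for shard, ranges := range byShard {
		fmt.Printf("shard %3d: ranges %v\n", shard, ranges)
	}
	// If a single shard were the bottleneck, the slow ranges should collapse
	// into one bucket here; a spread across many buckets suggests the
	// slowness is not confined to one scheduler shard.
}
```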
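And to make the earlier P99 vs P99.9 point concrete: a handful of multi-second stalls in an otherwise fast latency distribution moves the P99.9 without moving the P99. A minimal, self-contained Go sketch (not CockroachDB code; the sample counts are invented purely for illustration):

```go
// percentile_gap.go: illustrative only. Shows how ~20 stalled events of ~2s
// among 10,000 fast events push P99.9 to seconds while P99 stays at a few ms.
package main

import (
	"fmt"
	"math/rand"
	"sort"
	"time"
)

func percentile(samples []time.Duration, p float64) time.Duration {
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	return sorted[int(p*float64(len(sorted)-1))]
}

func main() {
	rng := rand.New(rand.NewSource(1))
	var samples []time.Duration
	// 10,000 "normal" events: a few milliseconds of scheduler latency each.
	for i := 0; i < 10000; i++ {
		samples = append(samples, time.Duration(1+rng.Intn(4))*time.Millisecond)
	}
	// A handful of stalled events, e.g. stuck ~2s behind a blocked shard or latch.
	for i := 0; i < 20; i++ {
		samples = append(samples, 2*time.Second)
	}
	fmt.Println("p99  :", percentile(samples, 0.99))  // stays in the ms range
	fmt.Println("p99.9:", percentile(samples, 0.999)) // jumps to ~2s
}
```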
@ajwerner I was thinking the same thing. Having the ability to look at this situation with side-eye could really help in a case like this. Depending on what the root issue turns out to be, some changes that could help are:
roachtest.perturbation/full/backfill failed with artifacts on master @ a653b4e4e6483cec7d65808ba4d55d8c63747a6e:
Parameters:
roachtest.perturbation/full/backfill failed with artifacts on master @ 51693691ed763f700dd06fa2d001cce1ffd42203:
Parameters:
roachtest.perturbation/full/backfill failed with artifacts on master @ 01f9e61532862cbbdbc64180013ca1cb57f4a017:
Parameters:
roachtest.perturbation/full/backfill failed with artifacts on master @ d58f071217505b226a097d39f87072f4e1bd8e06:
Parameters:
roachtest.perturbation/full/backfill failed with artifacts on master @ a0faa4e779ac9e54d17e8400141b0aa472887974:
Parameters:
roachtest.perturbation/full/backfill failed with artifacts on master @ 47699f3887ad5d1b8c7c5905eb5c49628aa59bbe:
Parameters:
roachtest.perturbation/full/backfill failed with artifacts on master @ 6e405b0552362bad6f0df5261376193fbf8b98cd:
Parameters:
roachtest.perturbation/full/backfill failed with artifacts on master @ 57aab736c34ce5dc7988bd53e0604fde48cef441:
Parameters:
roachtest.perturbation/full/backfill failed with artifacts on master @ 58e75b8c97804fea87f8f793665de98098e84b20:
Parameters:
roachtest.perturbation/full/backfill failed with artifacts on master @ 3ce8f44d1e033036783687e3c7ccb125d8de100b:
Parameters:
roachtest.perturbation/full/backfill failed with artifacts on master @ efacd11db5f357a69f8b8fd0b10148028d87ed36:
Parameters:
The perturbation/*/backfill tests are flaky and are failing at least once a week with the default configuration. This change temporarily disables the check to allow easier investigation of the other failure modes, such as backfill failing to complete and node OOMs. Once those are closed and the test is running more stably, this threshold can be dropped.
Fixes: cockroachdb#137093
Fixes: cockroachdb#137392
Informs: cockroachdb#133114
Release note: None
138688: server: fix admin server Settings RPC redaction logic r=kyle-a-wong a=kyle-a-wong
Previously, admin.Settings only allowed admins to view all cluster settings without redaction. If the requester was not an admin, it would use the isReportable field on settings to determine whether the setting should be redacted. This API also had outdated logic, as users with MODIFYCLUSTERSETTINGS should also be able to view all cluster settings (see #115356 for more discussion on this). This patch respects this new role and no longer uses the `isReportable` setting flag to determine if a setting should be redacted. This is implemented by querying `crdb_internal.cluster_settings` directly, allowing the SQL layer to do the permission check. This commit also removes `unredacted_values` from the request entity, since it is no longer necessary. Ultimately, this commit updates the Settings RPC to have the same redaction logic as querying `crdb_internal.cluster_settings` or using `SHOW CLUSTER SETTINGS`.
Epic: None
Fixes: #137698
Release note (general change): The /_admin/v1/settings API now returns cluster settings using the same redaction logic as querying `SHOW CLUSTER SETTINGS` and `crdb_internal.cluster_settings`. This means that only settings flagged as "sensitive" will be redacted; all other settings will be visible. The same authorization is required for this endpoint, meaning the user must be an admin or have the MODIFYCLUSTERSETTINGS or VIEWCLUSTERSETTINGS role to hit this API. The exception is that if the user has VIEWACTIVITY or VIEWACTIVITYREDACTED, they will see console-only settings.

138967: crosscluster/physical: return job id in SHOW TENANT WITH REPLICATION STATUS r=dt a=msbutler
Fixes #138548
Release note (sql change): SHOW TENANT WITH REPLICATION STATUS will now display the `ingestion_job_id` column after the `name` column.

139043: crosscluster/logical: ensure offline scan procs shut down before next phase r=dt a=msbutler
This patch adds a check that attempts to wait for the offline scan processors to spin down before transitioning to steady state ingestion or OnFailOrCancel during an offline scan.
Epic: none
Release note: none

139219: roachtest: disable backfill success check r=stevendanna a=andrewbaptist
The perturbation/*/backfill tests are flaky and are failing at least once a week with the default configuration. This change temporarily disables the check to allow easier investigation of the other failure modes, such as backfill failing to complete and node OOMs. Once those are closed and the test is running more stably, this threshold can be dropped.
Fixes: #137093
Fixes: #137392
Informs: #133114
Release note: None

139259: sql: deflake TestIndexBackfillFractionTracking r=rafiss a=rafiss
Recent changes added some concurrency to index backfills, so the testing hook needs a mutex to prevent concurrent access.
Fixes #139213
Release note: None

Co-authored-by: Kyle Wong <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
roachtest.perturbation/full/backfill failed with artifacts on master @ de3b1220f5c71ac966561505c1b379060fa1407f:
Parameters:
acMode=defaultOption
arch=amd64
blockSize=4096
cloud=gce
coverageBuild=false
cpu=16
diskBandwidthLimit=0
disks=2
encrypted=false
fillDuration=10m0s
fs=ext4
leaseType=epoch
localSSD=true
mem=standard
numNodes=12
numWorkloadNodes=1
perturbationDuration=10m0s
ratioOfMax=0.5
runtimeAssertionsBuild=false
seed=0
splits=10000
ssd=2
validationDuration=5m0s
vcpu=16
Help
See: roachtest README
See: How To Investigate (internal)
See: Grafana
Jira issue: CRDB-45381