roachtest: admission-control/index-backfill failed #105260
Something is severely messed up here, seeing multiple nodes erroring out due to Raft log data loss:
We also saw this in #105261. Let's see if this comes up in any further roachtest failures (ones that are easier to bisect). If it's as widespread as we're seeing here, I'd expect it to blow up all over the place. The following PRs merged yesterday and could possibly be relevant:
Hm, it looks like we hit some other panics here too, in the RPC metrics added around here (line 178 in caaa7d8):
@tbg made some recent changes here in #99191.
We actually saw this across several nodes that did not see the Raft panics:
Rough timeline:
Unfortunately, a lot of the interesting logs here have rotated out, because the logs are spammed with trace events from #102793.
n3 died right after it started up. It managed to apply 2 snapshots, then it errored out.
There are no further logs for r8411, since the logs have all rotated out. Could be an issue with snapshots.
There's something else that's really weird here though. In one of the RPC panics, it saw 2 different IPs for n3 (which is what caused the panic):
These IPs appear to belong to n3 from two different clusters. From
Crosstalk between clusters would definitely explain this, but why didn't the cluster ID checks trip?
Wait a minute... These tests use volume snapshots to set up the test fixture. Does that mean that all clusters from the same fixtures have the same cluster IDs? Maybe this cluster got tangled up with the other cluster in #105261 that also used these snapshots, and also saw Raft panics?
Sure enough, the volume snapshots use a static cluster ID.
I don't know where it's getting these IPs from, but there's definite cross-talk here. I also see us contacting a bunch of other clusters, but these are rejected due to cluster ID mismatches:
Not sure what the best solution here is. I'm going to hand this one over to @irfansharif and @cockroachdb/test-eng.
cc @cockroachdb/test-eng
These two roachtests previously attempted to (opportunistically) share disk snapshots. In cockroachdb#105260 we observed that when running simultaneously, the two clusters end up using the same cluster ID and there's cross-talk from persisted gossip state where we record IP addresses. This commit prevents this snapshot re-use, giving each test its own snapshot. While here, we reduce the cadence of these tests to weekly runs, since they've (otherwise) been non-flaky. Release note: None
roachtest.admission-control/index-backfill failed with artifacts on master @ 4a614f89cea81bf94674d6072c3bbf30502244d4:
Parameters:
Last failure is an infra flake.
cc #103316 (just to keep track of these occurrences).
roachtest.admission-control/index-backfill failed with artifacts on master @ 1eb628e8c8a7eb1cbf9bfa2bd6c31982c25cbba0:
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=gce
ROACHTEST_cpu=8
ROACHTEST_encrypted=false
ROACHTEST_fs=ext4
ROACHTEST_localSSD=false
ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
Jira issue: CRDB-28947