roachtest: failover/chaos/read-only failed #106108
Something's pretty messed up here. Looks like we're losing r1 meta availability somehow. n1 rebalances away its r1 replica at 11:53:20:
Then about a minute later we start seeing the first signs of r1 unavailability:
Notice that this is not 60 seconds after the replica was removed at 11:53:20; the request began around 11:53:10. The only suspicious log entry we see at that time is this one:
We keep seeing signs of r1 unavailability throughout the cluster until the test times out an hour later. However, r1 is alive and well with leader/leaseholder on n6:
We must somehow have stopped gossiping r1 or something. However, we see fresh gossip info for r1 pointing to n6, here from n1:
|
The intent resolution error appears to be a bare cockroach/pkg/kv/kvpb/errors.proto Lines 698 to 726 in eaca25f
cockroach/pkg/kv/kvpb/errors.go Line 383 in eaca25f
Not clear if we can recover the original error, or if it's significant here. |
On n6 (the new r1 leaseholder), we're seeing 9 goroutines stuck on intent resolution. None of the other nodes have this. One of them has been stuck for 65 minutes, since the outage began: Intent resolution goroutine stack
We also see these failed pushes:
|
I'm also seeing a bunch of slow latch acquisitions, but only on n2, across 66 ranges. This began right around the time we started having issues with r1. Unclear if it's causal or just fallout.
It does have 34 write commands that have been replicating for 65 minutes, all of which appear to be txn heartbeats: executeWriteBatch stack
No replica circuit breakers were tripped. |
All of the stuck ranges on n2 follow the same pattern. They have a couple of pending proposals as well as read/write latches. Here's r352's range info from n2: Details
Interestingly, I even found a few of these that were quiesced: Details
At this point, I'm starting to suspect proposal issues. Wdyt @tbg? |
cc @cockroachdb/replication |
Copying over this summary from Slack, to avoid having to piece it together from the above.
|
Starting this investigation from scratch to get my bearings. First step: look at the combined logs and find the first slow latch ("have been waiting" message), which is usually a good way to figure out when replication-related trouble began. This happens on r352, at 15s before that point, we see an The new thing we see at Spot checking a few other ranges, there is a pattern. 85 ranges logged slow latches in the test.
When you look at the distribution on when they got stuck, there is a very strong correlation. These are not independent events - almost all of the slow latches are logged at Details
|
Flipping through the raft state for the affected ranges, for most of them n2 is the leader+leaseholder and all followers are in Details
We see 63 ranges report "raft receive queue is full". Notably, this is fewer than the 85 ranges we have in total, so maybe there is more than one way in which things broke down. Again, there doesn't seem to be a reason why any receive queue should fill up. After all, looking at e.g. r352, its last index is 19. There simply isn't much traffic on these ranges, and they're stuck anyway. |
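For anyone wanting to redo this kind of tally, here is a minimal sketch of counting how many distinct ranges logged a given message. The `r352/` tag shape matched by `rangeIDPattern`, the helper name `distinctRanges`, and the sample lines are all assumptions for illustration; they are not the real CockroachDB log format or tooling.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// rangeIDPattern is an ASSUMED shape for a range tag in a log line (e.g. "r352/").
// Adjust it to whatever the real logs look like.
var rangeIDPattern = regexp.MustCompile(`\br(\d+)/`)

// distinctRanges returns the set of range IDs whose log lines contain msg.
func distinctRanges(logs, msg string) map[string]struct{} {
	seen := make(map[string]struct{})
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		line := sc.Text()
		if !strings.Contains(line, msg) {
			continue
		}
		if m := rangeIDPattern.FindStringSubmatch(line); m != nil {
			seen[m[1]] = struct{}{}
		}
	}
	return seen
}

func main() {
	// Synthetic lines, made up for this sketch; not taken from the test artifacts.
	logs := strings.Join([]string{
		"r352/1 have been waiting",
		"r352/1 raft receive queue is full",
		"r17/2 have been waiting",
		"r352/1 have been waiting",
	}, "\n")

	fmt.Println(len(distinctRanges(logs, "have been waiting")))
	fmt.Println(len(distinctRanges(logs, "raft receive queue is full")))
}
```

Comparing the two sets (rather than just their sizes) would show which ranges logged slow latches without ever reporting a full receive queue.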
Couldn't the receive queues filling up simply be heartbeat responses and such? Normally I'd suspect some kind of mutex deadlock here, but I didn't immediately see anything in the stacks. Could this be some kind of livelock scenario? (EDIT: I originally looked at n1 and n6 stacks, where r1 was stuck, not n2) |
This smelled like some sort of deadlock, so I looked through the stacks. This one caught my eye right away: goroutine 155 [select, 65 minutes]: github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaScanner).RemoveReplica(...)
We seem to have deadlocked on replica removal here, which deadlocks that replica's raftMu. It's not obvious why removing from the scanner would get stuck like this, but all sorts of bad things result from it. Let's find out what the scanner is doing. We're trying to send to this channel: cockroach/pkg/kv/kvserver/scanner.go Lines 184 to 186 in 24f141b
The consumer is this: cockroach/pkg/kv/kvserver/scanner.go Lines 217 to 249 in 24f141b
In the stacks we see that this goroutine... isn't there? That can't be good. Instead, we see
where it's trying to acquire a Replica mutex here: cockroach/pkg/kv/kvserver/store.go Lines 458 to 463 in ee1d5cc
TLDR: there's at least one deadlocked replica mu on n2. An immediate consequence of this is possible raft scheduler starvation. We see lots of the scheduler's goroutines stuck in the same Ok, so can we find out where the mutex leaked? I don't see an obvious way. I went through all goroutines containing The deadlock reminds me of what we saw in https://github.com/cockroachlabs/support/issues/2387#issuecomment-1602346347 but could never track down. Since we now seem to have a way in which this should at least be somewhat reproducible, we can annotate the mutex and hack together a version of #66765. I'll get started on that today. |
Kicking off the repro machine, on #106254. Let's hope that something shakes out soon, if it does then we'll know who leaked the mutex and that should pretty much settle it. If it doesn't repro... uhm, well, I am willing to try pretty hard to have it repro anyway. It happened once, it will happen again. |
I ran this on |
Well, the good news is that with #106267 we seem to actually be able to run this test. The bad news is, it doesn't seem to want to reproduce the original issue here. I ran all of the 🔴 🟢 in #106285 with #106254, i.e. it was doing only the splits and then exiting early, and this never reproduced the problem we are seeing here, over a total of, I'm guessing, 500 runs, including 300 on the SHA of this issue (which predates the other bad commit anyway, so didn't need the revert). Going to try for 9999 runs overnight, on this issue's SHA. |
All of the nightly 1643 runs have passed. |
Just for posterity, I didn't understand why I saw
I was confused by this because we do not see this line printed in the logs cockroach/pkg/cmd/roachtest/tests/failover.go Line 247 in ada8fc8
but we do see the "creating workload database" above the workload init. The workload init does not pass But I found out by code inspection that when the workload creates splits, it also scatters each split: cockroach/pkg/workload/workloadsql/workloadsql.go Lines 183 to 184 in c5a8090
So everything checks out and the repro attempts, sadly, actually made sense. I was hoping that I had missed something that was present in the original execution. |
Predictably all 100 passed. I have since run hundreds more and counting. |
The full test is passing repeatedly too, just as you'd think it would. There is one flake I have hit three times (out of 181 runs of the test),
I've spot checked the logs for one instance and it doesn't look like the deadlock. |
I also ran 3151 more iterations of the "shortened" test overnight. They all passed. edit, 2 days later:
didn't start due to a dumb error on my part. 24h timer starts now. |
😢 PASS failover/chaos/read-only (run 2071) |
Okay, time to throw in the towel:
|
I filed #106568 instead (with release-blocker label for now, even though it's not a new bug). |
#106574 hopefully fixes this. |
roachtest.failover/chaos/read-only failed with artifacts on master @ 428dc9da6a320de218460de6c6c8807caa4ded98:
Parameters: ROACHTEST_arch=amd64, ROACHTEST_cloud=gce, ROACHTEST_cpu=2, ROACHTEST_encrypted=false, ROACHTEST_fs=ext4, ROACHTEST_localSSD=false, ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
This test on roachdash | Improve this report!
Jira issue: CRDB-29407