roachtest: transfer-leases/drain-other-node failed #104304
cockroach/pkg/cmd/roachtest/tests/quit.go Line 94 in c4d85c9
Drain is given 1m to finish, but it takes just a bit longer.
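A minimal sketch of the kind of 1-minute budget being described, assuming the drain is issued by shelling out to `cockroach node drain` against an insecure local node (the actual quit.go helper may differ):

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// drainNode shells out to `cockroach node drain` and gives up after timeout.
// This mirrors the shape of the test's 1-minute budget, not the exact quit.go
// implementation. The binary path, address, and --insecure flag are
// assumptions for a local cluster.
func drainNode(addr string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	cmd := exec.CommandContext(ctx, "./cockroach", "node", "drain", "--insecure", "--host="+addr)
	out, err := cmd.CombinedOutput()
	if ctx.Err() == context.DeadlineExceeded {
		return fmt.Errorf("drain did not finish within %s: %s", timeout, out)
	}
	return err
}

func main() {
	if err := drainNode("localhost:26257", time.Minute); err != nil {
		fmt.Println(err)
	}
}
```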
BTW, what's with the occasional jumps to "node is draining... remaining: 555"? |
In #104185 too we saw drain get stuck-ish. Wonder if it's related. |
Looks like it got stuck on a single range for a while. As @irfansharif mentioned, it just missed the 1-minute timeout while successfully draining (without shutdown).
These log lines are odd. They are probably due to the context being called with cockroach/pkg/kv/kvserver/store.go Line 1533 in 7754c4b
So the test isn't that far off passing (less than a second), but the stuck lease for that range is worth looking into. The lease isn't being transferred because the allocator doesn't find a valid lease target. Lease transfer candidates for the prior and following ranges are found during the first pass:
This indicates that it isn't a store- or node-level condition we are filtering on, but a range-level one. This also occurs for
What range-level state would prevent a replica from being a valid target? We know it is failing here: cockroach/pkg/kv/kvserver/allocator/allocatorimpl/allocator.go Lines 1735 to 1738 in 7754c4b
If we exclude any node/store related checks, the following could invalidate a candidate:
cockroach/pkg/kv/kvserver/allocator/allocatorimpl/allocator.go Lines 1605 to 1606 in 7754c4b
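To make the range-level conditions concrete, here is a rough sketch (not the actual allocator code) of filtering lease-transfer candidates down to voters that the leaseholder does not believe need a Raft snapshot. The `replica` type and helper are invented for illustration; etcd/raft's tracker states stand in for the leader's view of each follower:

```go
package main

import (
	"fmt"

	"go.etcd.io/raft/v3/tracker"
)

// replica is a stand-in for a replica descriptor; the real code uses
// roachpb.ReplicaDescriptor and richer replica types.
type replica struct {
	ReplicaID uint64
	IsVoter   bool
}

// validLeaseTargets keeps only candidates that (a) are voters and (b) are not
// believed by the current leaseholder to need a Raft snapshot. This is roughly
// the shape of the range-level filtering being discussed, not the actual
// allocator implementation.
func validLeaseTargets(cands []replica, progress map[uint64]*tracker.Progress) []replica {
	var out []replica
	for _, c := range cands {
		if !c.IsVoter {
			continue // non-voters can't hold the lease
		}
		pr, ok := progress[c.ReplicaID]
		if !ok || pr.State == tracker.StateSnapshot {
			// The leader thinks this follower needs (or may need) a snapshot,
			// so handing it the lease would be unsafe or ineffective.
			continue
		}
		out = append(out, c)
	}
	return out
}

func main() {
	cands := []replica{{1, true}, {2, true}, {3, false}}
	progress := map[uint64]*tracker.Progress{
		1: {State: tracker.StateReplicate},
		2: {State: tracker.StateSnapshot},
	}
	fmt.Println(validLeaseTargets(cands, progress)) // only replica 1 survives
}
```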
The range log shows the last activity for the range ("Range log entry"):
So I somewhat doubt there aren't any voters; it even shows them in the descriptor being logged on failure throughout. It seems plausible that we are hitting the 2nd condition, where we incorrectly think both of the other replicas may need a snapshot. I filed #105066 to give us an answer here. I can't figure out how we ended up in this weird state, nor how it seemingly resolves itself afterwards. The range status from both
I'm going to leave this test open but remove the release blocker. The behavior is unexpected, but it does eventually succeed. With added observability into why there were no valid candidates, we could dig deeper - I'm suspecting a bug and not just a flake.
|
roachtest.transfer-leases/drain-other-node failed with artifacts on release-22.2 @ 2124c6b9adc61030bedc4a6a67c025566f989201:
Parameters: |
Got a repro with v=5 logging. The culprit is replicas in need of a snapshot!
It is unexpected that a replica requires a snapshot for so long. I'll dig into why the leaseholder believes both of the other voter replicas require a snapshot. |
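The leaseholder's own Raft progress map is where this belief lives. A hedged sketch of reading it via etcd/raft's `Status` type (assuming the `go.etcd.io/raft/v3` packages; the actual investigation below uses CockroachDB's internal range status and extra logging):

```go
package main

import (
	"fmt"

	"go.etcd.io/raft/v3"
	"go.etcd.io/raft/v3/tracker"
)

// followersNeedingSnapshot returns the follower IDs whose progress the leader
// has marked StateSnapshot. In the failure discussed here, the draining
// leaseholder apparently believed this of both other voters.
func followersNeedingSnapshot(st raft.Status) []uint64 {
	var ids []uint64
	for id, pr := range st.Progress {
		if id == st.ID {
			continue // skip the leader's own entry
		}
		if pr.State == tracker.StateSnapshot {
			ids = append(ids, id)
		}
	}
	return ids
}

func main() {
	// Fabricated status for illustration; in practice this comes from the
	// leaseholder replica's RawNode.Status().
	st := raft.Status{}
	st.ID = 2
	st.Progress = map[uint64]tracker.Progress{
		1: {State: tracker.StateSnapshot},
		2: {State: tracker.StateReplicate},
		3: {State: tracker.StateSnapshot},
	}
	fmt.Println(followersNeedingSnapshot(st)) // 1 and 3, in map order
}
```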
Added some more logging. The draining node cannot transfer away the lease because it considers the other replicas to be in
Unfortunately, the generated debug.zip doesn't show any replicas in that state (see r48 range.json).
A question for the @cockroachdb/replication folks – what could lead to a replica not exiting that state? I noticed there's been some recent activity to ensure ranges don't quiesce with followers in it. |
#103827 was backported to 22.2 FWIW. However, there's another related PR, #104212, which wasn't. I wonder if that's a contributing factor here. Btw, I'm seeing a lot of these messages, which I think you've added manually since I don't find them in the code? Would they also prevent lease transfers? Why are they considered non-live? Liveness?
Anyway, ignoring those, we see n2 receive the lease here:
Shortly after, we see a Raft status:
Notice the state that both 2 and 3 (the leader n2) are in. Some seconds later, we see that 1 has now caught up at index 61 and is in it as well:
And then a few seconds later, both followers are in
Here's what I'm guessing happened: once replica 1 caught up to index 61, all replicas were briefly in the healthy, caught-up state, and the followers were then knocked back out of it. Let's see where this could happen:
The last case is the one that would be covered by #104212. Basically, we must've failed to send a heartbeat to the nodes, presumably the heartbeat where we quiesced the range since we otherwise don't send them (or alternatively a previously queued heartbeat). I'll see if I can find any indications of that. There are also a few cases in etcd/raft where we call |
Sure enough, we see n2 failing connections to n1 and n3 right around this time:
I think it's plausible that #104212 would fix this then. I'll give it a try. |
Still saw 1 failure out of 100 runs with #104212. Interestingly, we're seeing the lease count bounce back up to >500 leases occasionally.
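For what it's worth, one way to watch that bounce in a future run is to poll the per-node lease count over SQL; a rough sketch assuming an insecure local cluster and the `crdb_internal.ranges` view (the roachtest has its own way of counting leases):

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq" // Postgres-wire driver; CockroachDB speaks pgwire
)

// pollLeases prints how many range leases the given node holds, every few
// seconds. crdb_internal.ranges is not cheap on large clusters, so this is
// strictly a debugging aid while reproducing the drain behavior.
func pollLeases(db *sql.DB, nodeID int) {
	for range time.Tick(5 * time.Second) { // ticker never stopped; short-lived tool
		var n int
		err := db.QueryRow(
			`SELECT count(*) FROM crdb_internal.ranges WHERE lease_holder = $1`, nodeID,
		).Scan(&n)
		if err != nil {
			log.Printf("lease count query failed: %v", err)
			continue
		}
		fmt.Printf("n%d holds %d leases\n", nodeID, n)
	}
}

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	pollLeases(db, 2) // node ID chosen for illustration
}
```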
Didn't run this with any additional logging or anything, unfortunately.
I can try another run with this and see what falls out. |
Repro branch here: kvoli@d3fb6f0. It means the candidate was filtered by the storepool for being suspect/draining/unavailable/dead. It is a bit rough - I'm thinking we could probably benefit from logging that is on by default (not vmodule) for drain lease transfers in the allocator. There's this issue #105066 which captures this.
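As a rough illustration of that suggestion, contrasting vmodule-gated logging with logging that is on by default, using `pkg/util/log` (the helper, call site, and messages below are invented; only `log.VEventf` and `log.Infof` are real):

```go
package main

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/util/log"
)

// logNoLeaseTarget is a hypothetical helper contrasting the two styles.
// The first line only shows up under --vmodule (e.g. --vmodule=<file>=2),
// which is why the repro branch above had to add logging manually; the
// second would surface the "no valid candidate" reason in a normal drain.
func logNoLeaseTarget(ctx context.Context, rangeID int64, draining bool) {
	log.VEventf(ctx, 2, "no valid lease transfer target for r%d", rangeID)

	if draining {
		log.Infof(ctx, "drain: no valid lease transfer target for r%d", rangeID)
	}
}

func main() {
	logNoLeaseTarget(context.Background(), 48, true)
}
```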
Thanks for the pointers - this was very helpful.
Thanks! Let me know what you find. I'm going to stop actively stressing this test for now. |
I think I found the issue, see #101624 (comment). |
cc @cockroachdb/replication |
Resolved by #106805. |
roachtest.transfer-leases/drain-other-node failed with artifacts on release-22.2 @ cfc200070bbf9fc4b6bc6c6556b5fd56f48db381:
Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=4, ROACHTEST_encrypted=false, ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
This test on roachdash
Jira issue: CRDB-28449
Epic: CRDB-27234