roachtest: gossip/chaos/nodes=9 failed #38829
This happens around the third iteration of stopping the cluster, starting the cluster, and waiting for stabilization. The actual stabilization check is supposed to fatal after 20s: cockroach/pkg/cmd/roachtest/gossip.go, lines 57 to 63 in 0a776f4.
Unfortunately that gossip loop logs to stdout, which is hard to read, but I think all it ever prints is
I think this means that the loop got stuck in this method: cockroach/pkg/cmd/roachtest/gossip.go, lines 38 to 54 in 0a776f4.
When the test runner times out the test, it also panics itself:
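For orientation, here is a rough, hypothetical sketch of the shape of that stabilization check (not the actual gossip.go code): poll each node's view of gossip connectivity and give up after the 20s window mentioned above. The helper names and the error-returning shape are assumptions.

```go
package chaostest

import (
	"fmt"
	"time"
)

// waitForStableGossip is a hypothetical sketch of the check referenced above:
// every second, collect each live node's view of gossip connectivity and
// succeed once all views agree; give up after 20 seconds, mirroring the
// fatal-after-20s behavior described in the comment.
func waitForStableGossip(nodes []int, viewOf func(node int) string) error {
	deadline := time.Now().Add(20 * time.Second)
	for time.Now().Before(deadline) {
		views := map[string][]int{}
		for _, n := range nodes {
			v := viewOf(n) // stand-in for the SQL query against node n
			views[v] = append(views[v], n)
		}
		if len(views) == 1 {
			return nil // all nodes report the same connectivity
		}
		time.Sleep(time.Second)
	}
	return fmt.Errorf("gossip did not stabilize within 20s")
}
```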
39049: roachtest: print to correct logs in gossip/chaos r=nvanbenschoten a=tbg It was printing to Stdout, where the output is hard to find. Touches #38829. Release note: None Co-authored-by: Tobias Schottdorf <[email protected]>
SHA: https://github.com/cockroachdb/cockroach/commits/77f26d185efb436aaac88243de19a27caa5da9b6
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1509340&tab=artifacts#/gossip/chaos/nodes=9

SHA: https://github.com/cockroachdb/cockroach/commits/f9a102814bdce90d687f6215acadf10a9d784c29
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1555992&tab=artifacts#/gossip/chaos/nodes=9

SHA: https://github.com/cockroachdb/cockroach/commits/8b9f54761adc58eb9aecbf9b26f1a7987d8a01e5
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1573251&tab=artifacts#/gossip/chaos/nodes=9

SHA: https://github.com/cockroachdb/cockroach/commits/7dec3577461a1e53ea70582de62bbd96bf512b73
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1614006&tab=artifacts#/gossip/chaos/nodes=9
(roachtest).gossip/chaos/nodes=9 failed on master@e81faed54ee90bdfed1dddc63bb13d1ecf8806da:
Artifacts: /gossip/chaos/nodes=9

(roachtest).gossip/chaos/nodes=9 failed on master@299f2729c546b9a2444665c7ccd4b07e62bc84b5:
Artifacts: /gossip/chaos/nodes=9

(roachtest).gossip/chaos/nodes=9 failed on master@2f1e342c386973c35246bef68c177fcd0b8b609b:
Artifacts: /gossip/chaos/nodes=9

(roachtest).gossip/chaos/nodes=9 failed on master@277e28b2ea78929563e85cb1c9efc573f37408a4:
Artifacts: /gossip/chaos/nodes=9

(roachtest).gossip/chaos/nodes=9 failed on master@dd9b1c1f40dda59ee9d446416106d311ae5ce1e6:
Artifacts: /gossip/chaos/nodes=9

(roachtest).gossip/chaos/nodes=9 failed on master@d54e001248f233734ff926daef5470487c5616b0:
Artifacts: /gossip/chaos/nodes=9

(roachtest).gossip/chaos/nodes=9 failed on master@8f946b8bb62629002c5e958373b81ca9e1920dd6:
Artifacts: /gossip/chaos/nodes=9

(roachtest).gossip/chaos/nodes=9 failed on master@edd46b1405b7928760b65ec8aad59fd22a8adb6b:
Artifacts: /gossip/chaos/nodes=9

(roachtest).gossip/chaos/nodes=9 failed on release-19.1@cf4d9a46193b9fc2c63b7adf89b3b0b5c84adb2b:
Artifacts: /gossip/chaos/nodes=9
(roachtest).gossip/chaos/nodes=9 failed on master@2559992b91b6823110c5a7ef8bc0d61395ca6f28:

(roachtest).gossip/chaos/nodes=9 failed on master@545c96573d36519c27115278187fdd441f54a64d:

(roachtest).gossip/chaos/nodes=9 failed on master@9bf1af0ec1b352419f621d7329fee5cd12d0bcdb:

(roachtest).gossip/chaos/nodes=9 failed on master@08bb94d98b7f60126666e9c5a317afca61422d7f:
From
Notice the timestamps here. Shortly after
Presumably this happened right after
Hmm, I get a better idea of what's going wrong from these logs, but it unfortunately hints at the bowels of raft. It does look like the failure to acquire a lease on the part of n1 is the root cause of the stall. A minute in, the raft status
basically says: I'm a follower, but I don't know of any leader, and yet I'm not campaigning. Maybe it's regularly trying to campaign; we can't tell (since we use PreVote, which doesn't bump the term). Either way, raft isn't getting its act together here. And yet, n1 is definitely a healthy replica of what looks like a range that should work "just fine", since it has members
n2 went down around 20:19:52. All we need for things to work is for someone to get elected raft leader (it's safe to assume that n2 was the raft leader at the time of its demise). But basically nothing happens, until the situation seems to resolve itself a few minutes in:
(the other nodes quickly follow suit). All that is to say, ugh, I think we really need to see the raft chatter here? It looks like
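For concreteness, here is a minimal sketch of the condition that status describes, written against the etcd/raft API that CockroachDB builds on; the import path reflects recent etcd releases, and the vendored version in the failing build may differ.

```go
package chaostest

import "go.etcd.io/etcd/raft/v3"

// stuckFollower reports whether a raft status matches the symptom described
// above: the replica is a follower, yet it has no idea who the leader is.
// Under PreVote, repeated failed campaigns would not bump the term, so this
// state can persist without any visible term changes.
func stuckFollower(st raft.Status) bool {
	return st.RaftState == raft.StateFollower && st.Lead == raft.None
}
```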
Oh, and I should also mention that this is similar to how leases can hang when a replica is gc'able but doesn't know it and gets an inbound request. But then, in this example, we'd expect to see (a) a removal of the n1 replica being committed on other nodes while n1 is down and (b) the n1 replica eventually getting replicaGC'ed, unblocking things. We see neither here.
Deflake `gossip/chaos` by adding a missing `waitForFullReplication`.

This test loops, killing a node and then verifying that the remaining nodes in the cluster stabilize on the same view of gossip connectivity. Periodically the test was failing because gossip wasn't stabilizing. The root issue was that the SQL query to retrieve the gossip connectivity from one node was hanging, and that query was hanging due to unavailability of a range. Logs show that the leaseholder for that range was on a down node and that the range only seemed to contain a single replica. This could happen near the start of the test if we started killing nodes before full replication was achieved.

Fixes cockroachdb#38829

Release note: None
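A minimal sketch of what such a pre-chaos gate could look like, assuming it is phrased as a poll against crdb_internal.ranges; the real roachtest helper may be structured differently.

```go
package chaostest

import (
	"context"
	"database/sql"
	"time"
)

// waitForFullReplication is a hypothetical version of the step the commit
// above adds: before the chaos loop starts killing nodes, poll until every
// range reports at least three replicas, so that losing one node cannot make
// a range unavailable.
func waitForFullReplication(ctx context.Context, db *sql.DB) error {
	for {
		var minReplicas int
		if err := db.QueryRowContext(ctx,
			`SELECT min(array_length(replicas, 1)) FROM crdb_internal.ranges`,
		).Scan(&minReplicas); err != nil {
			return err
		}
		if minReplicas >= 3 {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Second):
		}
	}
}
```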
I've pushed all of the changes I'm running with to #44926. Reproduction amounted to
(roachtest).gossip/chaos/nodes=9 failed on master@de7a7bb804716835579e713ab7264f8af168a14f:

(roachtest).gossip/chaos/nodes=9 failed on master@2739821b911d777fa2a927295d699b559360a802:

(roachtest).gossip/chaos/nodes=9 failed on master@e671a4ef97cbc6cf5d22f8f322fd45733d302094:

(roachtest).gossip/chaos/nodes=9 failed on master@b797cad6d130714748983bc53d4611ddc6151153:

(roachtest).gossip/chaos/nodes=9 failed on master@fc5c7f093bf1e86852c3b839bc0f6710d9902729:
Artifacts: /gossip/chaos/nodes=9
Yeah, this reproduced readily. Again I saw a proposal hanging on the namespace table. Going to throw some more logging in to get at the internals.
Current repro (note the improved message that I picked up via rebasing). We see that a lease request is proposed on s6. Nothing happens for a few minutes, and "nothing" means that we don't even see reproposals, which indicates that the replica is either stuck or quiesced. I assume it is quiesced, because the time of the replicate queue activity on a different node plus the raft election timeout lines up precisely with when things unwedge. Time for some more logging. The replica isn't supposed to quiesce when it has pending proposals, but maybe there's a hole.
Yep, it's quiesced. I'll continue tracking this down tomorrow; it can't be hard now because we can add an assertion.
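A minimal sketch of the kind of assertion being proposed; the replica struct, locking, and proposal map here are purely illustrative stand-ins, not the actual kvserver types.

```go
package chaostest

import (
	"log"
	"sync"
)

// replica is a toy stand-in for the real kvserver Replica.
type replica struct {
	mu struct {
		sync.Mutex
		proposals map[string]struct{} // pending proposals keyed by ID
	}
}

// assertQuiesceable logs (or could fatal) if we are about to quiesce a
// replica that still has proposals in flight, which is exactly the state the
// comments above suspect should be impossible.
func (r *replica) assertQuiesceable() bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if n := len(r.mu.proposals); n > 0 {
		log.Printf("about to quiesce with %d pending proposals", n)
		return false
	}
	return true
}
```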
Interesting that this is looping back around to me (I wrote the quiescence code originally). I wonder if this is something new, or something that has been there a long time. Eagerly awaiting the continuation of this debugging saga.
What I see in the logs is that everybody quiesces well before the proposal comes in (which makes sense, otherwise this would be very rare and it's not). So I think we're looking at a sequence of events in which the proposal does not lead to unquiescing the replica.
It's likely that it does and that this check somehow misfires. The code is here:
Once this is called (as it should be, since we think it got scheduled), we should have
I put logging into each of them (thankfully we always get stuck on r26, so this is easy to target), so the next repro will tell us more.
20 runs passed. This had better repro again now.
Oh, I got one and I see what this is. What breaks is that in the above there's this implicit invariant that "if you put a proposal into raft, there will be a ready". But that's not true, because
so the replica remains quiesced. Quiesced replicas aren't ticked, so nothing ever happens until some queue wakes up the leader. It's fairly clear that we need the replica to wake up in this situation. What I don't understand is... why is it not aware of a leader? It can't be quiesced and not know a leader, for it wouldn't... quiesce in that state? Hmm, it doesn't look like we check that. The system will happily quiesce even if Raft already figured out that the leader is unresponsive: cockroach/pkg/kv/kvserver/store_raft.go, lines 222 to 225 in bc52ebf
This seems like a compelling theory. I'm going to run a few iters with patches.
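A minimal sketch of the fix idea this points to (the actual change is in the PR merged below): treat having flushed proposals into raft as sufficient reason to unquiesce, even when raft produces no Ready because it silently dropped them.

```go
package chaostest

import "go.etcd.io/etcd/raft/v3"

// processRaftReady is a hypothetical, simplified ready loop: if we handed any
// proposals to the RawNode, unquiesce unconditionally, because raft may have
// dropped them (e.g. leader unknown under PreVote) without producing a Ready.
func processRaftReady(rn *raft.RawNode, numFlushedProposals int, unquiesce func()) {
	if numFlushedProposals > 0 {
		// Without this, a dropped proposal leaves the replica quiesced and
		// unticked, and nothing retries until some queue pokes the range.
		unquiesce()
	}
	if rn.HasReady() {
		rd := rn.Ready()
		// ... persist entries and send messages here ...
		rn.Advance(rd)
	}
}
```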
I thought a follower only accepted quiesce messages from the leader. That was the intention, at least. A leader can quiesce itself and its followers. A follower should only quiesce in response to the current leader.
No, that is not the case. The follower checks that the commit index in the quiesce message matches the local one and that the term matches (so the message really is from the leader for that term), but it doesn't actually verify that the local instance knows that this is the leader. I'll be sending a PR either today or tomorrow; right now I'm making sure the repro disappears.
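A minimal sketch of the stricter acceptance check being described, again phrased against etcd/raft's Status type rather than the real kvserver code; the message fields are assumptions about what the quiesce heartbeat carries.

```go
package chaostest

import "go.etcd.io/etcd/raft/v3"

// shouldAcceptQuiesce mirrors the checks described above (term and commit
// index must match the quiesce heartbeat) and adds the missing one: never
// quiesce while the local raft group does not know who the leader is.
func shouldAcceptQuiesce(st raft.Status, msgTerm, msgCommit uint64) bool {
	if st.Term != msgTerm || st.Commit != msgCommit {
		return false
	}
	return st.Lead != raft.None
}
```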
46045: kvserver: avoid hanging proposal after leader goes down r=nvanbenschoten,petermattis a=tbg

Deflakes gossip/chaos/nodes=9, i.e. Fixes #38829.

There was a bug in range quiescence due to which commands would hang in raft for minutes before actually getting replicated. This would occur whenever a range was quiesced but a follower replica which didn't know the (Raft) leader would receive a request. This request would be evaluated and put into the Raft proposal buffer, and a ready check would be enqueued. However, no ready would be produced (since the proposal got dropped by raft; leader unknown) and so the replica would not unquiesce.

This commit prevents this by always waking up the group if the proposal buffer was initially nonempty, even if an empty Ready is produced. It goes further than that by trying to ensure that a leader is always known while quiesced. Previously, on an incoming request to quiesce, we did not verify that the raft group had learned the leader's identity.

One shortcoming here is that in the situation in which the proposal would originally hang "forever", it will now hang for one heartbeat or election timeout where ideally it would be proposed more reactively. Since this is so rare I didn't try to address this. Instead, refer to the ideas in #37906 (comment) and #21849 for future changes that could mitigate this.

Without this PR, the test would fail around 10% of the time. With this change, it passed 40 iterations in a row without a hitch, via:

./bin/roachtest run -u tobias --count 40 --parallelism 10 --cpu-quota 1280 gossip/chaos/nodes=9

Release justification: bug fix

Release note (bug fix): a rare case in which requests to a quiesced range could hang in the KV replication layer was fixed. This would manifest as a message saying "have been waiting ... for proposing" even though no loss of quorum occurred.

Co-authored-by: Peter Mattis <[email protected]>
Co-authored-by: Tobias Schottdorf <[email protected]>
SHA: https://github.com/cockroachdb/cockroach/commits/07607a73daedbf57f47e42e2e3d6cfd529ab65a5
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1382906&tab=buildLog