cmd/roachtest: deflake gossip/chaos roachtest #44926
Conversation
@tbg

We should definitely merge this, but are you sure it fixes #38829? I don't understand that failure mode at all, where it would complete after a few minutes or so (as opposed to getting stuck indefinitely).

Also, more generally, using SQL to get the connectivity is probably the wrong thing to do in this test. In the real world, clusters may have lost quorum and we still want gossip to connect 100% of the time. So the test should check the connectivity in a more robust way. Not putting any of that on anyone's plate, though. I'm fine calling that issue fixed if the test stops failing (as it seems it will).

Definitely fixes the failures I was seeing, which include the ones in #38829. I'm not sure why the SQL query would sometimes return after ~300s. Is there a 5m timeout on some context somewhere? Note that I was seeing that failure mode myself. With this PR it appears to be gone.
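One way to make the stall visible rather than open-ended is to run the connectivity query under an explicit deadline, so that a stuck range surfaces as an error instead of an indefinite hang. The sketch below is illustrative only: the connection string and the crdb_internal.gossip_network query shape are assumptions, not the test's exact SQL.

```go
// Illustrative only: run the gossip-connectivity query with a bounded
// context so a stalled range produces an error instead of hanging.
// The connection string and the crdb_internal.gossip_network query are
// assumptions, not the exact SQL used by the roachtest.
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq" // Postgres-wire driver commonly used against CockroachDB.
)

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Without a deadline, an unavailable range can leave this query (and the
	// test) hanging; with one, the stall shows up as a query error.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	var connectivity string
	err = db.QueryRowContext(ctx,
		`SELECT string_agg(source_id::STRING || ':' || target_id::STRING, ',')
		   FROM crdb_internal.gossip_network`).Scan(&connectivity)
	if err != nil {
		log.Fatalf("connectivity query failed or timed out: %v", err)
	}
	fmt.Println("gossip connectivity:", connectivity)
}
```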
Force-pushed from 2ab0be5 to 38f4bd9.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/cmd/roachtest/gossip.go, line 92 at r1 (raw file):
		}
	}
	c.l.Printf("gossip ok: %s (%0.0fs)\n", expected, timeutil.Since(start).Seconds())
Also converted this spurious fmt.Printf to c.l.Printf.
Running on real clusters produced
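For context, the quoted line sits at the end of a polling loop: each remaining node is asked for its view of gossip connectivity until all views agree, and the elapsed time is then logged. The following is a rough, self-contained sketch of that shape, not the actual gossip.go code; checkNode is a hypothetical stand-in, and the real test logs through the roachtest logger (c.l.Printf) and the repo's timeutil package rather than the standard log and time packages.

```go
// A rough sketch of the polling loop the quoted c.l.Printf line belongs to.
// Not the actual roachtest code: checkNode is a hypothetical stand-in for
// asking one node for its gossip connectivity, and standard log/time are
// used in place of the roachtest logger and timeutil.
package main

import (
	"log"
	"time"
)

// checkNode stands in for querying a single node's view of gossip
// connectivity (in the real test this goes through SQL).
func checkNode(node int) string {
	return "1:2,1:3,2:3" // placeholder connectivity string
}

func waitForGossipStabilization(nodes []int) {
	start := time.Now()
	for {
		expected := checkNode(nodes[0])
		stable := true
		for _, n := range nodes[1:] {
			if checkNode(n) != expected {
				stable = false
				break
			}
		}
		if stable {
			// The line under review: report the agreed-upon connectivity and
			// how long stabilization took.
			log.Printf("gossip ok: %s (%0.0fs)", expected, time.Since(start).Seconds())
			return
		}
		time.Sleep(time.Second) // retry until every node reports the same view
	}
}

func main() {
	waitForGossipStabilization([]int{1, 2, 3, 4})
}
```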
If there were a timeout, the SQL query would return an error, which would lead to a …

t=0: we check n1. We get some connectivity C back.

So this all makes sense; only the fact that the query stalls is problematic. I still think this is likely to be caused by getting a request stuck on a gc'able replica. You've identified … The open question is why that can take minutes. The scanner loop is ~10 minutes by default, though in the scenario of the test we expect a reactive GC, i.e. within seconds, since former peers of the replica ought to be around. With a …
See my new comments on #38829. I believe there are two different failure modes for this test. The first is fixed by this PR and is caused by a replica losing quorum, which causes …

The second failure mode is the SQL query returning successfully in ~5 min. I believe the timer we're hitting is time-until-store-dead. #38829 (comment) indicates that we're failing to query …

PS: I'm at an offsite meeting all day and won't have a chance to look at this at all.
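The ~5 minute figure lines up with the server.time_until_store_dead cluster setting, which defaults to 5m0s. As a reproduction aid (not part of this PR), one might inspect or lower it; a minimal sketch, assuming a local insecure cluster:

```go
// Sketch for reproducing the ~5 minute failure mode: inspect and shorten
// server.time_until_store_dead (default 5m0s, minimum 1m15s) so the
// dead-store transition happens sooner. Connection string is an assumption.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var v string
	if err := db.QueryRow(
		`SHOW CLUSTER SETTING server.time_until_store_dead`).Scan(&v); err != nil {
		log.Fatal(err)
	}
	fmt.Println("time until store dead:", v) // "5m0s" by default

	// Shorten the window so dead-store handling kicks in faster during repro runs.
	if _, err := db.Exec(
		`SET CLUSTER SETTING server.time_until_store_dead = '1m15s'`); err != nil {
		log.Fatal(err)
	}
}
```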
Force-pushed from 38f4bd9 to a7a9146.
This PR is now a part of #46045. Thanks for all the prep work, @petermattis. Turns out there was a real KV bug here.
Deflake gossip/chaos by adding a missing waitForFullReplication. This
test loops, killing a node and then verifying that the remaining nodes
in the cluster stabilize on the same view of gossip connectivity.
Periodically the test was failing because gossip wasn't stabilizing.
The root issue was that the SQL query to retrieve the gossip
connectivity from one node was hanging. And that query was hanging due
to unavailability of a range. Logs show that the leaseholder for that
range was on a down node and that the range only seemed to contain a
single replica. This could happen near the start of the test if we
started killing nodes before full replication was achieved.
Fixes #38829
Release note: None
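To illustrate the fix described above: the idea behind waitForFullReplication is to hold off on killing nodes until every range has its full complement of replicas. A minimal sketch of that idea follows; it is not the roachtest helper itself, and the crdb_internal.ranges query shape, the three-replica target, and the connection string are assumptions.

```go
// Sketch of the "wait for full replication" idea from the commit message:
// poll until no range reports fewer than three replicas before starting to
// kill nodes. Not the actual roachtest helper; the crdb_internal.ranges
// query and connection string are assumptions.
package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq"
)

func waitForFullReplication(db *sql.DB) error {
	for {
		var underReplicated int
		err := db.QueryRow(
			`SELECT count(*) FROM crdb_internal.ranges
			  WHERE array_length(replicas, 1) < 3`).Scan(&underReplicated)
		if err != nil {
			return err
		}
		if underReplicated == 0 {
			return nil // every range is fully replicated; safe to start chaos
		}
		fmt.Printf("%d ranges still under-replicated; waiting\n", underReplicated)
		time.Sleep(time.Second)
	}
}

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := waitForFullReplication(db); err != nil {
		log.Fatal(err)
	}
	log.Println("cluster fully replicated; chaos can begin")
}
```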