-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tests: deflake test that joins a server with non-voting servers to form quorum #12130
Conversation
…rm qourum This PR - upgrades the serf library - has the test start the join process using the un-joined server first - disables schedulers on the servers - uses the WaitForLeader and wantPeers helpers Not sure which, if any of these actually improves the flakiness of this test.
213fc39
to
bd03d25
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
The flake has been happening so frequently that do we maybe want to just re-run CI a few times on this PR to see if it happens again? I was running into it maybe 50/50 so we could build confidence in the fix quickly. (Or because it seems like a good change overall, we could also just go ahead and merge and see if it helps.)
5 winners in a row in https://app.circleci.com/pipelines/github/hashicorp/nomad?branch=flakey-serf-non-voter&filter=all |
PR #12130 refactored the test to use the `wantPeers` helper, but this function only returns the number of voting peers, which in this test should be equal to 2. I think the tests were passing back them because of a bug in Raft (hashicorp/raft#483) where a non-voting server was able to transition to candidate state. One possible evidence of this is that a successful test run would have the following log line: ``` [email protected]/raft.go:1058: nomad.raft: updating configuration: command=AddVoter server-id=127.0.0.1:9101 server-addr=127.0.0.1:9101 servers="[{Suffrage:Voter ID:127.0.0.1:9107 Address:127.0.0.1:9107} {Suffrage:Voter ID:127.0.0.1:9105 Address:127.0.0.1:9105} {Suffrage:Voter ID:127.0.0.1:9103 Address:127.0.0.1:9103} {Suffrage:Voter ID:127.0.0.1:9101 Address:127.0.0.1:9101}]" ``` This commit reverts the test logic to check for peer count, regardless of voting status.
PR #12130 refactored the test to use the `wantPeers` helper, but this function only returns the number of voting peers, which in this test should be equal to 2. I think the tests were passing back them because of a bug in Raft (hashicorp/raft#483) where a non-voting server was able to transition to candidate state. One possible evidence of this is that a successful test run would have the following log line: ``` [email protected]/raft.go:1058: nomad.raft: updating configuration: command=AddVoter server-id=127.0.0.1:9101 server-addr=127.0.0.1:9101 servers="[{Suffrage:Voter ID:127.0.0.1:9107 Address:127.0.0.1:9107} {Suffrage:Voter ID:127.0.0.1:9105 Address:127.0.0.1:9105} {Suffrage:Voter ID:127.0.0.1:9103 Address:127.0.0.1:9103} {Suffrage:Voter ID:127.0.0.1:9101 Address:127.0.0.1:9101}]" ``` This commit reverts the test logic to check for peer count, regardless of voting status.
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
This PR
Not sure which, if any of these actually improves the flakiness of this test.
Closes #12115 ... hopefully 🐈