a NonVoter node should never be able to transition to a Candidate state #483
Conversation
I think this looks good. It's a shame it's so hard to validate deterministically with a unit test.
This is probably another case for our "one day" refactor of this library to make it easier to mock and drive in a controlled way in unit tests. I don't think we should block this fix on that though as it seems pretty clear that it's never right for a non-voter to enter Candidate state.
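For readers skimming this thread, here is a minimal standalone sketch of the invariant being discussed, not the actual patch: before a follower starts an election on heartbeat timeout, it should check that the local server has voting rights in the latest known configuration. The type and helper names below are illustrative; hashicorp/raft has its own internal equivalents.

```go
package raftsketch

// ServerSuffrage mirrors the voting status of a server in the cluster
// configuration: voters take part in elections, non-voters only replicate.
type ServerSuffrage int

const (
	Voter ServerSuffrage = iota
	Nonvoter
)

// Server is a simplified view of one entry in the Raft configuration.
type Server struct {
	ID       string
	Suffrage ServerSuffrage
}

// Configuration is a simplified latest-known cluster configuration.
type Configuration struct {
	Servers []Server
}

// hasVote reports whether the server with the given ID is a voter.
func hasVote(c Configuration, id string) bool {
	for _, s := range c.Servers {
		if s.ID == id {
			return s.Suffrage == Voter
		}
	}
	return false
}

// shouldStartElection captures the invariant this PR enforces: on heartbeat
// timeout, a follower may only transition to Candidate if it is a voter in
// the latest configuration; a non-voter must stay a follower.
func shouldStartElection(latest Configuration, localID string) bool {
	return hasVote(latest, localID)
}
```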
@banks reading your comment about testing I had the idea to test this at a more unit level, to at least avoid a future regression and have the transition out of follower state defined in a test. Let me know what you think about that?
go env1.raft.runFollower()

// wait enough time for HeartbeatTimeout to be reached
time.Sleep(tt.fields.conf.HeartbeatTimeout * 3)
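As a rough sketch of the shape such a unit test could take (the `newFollowerTestEnv` helper and the `env`/`conf` fields below are hypothetical stand-ins for whatever harness builds a non-voting instance without starting the background goroutines; `runFollower`, `getState`, `Follower`, and `Nonvoter` are identifiers from the raft package itself):

```go
package raft

import (
	"testing"
	"time"
)

// Sketch only: newFollowerTestEnv is a hypothetical helper that builds a
// Raft instance configured as a non-voter, without starting the usual
// background goroutines, so runFollower is the only loop running.
func TestRunFollower_NonVoterNeverBecomesCandidate(t *testing.T) {
	env := newFollowerTestEnv(t, Nonvoter)

	// Drive only the follower loop, as in the diff above.
	go env.raft.runFollower()

	// Give HeartbeatTimeout plenty of time to fire at least once.
	time.Sleep(env.conf.HeartbeatTimeout * 3)

	// A non-voter must still be a follower; only voters may start elections.
	if got := env.raft.getState(); got != Follower {
		t.Fatalf("expected Follower state, got %v", got)
	}
}
```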
I like having a test for this. I'm not sure how reliable this will be over time given that it's concurrent and timing dependent. Perhaps worth re-running this in CI a few times and seeing if it fails? We could increase the multiplier here if it does too I guess.
I ran it locally a few times, and I will try it on the CI. The wait is deterministic because the follower loop will run after `HeartbeatTimeout`, and there is no concurrency other than the test and the `runFollower` loop (because we avoid starting the Raft routines), so the state is in a controlled environment.
Sounds low risk then. In the past we've found even "safe" waits like this can eventually end up causing flakes in highly CPU-starved CI environments, where `t.Parallel` means that many more tests are running than there are real CPU cores available, so scheduling delays can last hundreds of milliseconds. If we also avoid running in parallel, that probably helps too.
For now it seems OK; I just wanted to note that, based on previous tests that have ended up flaky, even 3 * heartbeat might not be enough in the long run!
Yes, in Consul we have a lot of those. I will retry the CI a couple of times before merging to see how it holds up.
PR #12130 refactored the test to use the `wantPeers` helper, but this function only returns the number of voting peers, which in this test should be equal to 2. I think the tests were passing back then because of a bug in Raft (hashicorp/raft#483) where a non-voting server was able to transition to candidate state. One piece of evidence for this is that a successful test run would have the following log line:

```
[email protected]/raft.go:1058: nomad.raft: updating configuration: command=AddVoter server-id=127.0.0.1:9101 server-addr=127.0.0.1:9101 servers="[{Suffrage:Voter ID:127.0.0.1:9107 Address:127.0.0.1:9107} {Suffrage:Voter ID:127.0.0.1:9105 Address:127.0.0.1:9105} {Suffrage:Voter ID:127.0.0.1:9103 Address:127.0.0.1:9103} {Suffrage:Voter ID:127.0.0.1:9101 Address:127.0.0.1:9101}]"
```

This commit reverts the test logic to check for peer count, regardless of voting status.
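For context, counting all peers versus only voters with the public hashicorp/raft API looks roughly like this (a sketch, not Nomad's actual `wantPeers` helper):

```go
package raftutil

import "github.com/hashicorp/raft"

// countPeers returns the total number of servers in the latest cluster
// configuration and how many of them are voters. A check that only counts
// voters misses non-voting servers, which is the distinction above.
func countPeers(r *raft.Raft) (total, voters int, err error) {
	future := r.GetConfiguration()
	if err := future.Error(); err != nil {
		return 0, 0, err
	}
	for _, srv := range future.Configuration().Servers {
		total++
		if srv.Suffrage == raft.Voter {
			voters++
		}
	}
	return total, voters, nil
}
```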
This behaviour was observed in Consul for read-replica nodes (which are Raft nonVoter nodes). When the cluster is not stable (e.g. during a leader transition), it is possible for a nonVoter node to step up as leader, which is not supposed to happen.
@mkeeler and I were able to reproduce the behaviour of a nonVoter node transitioning to the `Candidate` state, which is an indication that this node could possibly win an election. The test does the following steps:

1. Send an `appendEntry` log to the nonVoter to update the latest configuration.
2. Before the `appendEntry` that commits the new configuration is sent, partition the leader.
3. Observe the nonVoter transition to the `Candidate` state.

Unfortunately step 3 is very time sensitive and we were not able to come up with a test that would consistently reproduce the issue, but we were able to confirm that after the fix in this PR the issue could not be reproduced.
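Although a deterministic end-to-end reproduction proved elusive, the assertion at the heart of step 3 can be expressed against the public API along these lines (a sketch only; cluster setup and the leader partition are assumed to be handled by the test harness and are not shown):

```go
package raftrepro

import (
	"testing"
	"time"

	"github.com/hashicorp/raft"
)

// checkNonVoterStaysFollower polls the non-voter for several heartbeat
// timeouts after the leader has been partitioned away, and fails as soon as
// it observes a Candidate (or Leader) state. Polling rather than a single
// sleep-and-check catches the transition even if it is brief.
func checkNonVoterStaysFollower(t *testing.T, nonVoter *raft.Raft, heartbeatTimeout time.Duration) {
	t.Helper()
	deadline := time.Now().Add(3 * heartbeatTimeout)
	for time.Now().Before(deadline) {
		if s := nonVoter.State(); s == raft.Candidate || s == raft.Leader {
			t.Fatalf("non-voter reached %v state", s)
		}
		time.Sleep(10 * time.Millisecond)
	}
}
```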