core: store and check for Raft version changes #12362
Conversation
Downgrading the Raft protocol version is not a supported operation. Checking for a downgrade is hard since this information is not stored in any persistent place. When a server re-joins a cluster with a prior Raft version, the Serf tag is updated, so Nomad can't tell that the version changed.

Mixed version clusters must be supported to allow for zero-downtime rolling upgrades, during which the cluster is expected to have mixed Raft versions. Enforcing strong version consistency would disrupt this flow.

The approach taken here is to store the Raft version on disk. When the server starts, the `raft_protocol` value is written to the file `data_dir/raft/version`. If that file already exists, its content is checked against the current `raft_protocol` value to detect downgrades and prevent the server from starting.

Any other type of error is ignored to prevent disruptions that are outside the control of operators. The only option in cases of an invalid or corrupt file would be to delete it, making this check useless. So the file content is just overwritten with the new version, and guidance is provided on how operators can check that their cluster is in the expected state.

Closes #11867
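A minimal sketch of the check described above, assuming a hypothetical helper name (`checkRaftVersionFile`); the actual function, file handling, and error messages in this PR may differ:

```go
package nomad

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// checkRaftVersionFile compares the configured Raft protocol version against
// the one stored in data_dir/raft/version, refusing to start on a downgrade,
// and then records the current version for future startups.
func checkRaftVersionFile(dataDir string, current int) error {
	path := filepath.Join(dataDir, "raft", "version")

	if b, err := os.ReadFile(path); err == nil {
		prev, parseErr := strconv.Atoi(strings.TrimSpace(string(b)))
		if parseErr == nil && current < prev {
			// Downgrades are not supported: prevent the server from starting.
			return fmt.Errorf("raft protocol version downgrade from %d to %d is not supported", prev, current)
		}
		// A corrupt or unparsable file is ignored and simply overwritten
		// below, since the only recourse would be deleting it anyway.
	}

	// Write (or overwrite) the stored version with the current value.
	return os.WriteFile(path, []byte(strconv.Itoa(current)), 0o600)
}
```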
nomad/testing.go (Outdated)

	return s, c
}

func TestServerWithErr(t *testing.T, cb func(*Config)) (*Server, func(), error) {
I'm not sure if this is the best approach. I need to test that the server doesn't start, but the test was always failing due to the t.Fatalf. I found this other approach, but it doesn't sound right for this case.
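For context, a hedged sketch of how a helper like this might be used to assert that startup fails; the config field and assertions here are illustrative, not the PR's actual test code:

```go
func TestServer_RaftVersionDowngrade(t *testing.T) {
	// TestServerWithErr returns the startup error instead of calling
	// t.Fatalf, so the test itself can decide whether failure is expected.
	_, cleanup, err := TestServerWithErr(t, func(c *Config) {
		// Illustrative only: configure a Raft protocol version lower than
		// the one assumed to be already stored on disk.
		c.RaftConfig.ProtocolVersion = 2
	})
	if cleanup != nil {
		defer cleanup()
	}
	if err == nil {
		t.Fatal("expected server start to fail on a Raft protocol downgrade")
	}
}
```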
Seems reasonable
LGTM overall, my comments are mostly nitpicking over error messages 😀
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.