core: store and check for Raft version changes #12362
Conversation
Downgrading the Raft protocol version is not a supported operation. Checking for a downgrade is hard since this information is not stored in any persistent place. When a server re-joins a cluster with a prior Raft version, the Serf tag is updated, so Nomad can't tell that the version changed.

Mixed version clusters must be supported to allow for zero-downtime rolling upgrades, during which the cluster is expected to have mixed Raft versions. Enforcing strong version consistency would disrupt this flow.

The approach taken here is to store the Raft version on disk. When the server starts, the `raft_protocol` value is written to the file `data_dir/raft/version`. If that file already exists, its content is checked against the current `raft_protocol` value to detect downgrades and prevent the server from starting.

Any other type of error is ignored to prevent disruptions that are outside the control of operators. The only option in cases of an invalid or corrupt file would be to delete it, making this check useless. So the file content is just overwritten with the new version, and guidance is provided on how operators can check that their cluster is in the expected state.

Closes #11867
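A minimal sketch of the check described above, assuming a hypothetical helper name (`checkRaftVersionFile`); the actual function, file handling, and error messages in this PR may differ:

```go
package nomad

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// checkRaftVersionFile compares the configured Raft protocol version against
// the one stored in data_dir/raft/version, refusing to start on a downgrade,
// and then records the current version for future startups.
func checkRaftVersionFile(dataDir string, current int) error {
	path := filepath.Join(dataDir, "raft", "version")

	if b, err := os.ReadFile(path); err == nil {
		prev, parseErr := strconv.Atoi(strings.TrimSpace(string(b)))
		if parseErr == nil && current < prev {
			// Downgrades are not supported: prevent the server from starting.
			return fmt.Errorf("raft protocol version downgrade from %d to %d is not supported", prev, current)
		}
		// A corrupt or unparsable file is ignored and simply overwritten
		// below, since the only recourse would be deleting it anyway.
	}

	// Write (or overwrite) the stored version with the current value.
	return os.WriteFile(path, []byte(strconv.Itoa(current)), 0o600)
}
```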
nomad/testing.go (Outdated)

	return s, c
}

func TestServerWithErr(t *testing.T, cb func(*Config)) (*Server, func(), error) {
I'm not sure if this is the best approach. I need to test that the server doesn't start, but the test was always failing due to the t.Fatalf. I found this other approach, but it doesn't sound right for this case.
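For context, a hedged sketch of how a helper like this might be used to assert that startup fails; the config field and assertions here are illustrative, not the PR's actual test code:

```go
func TestServer_RaftVersionDowngrade(t *testing.T) {
	// TestServerWithErr returns the startup error instead of calling
	// t.Fatalf, so the test itself can decide whether failure is expected.
	_, cleanup, err := TestServerWithErr(t, func(c *Config) {
		// Illustrative only: configure a Raft protocol version lower than
		// the one assumed to be already stored on disk.
		c.RaftConfig.ProtocolVersion = 2
	})
	if cleanup != nil {
		defer cleanup()
	}
	if err == nil {
		t.Fatal("expected server start to fail on a Raft protocol downgrade")
	}
}
```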
Seems reasonable
LGTM overall, my comments are mostly nitpicking over error messages 😀
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.