From 4daef84731a82e7d5717b1f4a2b23f0724d6ae6e Mon Sep 17 00:00:00 2001
From: Tim Gross
Date: Wed, 24 Nov 2021 11:37:31 -0500
Subject: [PATCH 1/3] raft: default to protocol v3

Many of Nomad's Autopilot features require raft protocol version 3. Set
the default raft protocol to 3, and improve the upgrade documentation.
---
 command/agent/config.go                       |  1 +
 website/content/docs/configuration/server.mdx |  2 +-
 .../content/docs/upgrade/upgrade-specific.mdx | 51 ++++++++++++++-----
 3 files changed, 41 insertions(+), 13 deletions(-)

diff --git a/command/agent/config.go b/command/agent/config.go
index 4040355d113..c355eea0df1 100644
--- a/command/agent/config.go
+++ b/command/agent/config.go
@@ -953,6 +953,7 @@ func DefaultConfig() *Config {
 			Enabled: false,
 			EnableEventBroker: helper.BoolToPtr(true),
 			EventBufferSize: helper.IntToPtr(100),
+			RaftProtocol: 3,
 			StartJoin: []string{},
 			ServerJoin: &ServerJoin{
 				RetryJoin: []string{},
diff --git a/website/content/docs/configuration/server.mdx b/website/content/docs/configuration/server.mdx
index 7b4de47ee0b..751562d1886 100644
--- a/website/content/docs/configuration/server.mdx
+++ b/website/content/docs/configuration/server.mdx
@@ -161,7 +161,7 @@ server {
   required as the agent internally knows the latest version, but may be useful
   in some upgrade scenarios.
 
-- `raft_protocol` `(int: 2)` - Specifies the Raft protocol version to use when
+- `raft_protocol` `(int: 3)` - Specifies the Raft protocol version to use when
   communicating with other Nomad servers. This affects available Autopilot
   features and is typically not required as the agent internally knows the
   latest version, but may be useful in some upgrade scenarios.
diff --git a/website/content/docs/upgrade/upgrade-specific.mdx b/website/content/docs/upgrade/upgrade-specific.mdx
index 713fe62e46f..e9967f98919 100644
--- a/website/content/docs/upgrade/upgrade-specific.mdx
+++ b/website/content/docs/upgrade/upgrade-specific.mdx
@@ -13,6 +13,15 @@ upgrade.
However, specific versions of Nomad may have more details provided for
 their upgrades as a result of new features or changed behavior. This page is
 used to document those details separately from the standard upgrade flow.
 
+## Nomad 1.3.0
+
+#### Default Raft Protocol Version
+
+In Nomad 1.3.0, the default raft protocol version has been updated
+to 3. If the [`raft_protocol_version`] is not explicitly set,
+upgrading a server will automatically upgrade that server's raft
+protocol. See the [Upgrading to Raft Protocol 3] guide below.
+
 ## Nomad 1.2.4
 
 #### `nomad eval status -json` deprecated
@@ -959,7 +968,7 @@ will be interpolated properly. Please see the
 ### Raft Protocol Version Compatibility
 
 When upgrading to Nomad 0.8.0 from a version lower than 0.7.0, users will need
-to set the [`raft_protocol`](/docs/configuration/server#raft_protocol) option in
+to set the [`raft_protocol`] option in
 their `server` stanza to 1 in order to maintain backwards compatibility with the
 old servers during the upgrade. After the servers have been migrated to version
 0.8.0, `raft_protocol` can be moved up to 2 and the servers restarted to match
@@ -1013,24 +1022,39 @@ as shown in commands like `nomad server members`. To see the version of the Raft
 protocol in use on each server, use the `nomad operator raft list-peers`
 command.
 
-The easiest way to upgrade servers is to have each server leave the cluster,
-upgrade its `raft_protocol` version in the `server` stanza, and then add it
-back. Make sure the new server joins successfully and that the cluster is stable
-before rolling the upgrade forward to the next server. It's also possible to
-stand up a new set of servers, and then slowly stand down each of the older
-servers in a similar fashion.
-
 When using Raft protocol version 3, servers are identified by their `node-id`
 instead of their IP address when Nomad makes changes to its internal Raft quorum
 configuration.
This means that once a cluster has been upgraded with servers all
 running Raft protocol version 3, it will no longer allow servers running any
-older Raft protocol versions to be added. If running a single Nomad server,
-restarting it in-place will result in that server not being able to elect itself
-as a leader. To avoid this, either set the Raft protocol back to 2, or use
-[Manual Recovery Using
+older Raft protocol versions to be added.
+
+~> **Warning:** If you are running a single Nomad server, restarting it
+in-place will result in that server not being able to elect itself as
+a leader. To avoid this, either set the Raft protocol back to 2, or
+use [Manual Recovery Using
 peers.json](https://learn.hashicorp.com/tutorials/nomad/outage-recovery#manual-recovery-using-peersjson)
 to map the server to its node ID in the Raft quorum configuration.
 
+The easiest way to upgrade servers is to have each server leave the cluster,
+upgrade its [`raft_protocol`] version in the `server` stanza, and then add it
+back. Make sure the new server joins successfully and that the cluster is stable
+before rolling the upgrade forward to the next server. It's also possible to
+stand up a new set of servers, and then slowly stand down each of the older
+servers in a similar fashion.
+
+For in-place raft protocol upgrades, perform the following for each server:
+
+* Stop the server
+* Run `nomad server force-leave $server_name`
+* Update the `raft_protocol` in the server's configuration file to 3.
+* Restart the server
+* Run `nomad operator raft list-peers` to verify that the `raft_vsn`
+  for the server is now 3.
+* On the server, run `nomad agent-info` and check that the
+  `last_log_index` is of a similar value to the other servers. This
+  step ensures that raft is healthy and changes are replicating to the
+  new server.
+
 ### Node Draining Improvements
 
 Node draining via the [`node drain`][drain-cli] command or the [drain
@@ -1224,6 +1248,8 @@ deleted and then Nomad 0.3.0 can be launched.
 [preemption]: /docs/internals/scheduling/preemption
 [proxy_concurrency]: /docs/job-specification/sidecar_task#proxy_concurrency
 [`sidecar_task.config`]: /docs/job-specification/sidecar_task#config
+[raft protocol version]: /docs/configuration/server#raft_protocol
+[`raft protocol`]: /docs/configuration/server#raft_protocol
 [reserved]: /docs/configuration/client#reserved-parameters
 [task-config]: /docs/job-specification/task#config
 [tls-guide]: https://learn.hashicorp.com/tutorials/nomad/security-enable-tls
@@ -1248,3 +1274,4 @@ deleted and then Nomad 0.3.0 can be launched.
 [cap_add_exec]: /docs/drivers/exec#cap_add
 [cap_drop_exec]: /docs/drivers/exec#cap_drop
 [`log_file`]: /docs/configuration#log_file
+[Upgrading to Raft Protocol 3]: /docs/upgrade/upgrade-specific#upgrading-to-raft-protocol-3

From b8c48e770cff8a0b46462a3d291317445b659263 Mon Sep 17 00:00:00 2001
From: Tim Gross
Date: Mon, 29 Nov 2021 08:52:51 -0500
Subject: [PATCH 2/3] recommend updating the leader last

---
 website/content/docs/upgrade/upgrade-specific.mdx | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/website/content/docs/upgrade/upgrade-specific.mdx b/website/content/docs/upgrade/upgrade-specific.mdx
index e9967f98919..84e12344a93 100644
--- a/website/content/docs/upgrade/upgrade-specific.mdx
+++ b/website/content/docs/upgrade/upgrade-specific.mdx
@@ -1042,7 +1042,9 @@ before rolling the upgrade forward to the next server. It's also possible to
 stand up a new set of servers, and then slowly stand down each of the older
 servers in a similar fashion.
 
-For in-place raft protocol upgrades, perform the following for each server:
+For in-place raft protocol upgrades, perform the following for each
+server, leaving the leader until last to reduce the chance of leader
+elections that will slow down the process:
 
 * Stop the server
 * Run `nomad server force-leave $server_name`

From 12c762e156dafde3af74b845936a75b0916a3e44 Mon Sep 17 00:00:00 2001
From: Tim Gross
Date: Thu, 2 Dec 2021 13:56:16 -0500
Subject: [PATCH 3/3] documentation improvements from code review, changelog

---
 .changelog/11572.txt                          |  7 ++
 website/content/docs/upgrade/index.mdx        | 80 +++++++++++++++++++
 .../content/docs/upgrade/upgrade-specific.mdx | 66 +++------------
 3 files changed, 96 insertions(+), 57 deletions(-)
 create mode 100644 .changelog/11572.txt

diff --git a/.changelog/11572.txt b/.changelog/11572.txt
new file mode 100644
index 00000000000..04921537b80
--- /dev/null
+++ b/.changelog/11572.txt
@@ -0,0 +1,7 @@
+```release-note:improvement
+raft: The default raft protocol version is now 3.
+```
+
+```release-note:deprecation
+Raft protocol version 2 is deprecated and will be removed in Nomad 1.4.0.
+```
diff --git a/website/content/docs/upgrade/index.mdx b/website/content/docs/upgrade/index.mdx
index fc16b1ce3c9..ec762aabc6c 100644
--- a/website/content/docs/upgrade/index.mdx
+++ b/website/content/docs/upgrade/index.mdx
@@ -153,3 +153,83 @@ differences may require specific steps.
 [node-status]: /docs/commands/node/status
 [server-members]: /docs/commands/server/members
 [upgrade-specific]: /docs/upgrade/upgrade-specific
+
+## Upgrading to Raft Protocol 3
+
+This section provides details on upgrading to Raft Protocol 3. Raft
+protocol version 3 requires Nomad running 0.8.0 or newer on all
+servers in order to work. Raft protocol version 2 will be removed in
+Nomad 1.4.0.
+
+To see the version of the Raft protocol in use on each server, use the
+`nomad operator raft list-peers` command.
+
+Note that the format of `peers.json` used for outage recovery is
+different when running with the latest Raft protocol. See [Manual
+Recovery Using
+peers.json](https://learn.hashicorp.com/tutorials/nomad/outage-recovery#manual-recovery-using-peersjson)
+for a description of the required format.
+
+When using Raft protocol version 3, servers are identified by their
+`node-id` instead of their IP address when Nomad makes changes to its
+internal Raft quorum configuration. This means that once a cluster has
+been upgraded with servers all running Raft protocol version 3, it
+will no longer allow servers running any older Raft protocol versions
+to be added.
+
+### Upgrading a Production Cluster to Raft Version 3
+
+For production raft clusters with 3 or more members, the easiest way
+to upgrade servers is to have each server leave the cluster, upgrade
+its [`raft_protocol`] version in the `server` stanza, and then add it
+back. Make sure the new server joins successfully and that the cluster
+is stable before rolling the upgrade forward to the next server. It's
+also possible to stand up a new set of servers, and then slowly stand
+down each of the older servers in a similar fashion.
+
+For in-place raft protocol upgrades, perform the following for each
+server, leaving the leader until last to reduce the chance of leader
+elections that will slow down the process:
+
+* Stop the server
+* Run `nomad server force-leave $server_name`
+* Update the `raft_protocol` in the server's configuration file to 3.
+* Restart the server
+* Run `nomad operator raft list-peers` to verify that the `raft_vsn`
+  for the server is now 3.
+* On the server, run `nomad agent-info` and check that the
+  `last_log_index` is of a similar value to the other servers. This
+  step ensures that raft is healthy and changes are replicating to the
+  new server.
+
+### Upgrading a Single Server Cluster to Raft Version 3
+
+If you are running a single Nomad server, restarting it in-place will
+result in that server not being able to elect itself as a leader. To
+avoid this, create a new [`raft.peers`][peers-json] file before
+restarting the server with the new configuration. If you have `jq`
+installed you can run the following script on the server's host to
+write the correct `raft.peers` file:
+
+```
+#!/usr/bin/env bash
+
+NOMAD_DATA_DIR=$(nomad agent-info -json | jq -r '.config.DataDir')
+NOMAD_ADDR=$(nomad agent-info -json | jq -r '.stats.nomad.leader_addr')
+NODE_ID=$(cat "$NOMAD_DATA_DIR/server/node-id")
+
+cat <<EOF > "$NOMAD_DATA_DIR/server/raft/peers.json"
+[
+  {
+    "id": "$NODE_ID",
+    "address": "$NOMAD_ADDR",
+    "non_voter": false
+  }
+]
+EOF
+```
+
+After running this script, update the `raft_protocol` in the server's
+configuration to 3 and restart the server.
+
+[peers-json]: https://learn.hashicorp.com/tutorials/nomad/outage-recovery#manual-recovery-using-peersjson
diff --git a/website/content/docs/upgrade/upgrade-specific.mdx b/website/content/docs/upgrade/upgrade-specific.mdx
index 84e12344a93..472e23d929f 100644
--- a/website/content/docs/upgrade/upgrade-specific.mdx
+++ b/website/content/docs/upgrade/upgrade-specific.mdx
@@ -15,12 +15,15 @@ used to document those details separately from the standard upgrade flow.
 
 ## Nomad 1.3.0
 
-#### Default Raft Protocol Version
+#### Raft Protocol Version 2 Deprecation
 
-In Nomad 1.3.0, the default raft protocol version has been updated
-to 3. If the [`raft_protocol_version`] is not explicitly set,
-upgrading a server will automatically upgrade that server's raft
-protocol. See the [Upgrading to Raft Protocol 3] guide below.
+Raft protocol version 2 will be removed from Nomad in the next major
+release of Nomad, 1.4.0.
+
+In Nomad 1.3.0, the default raft protocol version has been updated to
+3.
If the [`raft_protocol_version`] is not explicitly set, upgrading a
+server will automatically upgrade that server's raft protocol. See the
+[Upgrading to Raft Protocol 3] guide.
 
 ## Nomad 1.2.4
 
@@ -1006,57 +1009,6 @@ In order to enable
 all servers in a Nomad cluster must be running with
 Raft protocol version 3 or later.
 
-#### Upgrading to Raft Protocol 3
-
-This section provides details on upgrading to Raft Protocol 3 in Nomad 0.8 and
-higher. Raft protocol version 3 requires Nomad running 0.8.0 or newer on all
-servers in order to work. See [Raft Protocol Version
-Compatibility](/docs/upgrade/upgrade-specific#raft-protocol-version-compatibility)
-for more details. Also the format of `peers.json` used for outage recovery is
-different when running with the latest Raft protocol. See [Manual Recovery Using
-peers.json](https://learn.hashicorp.com/tutorials/nomad/outage-recovery#manual-recovery-using-peersjson)
-for a description of the required format.
-
-Please note that the Raft protocol is different from Nomad's internal protocol
-as shown in commands like `nomad server members`. To see the version of the Raft
-protocol in use on each server, use the `nomad operator raft list-peers`
-command.
-
-When using Raft protocol version 3, servers are identified by their `node-id`
-instead of their IP address when Nomad makes changes to its internal Raft quorum
-configuration. This means that once a cluster has been upgraded with servers all
-running Raft protocol version 3, it will no longer allow servers running any
-older Raft protocol versions to be added.
-
-~> **Warning:** If you are running a single Nomad server, restarting it
-in-place will result in that server not being able to elect itself as
-a leader. To avoid this, either set the Raft protocol back to 2, or
-use [Manual Recovery Using
-peers.json](https://learn.hashicorp.com/tutorials/nomad/outage-recovery#manual-recovery-using-peersjson)
-to map the server to its node ID in the Raft quorum configuration.
-
-The easiest way to upgrade servers is to have each server leave the cluster,
-upgrade its [`raft_protocol`] version in the `server` stanza, and then add it
-back. Make sure the new server joins successfully and that the cluster is stable
-before rolling the upgrade forward to the next server. It's also possible to
-stand up a new set of servers, and then slowly stand down each of the older
-servers in a similar fashion.
-
-For in-place raft protocol upgrades, perform the following for each
-server, leaving the leader until last to reduce the chance of leader
-elections that will slow down the process:
-
-* Stop the server
-* Run `nomad server force-leave $server_name`
-* Update the `raft_protocol` in the server's configuration file to 3.
-* Restart the server
-* Run `nomad operator raft list-peers` to verify that the `raft_vsn`
-  for the server is now 3.
-* On the server, run `nomad agent-info` and check that the
-  `last_log_index` is of a similar value to the other servers. This
-  step ensures that raft is healthy and changes are replicating to the
-  new server.
-
 ### Node Draining Improvements
 
 Node draining via the [`node drain`][drain-cli] command or the [drain
@@ -1276,4 +1228,4 @@ deleted and then Nomad 0.3.0 can be launched.
 [cap_add_exec]: /docs/drivers/exec#cap_add
 [cap_drop_exec]: /docs/drivers/exec#cap_drop
 [`log_file`]: /docs/configuration#log_file
-[Upgrading to Raft Protocol 3]: /docs/upgrade/upgrade-specific#upgrading-to-raft-protocol-3
+[Upgrading to Raft Protocol 3]: /docs/upgrade#upgrading-to-raft-protocol-3