-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement raft multipler flag #8082
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! the tests seem to cover everything.
return nil, fmt.Errorf("raft_multiplier cannot be %d. Must be between 1 and %d", *agentConfig.Server.RaftMultiplier, MaxRaftMultiplier) | ||
} | ||
} | ||
conf.RaftConfig.ElectionTimeout *= time.Duration(raftMultiplier) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just checking my understanding: we've already parsed ElectionTimeout from the config, so that an operator can combine a specific ElectionTimeout and the multiplier, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost but not quite. ElectionTimeout
value is set earlier from the raft library hardcoded value. Without RaftMultipler
, operators don't have direct way to manipulate the value in any way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left an idea for documentation improvement, but it looks good otherwise!
@@ -141,6 +141,14 @@ server { | |||
features and is typically not required as the agent internally knows the | |||
latest version, but may be useful in some upgrade scenarios. | |||
|
|||
- `raft_multiplier` `(int: 1)` - An integer multiplier used by Nomad servers to | |||
scale key Raft timing parameters. Omitting this value or setting it to 0 uses | |||
default timing described below. Lower values are used to tighten timing and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"described below" - we don't describe it below.
I don't think this documentation from Consul works well for us because they set their raft_multiplier
to 5 while we set it to the minimum value of 1. There's no way to adjust Nomad to be more sensitive to faults.
If it's ok that Nomad cannot be tuned to be more sensitive (we're really locking ourselves in for the long term with this design), and I think that's ok, then maybe the docs should read:
raft_multiplier
(int: 1)
- An integer multiplier used by Nomad servers to scale key Raft timing parameters similar to Consul'sraft_multiplier
. Unlike Consul, Nomad defaults to a multiplier of1
, so timings can only be increased. Increasing the timings makes leader election less likely during periods of networking issues or resource starvation. Since leader elections pause Nomad's normal work, it may be beneficial for slow or unreliable networks to wait longer before electing a new leader. The tradeoff when raising this value is that during network partitions or other events (server crash) where a leader is lost, Nomad will not elect a new leader for a longer period of time than the default. Thenomad.nomad.leader.barrier'
metric is a good indicator of how often leader elections occur.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching this. I'll follow up. To clarify, Nomad's raft_multiplier=1 matches Consul's raft_multiplier=1 - i.e. nomad's minimum is consul's minimum. When Consul introduced raft_multipler flag in 0.7, they additionally changed the default timeouts to make it looser to accommodate dev instances - here, I didn't change the default, hence it stays at 1 - the equivalent of pre-0.12 values.
Does this mean it can be removed from Consul? |
Consul's default raft_multiplier is still 5. So if you would like Consul's raft to use the highest speed possible, you would still need to set it to 1 in your Consul configuration. Adding the setting to Nomad allows you to use a lower-performance timing that's suitable for minimal servers. This can facilitate the use of smaller class server instances for development clusters. The trade-off being that lower values of raft-multiplier are used to tighten timing and increase sensitivity while higher values relax timings and reduce sensitivity. Tuning this affects the time it takes Nomad to detect leader failures and to perform leader elections, at the expense of requiring more network and CPU resources for better performance. Because Nomad has always used a multiplier of 1, it was retained as the default. This setting can only be used to relax Nomad's default sensitivity. |
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
This ports
raft_multiplier
flag from Consul to Nomad: https://www.consul.io/docs/agent/options.html#raft_multiplier .It allows users to set tweak raft related timeouts and leases, which is important in low/moderate performance network or disks. Consul has some guides for raft_multipliers in https://www.consul.io/docs/install/performance.html - not sure how it fully maps to nomad yet, but at least we can expose the knob and give recommendations later.
This doesn't change the defaults, and operators can increase the defaults. I infer that the current defaults are adequate, as we haven't gotten many requests or support issues related to raft defaults.
Closes #5872 .