Add FailoverHeartbeatTTL to config #11127
FailoverHeartbeatTTL is the amount of time to wait after a server leader failure before considering reallocating client tasks. This TTL should be fairly long, as the new server leader needs to rebuild the entire heartbeat map for the cluster. In deployments with a small number of machines, the default TTL (5m) may be unnecessarily long. Let's allow operators to configure this value in their config files.
Bump. Any chance this can be looked at? This would be very useful for my use case.
… failover_heartbeat_ttl
Thank you for this work @mukerjee! And apologies for the delay on getting it reviewed.
I pushed a commit that adds a changelog entry and highlights the potential risks of modifying this configuration.
Excellent! Thank you @lgfa29! No worries about the delay. I see this is marked for v1.2.0. Any timeframe for that release yet?
No hard dates yet, but soon 🙂
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
In our use case we have a small number of machines (e.g., 7) in the same physical rack, connected with redundant networking (multiple NICs, multiple switches). It is prohibitively expensive for us to dedicate machines to being only Nomad servers (which would make FailoverHeartbeatTTL less impactful). In this use case, if heartbeats haven't been responded to within, e.g., 30s, the machine has almost certainly failed in some way. There is no need to wait for 5m.
This relates to #1747, where it was requested that MinHeartbeatTTL and FailoverHeartbeatTTL be made configurable. Since then, MinHeartbeatTTL has already been made configurable. This PR makes FailoverHeartbeatTTL configurable.
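
For illustration, here is a minimal sketch of how an operator might set this value in an agent config file. It assumes the option is exposed as failover_heartbeat_ttl inside the server block and accepts a duration string. The 30s value is taken from the small-cluster use case described above and is not a general recommendation.

```hcl
# Hypothetical agent configuration for a small, single-rack cluster.
server {
  enabled = true

  # Assumed option name and placement. The default is 5m; a shorter value
  # gives a newly elected leader less time to rebuild the heartbeat map
  # before client tasks are considered for reallocation.
  failover_heartbeat_ttl = "30s"
}
```

Lowering the value trades faster detection of failed clients after a leader failover against the risk of reallocating tasks from healthy clients that have not yet heartbeated to the new leader, so larger clusters would likely want to stay closer to the default.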