Relocating tasks from failed leader node takes 5 minutes #1747
Comments
Hi @dadgar, I have an update on this issue. I ran additional tests and found that the issue reproduces 100% of the time when we intentionally kill the current leader in the Nomad peers list.
Updated the task. I hope this new data helps. It would be great to have this fixed in the 0.5 release. Reproducible in 100% of cases.
Hi, I really need to fix this issue; I have only a few days and can't wait for the v0.5.1 release. I spent some time and found the reason:
And this value happens to equal 5 minutes. @dadgar @diptanu Do you have suggestions other than tuning FailoverHeartbeatTTL to something like 30-40 seconds? I can afford 1 minute of downtime at most.
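To illustrate the reasoning above, here is a rough toy model, not Nomad's actual code. It assumes that a newly elected leader gives every client a fresh heartbeat timer equal to FailoverHeartbeatTTL, so a client that died together with the old leader is only noticed once that timer expires. The function names and the steady-state TTL value are hypothetical; only the 5-minute figure comes from this thread.

```python
from datetime import timedelta

# From the thread: FailoverHeartbeatTTL happens to equal 5 minutes.
FAILOVER_HEARTBEAT_TTL = timedelta(minutes=5)
# Hypothetical steady-state TTL; this value is an assumption, not taken from the thread.
STEADY_STATE_TTL = timedelta(seconds=15)


def detection_delay(leader_failed: bool,
                    steady_ttl: timedelta = STEADY_STATE_TTL,
                    failover_ttl: timedelta = FAILOVER_HEARTBEAT_TTL) -> timedelta:
    """Worst-case time before a dead client is marked down in this toy model."""
    # Assumption: after an election the new leader has no heartbeat history,
    # so every client gets a fresh timer of failover_ttl.
    return failover_ttl if leader_failed else steady_ttl


def fits_budget(ttl: timedelta, budget: timedelta) -> bool:
    """True if a given failover TTL keeps detection within the downtime budget."""
    return ttl <= budget


if __name__ == "__main__":
    # Leader (also running as a client) dies: detection waits for the failover TTL.
    print(detection_delay(leader_failed=True))
    # Would a 40-second failover TTL fit a 1-minute downtime budget?
    print(fits_budget(timedelta(seconds=40), timedelta(minutes=1)))
```

Under these assumptions, the 5-minute delay only shows up when the failed node was both the leader and a client, which matches what I see in my tests.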
@capone212 This wouldn't be a problem if you didn't run the server in client mode. If you tune that value down, you are much more likely to get false positives and kill tasks that didn't need to be killed.
Hi @dadgar, thanks for the reply. I agree, that makes sense.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Output from nomad version: v0.4.1
Operating system and Environment details
3 Windows nodes on a local network
Issue
I have 3 nodes running Nomad and Consul, with Nomad in both server and client mode. When I shut down the leader node, consul members and nomad server-members show that the node is down after 10-15 seconds, but nomad node-status detects that the node is down only after 5+ minutes. Everything is fine after those 5 minutes. This is a problem for me because Nomad tries to move the dead tasks only after 5 minutes.
Reproduction steps
Set up 3 nodes, each running Nomad in both server and client mode.
Run "nomad server-members" to find out which node is the current leader.
Shut down the leader node. Monitor the task states and confirm that the task is moved to a healthy node only after 5 minutes.
Attaching logs from the remaining Nomad nodes in the cluster.
https://gist.github.com/capone212/fd2c1bd2a1218c0eca94aaff84ac0e2b
https://gist.github.com/capone212/11510c357a1ea50f8134b1f8699cfdbb
From the first link you can see these events:
The following is the first error log after the other node went down
And the following is when Nomad relocated the task