
Relocating tasks from failed leader node takes 5 minutes #1747

Closed

capone212 opened this issue Sep 26, 2016 · 6 comments

Comments

capone212 (Contributor) commented Sep 26, 2016

Nomad version

Output from nomad version
v0.4.1

Operating system and Environment details

3 Windows nodes in a local network

Issue

I have 3 nodes running Nomad and Consul, with Nomad in both server and client mode. When I power down the leader node, consul members and nomad server-members show the node as down after 10-15 seconds, but nomad node-status only detects that the node is down after 5+ minutes. Everything is OK after 5 minutes. This is a problem for me because Nomad will only try to move the dead node's tasks after those 5 minutes.

Reproduction steps

Start 3 nodes with both server and client mode.
Run "nomad server-members" to find out which node is the current leader.
Power down the leader node. Monitor task states and confirm that the task is moved to a healthy node only after 5 minutes.

Attaching logs from the remaining Nomad nodes in the cluster.
https://gist.github.com/capone212/fd2c1bd2a1218c0eca94aaff84ac0e2b
https://gist.github.com/capone212/11510c357a1ea50f8134b1f8699cfdbb

From the first link you can see the events.
The following is the first error logged after the other node went down:

2016/09/26 13:10:51 [ERR] raft-net: Failed to decode incoming command: read tcp 10.0.37.170:4647->10.0.36.71:12185: wsarecv: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
    2016/09/26 13:10:51.236711 [DEBUG] consul.syncer: error in syncing: 2 error(s) occurred:

* Get http://127.0.0.1:8500/v1/agent/services: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
* Get http://127.0.0.1:8500/v1/agent/checks: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

And the following is when task relocation was performed by Nomad:


    2016/09/26 13:15:34.815484 [DEBUG] http: Request /v1/job/ngprunner.ArturNode2/evaluations (2.0011ms)
    2016/09/26 13:15:34.819487 [DEBUG] http: Request /v1/job/ngprunner.ArturNode2 (999.7µs)
    2016/09/26 13:15:34.825492 [DEBUG] http: Request /v1/job/ngprunner.ArturNode2/allocations (992.7µs)
    2016/09/26 13:15:35.636425 [DEBUG] worker: dequeued evaluation 1f6bb610-f270-0fb9-44d1-d6696ff94fd1
    2016/09/26 13:15:35.636425 [DEBUG] sched.core: job GC: scanning before index 0 (4h0m0s)
    2016/09/26 13:15:35.636425 [DEBUG] worker: ack for evaluation 1f6bb610-f270-0fb9-44d1-d6696ff94fd1
    2016/09/26 13:15:35.777054 [DEBUG] worker: dequeued evaluation d0a2c4c3-4e40-d3d1-a0ac-07a2563a6239
    2016/09/26 13:15:35.890481 [DEBUG] sched: <Eval 'd0a2c4c3-4e40-d3d1-a0ac-07a2563a6239' JobID: 'pgrunner.ciJZ7ZVDjL.EYSC20160111B'>: allocs: (place 0) (update 0) (migrate 0) (stop 0) (ignore 0) (lost 1)
true v1:8268655238334704494
true v1:6388994533039628171
    2016/09/26 13:15:35.999859 [DEBUG] worker: created evaluation <Eval 'e45bf4bb-8725-359d-8e9b-8a484e12fdae' JobID: 'pgrunner.ciJZ7ZVDjL.EYSC20160111B'>
    2016/09/26 13:15:35.999859 [DEBUG] sched: <Eval 'd0a2c4c3-4e40-d3d1-a0ac-07a2563a6239' JobID: 'pgrunner.ciJZ7ZVDjL.EYSC20160111B'>: failed to place all allocations, blocked eval 'e45bf4bb-8725-359d-8e9b-8a484e12fdae' created
    2016/09/26 13:15:36.062360 [DEBUG] worker: submitted plan for evaluation d0a2c4c3-4e40-d3d1-a0ac-07a2563a6239
    2016/09/26 13:15:36.062360 [DEBUG] sched: <Eval 'd0a2c4c3-4e40-d3d1-a0ac-07a2563a6239' JobID: 'pgrunner.ciJZ7ZVDjL.EYSC20160111B'>: setting status to complete
    2016/09/26 13:15:36.124863 [DEBUG] worker: updated evaluation <Eval 'd0a2c4c3-4e40-d3d1-a0ac-07a2563a6239' JobID: 'pgrunner.ciJZ7ZVDjL.EYSC20160111B'>
    2016/09/26 13:15:36.140474 [DEBUG] worker: ack for evaluation d0a2c4c3-4e40-d3d1-a0ac-07a2563a6239
capone212 (Contributor Author) commented

Hi @dadgar, I have updates on this issue. I ran additional tests and found that the issue reproduces 100% of the time when we intentionally kill the current leader in the Nomad peers list.

capone212 changed the title Relocating tasks from failed node takes 5 minutes Relocating tasks from failed LEADER node takes 5 minutes Sep 30, 2016
capone212 changed the title Relocating tasks from failed LEADER node takes 5 minutes Relocating tasks from failed leader node takes 5 minutes Sep 30, 2016
capone212 (Contributor Author) commented

Updated the task. Hope this new data helps. It would be great to have this fixed in the 0.5 release. Reproducible in 100% of cases.

capone212 (Contributor Author) commented Oct 6, 2016

Hi, I really need to fix this issue; I only have a few days and can't wait for the v0.5.1 release. I spent some time and found the reason:
After a new leader is elected, Nomad uses FailoverHeartbeatTTL in the method initializeHeartbeatTimers (file heartbeat.go). Here is the comment from the config field:

    // FailoverHeartbeatTTL is the TTL applied to heartbeats after
    // a new leader is elected, since we no longer know the status
    // of all the heartbeats.
    FailoverHeartbeatTTL time.Duration

And this value happens to equal 5 minutes.
The problematic behavior is now apparent: after a new leader is elected, failed nodes can only be detected after 5 minutes.
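
To make the mechanism concrete, here is a minimal sketch of the behavior described above, assuming a simplified Server type. The function and field names mirror the real ones (initializeHeartbeatTimers, FailoverHeartbeatTTL), but the body is illustrative and not Nomad's actual implementation:

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    // Config holds the relevant server setting; 5 minutes is the default
    // value observed in this issue.
    type Config struct {
        FailoverHeartbeatTTL time.Duration
    }

    type Server struct {
        config              Config
        heartbeatTimersLock sync.Mutex
        heartbeatTimers     map[string]*time.Timer
    }

    // initializeHeartbeatTimers runs when this server becomes leader. Because
    // the new leader does not know when each client last heartbeated, every
    // node is given the long failover TTL before it can be marked down.
    func (s *Server) initializeHeartbeatTimers(nodeIDs []string) {
        s.heartbeatTimersLock.Lock()
        defer s.heartbeatTimersLock.Unlock()
        if s.heartbeatTimers == nil {
            s.heartbeatTimers = make(map[string]*time.Timer)
        }
        for _, id := range nodeIDs {
            nodeID := id
            s.heartbeatTimers[nodeID] = time.AfterFunc(s.config.FailoverHeartbeatTTL, func() {
                // In Nomad this would set the node status to "down" and create
                // evaluations that migrate its allocations elsewhere.
                fmt.Printf("node %s missed its heartbeat window, marking down\n", nodeID)
            })
        }
    }

    func main() {
        // TTL shortened here only so the demo finishes quickly; the issue is
        // about the 5 minute default.
        s := &Server{config: Config{FailoverHeartbeatTTL: 2 * time.Second}}
        s.initializeHeartbeatTimers([]string{"node-1", "node-2"})
        time.Sleep(3 * time.Second)
    }

A node that fails right around the leader transition therefore sits in this window for the full FailoverHeartbeatTTL before its allocations are rescheduled.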

@dadgar @diptanu Do you have suggestions other than tuning FailoverHeartbeatTTL down to something like 30-40 seconds? I can afford at most 1 minute of downtime.

dadgar (Contributor) commented Oct 6, 2016

@capone212 This wouldn't be a problem if you didn't run the Server in Client mode. If you tune that value down you are much more likely to get false positives and kill tasks that didn't need to be killed.

capone212 (Contributor Author) commented

Hi @dadgar, thanks for the reply. I agree, that makes sense.
I am closing this issue, but it would be great to have FailoverHeartbeatTTL and MinHeartbeatTTL configurable.
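
For illustration, a minimal sketch of how such a knob could be plumbed through, assuming a hypothetical failover heartbeat TTL setting in the agent's server configuration; the field and function names here are assumptions for the sketch, not Nomad's actual configuration surface:

    package main

    import (
        "fmt"
        "time"
    )

    // defaultFailoverHeartbeatTTL matches the 5 minute value discussed above.
    const defaultFailoverHeartbeatTTL = 5 * time.Minute

    // AgentServerConfig is a hypothetical stand-in for the user-facing server
    // stanza; FailoverHeartbeatTTL is optional and empty means "use the default".
    type AgentServerConfig struct {
        FailoverHeartbeatTTL string
    }

    // ServerConfig is a stand-in for the internal server configuration that
    // the heartbeat logic would read.
    type ServerConfig struct {
        FailoverHeartbeatTTL time.Duration
    }

    func buildServerConfig(agent AgentServerConfig) (ServerConfig, error) {
        cfg := ServerConfig{FailoverHeartbeatTTL: defaultFailoverHeartbeatTTL}
        if agent.FailoverHeartbeatTTL != "" {
            d, err := time.ParseDuration(agent.FailoverHeartbeatTTL)
            if err != nil {
                return cfg, fmt.Errorf("invalid failover heartbeat TTL: %v", err)
            }
            cfg.FailoverHeartbeatTTL = d
        }
        return cfg, nil
    }

    func main() {
        // A user who can only tolerate ~1 minute of downtime could set this
        // lower, accepting the false-positive risk @dadgar mentions above.
        cfg, err := buildServerConfig(AgentServerConfig{FailoverHeartbeatTTL: "1m"})
        if err != nil {
            panic(err)
        }
        fmt.Println("failover heartbeat TTL:", cfg.FailoverHeartbeatTTL)
    }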

tgross modified the milestones: near-term, unscheduled Jan 9, 2020
github-actions bot commented

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 14, 2022