
Relocating tasks from failed leader node takes 5 minutes #1747

Closed

capone212 opened this issue Sep 26, 2016 · 6 comments

Comments

capone212 (Contributor) commented Sep 26, 2016

Nomad version

Output from nomad version
v0.4.1

Operating system and Environment details

3 Windows nodes in a local network

Issue

I have 3 nodes running Nomad and Consul, with Nomad in both server and client mode. When I power down the leader node, consul members and nomad server-members show the node as down after 10-15 seconds, but nomad node-status only detects that the node is down after 5+ minutes. Everything is OK after 5 minutes. This is a problem for me because Nomad will only try to move the dead node's tasks after those 5 minutes.

Reproduction steps

Start 3 nodes with both server and client mode.
Run "nomad server-members" to find out which node is the current leader.
Power down the leader node. Monitor task states and confirm that the task is moved to a healthy node only after 5 minutes.

Attaching logs from the remaining Nomad nodes in the cluster.
https://gist.github.com/capone212/fd2c1bd2a1218c0eca94aaff84ac0e2b
https://gist.github.com/capone212/11510c357a1ea50f8134b1f8699cfdbb

From the first link you can see the events.
The following is the first error logged after the other node went down:

2016/09/26 13:10:51 [ERR] raft-net: Failed to decode incoming command: read tcp 10.0.37.170:4647->10.0.36.71:12185: wsarecv: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
    2016/09/26 13:10:51.236711 [DEBUG] consul.syncer: error in syncing: 2 error(s) occurred:

* Get http://127.0.0.1:8500/v1/agent/services: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
* Get http://127.0.0.1:8500/v1/agent/checks: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

And the following is when task relocation was performed by Nomad:


    2016/09/26 13:15:34.815484 [DEBUG] http: Request /v1/job/ngprunner.ArturNode2/evaluations (2.0011ms)
    2016/09/26 13:15:34.819487 [DEBUG] http: Request /v1/job/ngprunner.ArturNode2 (999.7µs)
    2016/09/26 13:15:34.825492 [DEBUG] http: Request /v1/job/ngprunner.ArturNode2/allocations (992.7µs)
    2016/09/26 13:15:35.636425 [DEBUG] worker: dequeued evaluation 1f6bb610-f270-0fb9-44d1-d6696ff94fd1
    2016/09/26 13:15:35.636425 [DEBUG] sched.core: job GC: scanning before index 0 (4h0m0s)
    2016/09/26 13:15:35.636425 [DEBUG] worker: ack for evaluation 1f6bb610-f270-0fb9-44d1-d6696ff94fd1
    2016/09/26 13:15:35.777054 [DEBUG] worker: dequeued evaluation d0a2c4c3-4e40-d3d1-a0ac-07a2563a6239
    2016/09/26 13:15:35.890481 [DEBUG] sched: <Eval 'd0a2c4c3-4e40-d3d1-a0ac-07a2563a6239' JobID: 'pgrunner.ciJZ7ZVDjL.EYSC20160111B'>: allocs: (place 0) (update 0) (migrate 0) (stop 0) (ignore 0) (lost 1)
true v1:8268655238334704494
true v1:6388994533039628171
    2016/09/26 13:15:35.999859 [DEBUG] worker: created evaluation <Eval 'e45bf4bb-8725-359d-8e9b-8a484e12fdae' JobID: 'pgrunner.ciJZ7ZVDjL.EYSC20160111B'>
    2016/09/26 13:15:35.999859 [DEBUG] sched: <Eval 'd0a2c4c3-4e40-d3d1-a0ac-07a2563a6239' JobID: 'pgrunner.ciJZ7ZVDjL.EYSC20160111B'>: failed to place all allocations, blocked eval 'e45bf4bb-8725-359d-8e9b-8a484e12fdae' created
    2016/09/26 13:15:36.062360 [DEBUG] worker: submitted plan for evaluation d0a2c4c3-4e40-d3d1-a0ac-07a2563a6239
    2016/09/26 13:15:36.062360 [DEBUG] sched: <Eval 'd0a2c4c3-4e40-d3d1-a0ac-07a2563a6239' JobID: 'pgrunner.ciJZ7ZVDjL.EYSC20160111B'>: setting status to complete
    2016/09/26 13:15:36.124863 [DEBUG] worker: updated evaluation <Eval 'd0a2c4c3-4e40-d3d1-a0ac-07a2563a6239' JobID: 'pgrunner.ciJZ7ZVDjL.EYSC20160111B'>
    2016/09/26 13:15:36.140474 [DEBUG] worker: ack for evaluation d0a2c4c3-4e40-d3d1-a0ac-07a2563a6239
capone212 (Contributor Author) commented

Hi @dadgar, I have updates on this issue. I ran additional tests and found that the issue reproduces 100% of the time when we intentionally kill the current leader in the Nomad peers list.

capone212 changed the title Relocating tasks from failed node takes 5 minutes Relocating tasks from failed LEADER node takes 5 minutes Sep 30, 2016
capone212 changed the title Relocating tasks from failed LEADER node takes 5 minutes Relocating tasks from failed leader node takes 5 minutes Sep 30, 2016
capone212 (Contributor Author) commented

Updated the task. Hope this new data helps. It would be great to have this fixed in the 0.5 release. Reproducible in 100% of cases.

capone212 (Contributor Author) commented Oct 6, 2016

Hi, I really need to fix this issue; I only have a few days and can't wait for the v0.5.1 release. I spent some time and found the reason:
After a new leader is elected, Nomad uses FailoverHeartbeatTTL in the method initializeHeartbeatTimers (file heartbeat.go). Here is the comment from the config field:

    // FailoverHeartbeatTTL is the TTL applied to heartbeats after
    // a new leader is elected, since we no longer know the status
    // of all the heartbeats.
    FailoverHeartbeatTTL time.Duration

And this value happens to equal 5 minutes.
The problematic behavior is now apparent: after a new leader is elected, failed nodes can only be detected after 5 minutes.
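
To make the mechanism concrete, here is a minimal sketch of the behavior described above, assuming a simplified Server type. The function and field names mirror the real ones (initializeHeartbeatTimers, FailoverHeartbeatTTL), but the body is illustrative and not Nomad's actual implementation:

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    // Config holds the relevant server setting; 5 minutes is the default
    // value observed in this issue.
    type Config struct {
        FailoverHeartbeatTTL time.Duration
    }

    type Server struct {
        config              Config
        heartbeatTimersLock sync.Mutex
        heartbeatTimers     map[string]*time.Timer
    }

    // initializeHeartbeatTimers runs when this server becomes leader. Because
    // the new leader does not know when each client last heartbeated, every
    // node is given the long failover TTL before it can be marked down.
    func (s *Server) initializeHeartbeatTimers(nodeIDs []string) {
        s.heartbeatTimersLock.Lock()
        defer s.heartbeatTimersLock.Unlock()
        if s.heartbeatTimers == nil {
            s.heartbeatTimers = make(map[string]*time.Timer)
        }
        for _, id := range nodeIDs {
            nodeID := id
            s.heartbeatTimers[nodeID] = time.AfterFunc(s.config.FailoverHeartbeatTTL, func() {
                // In Nomad this would set the node status to "down" and create
                // evaluations that migrate its allocations elsewhere.
                fmt.Printf("node %s missed its heartbeat window, marking down\n", nodeID)
            })
        }
    }

    func main() {
        // TTL shortened here only so the demo finishes quickly; the issue is
        // about the 5 minute default.
        s := &Server{config: Config{FailoverHeartbeatTTL: 2 * time.Second}}
        s.initializeHeartbeatTimers([]string{"node-1", "node-2"})
        time.Sleep(3 * time.Second)
    }

A node that fails right around the leader transition therefore sits in this window for the full FailoverHeartbeatTTL before its allocations are rescheduled.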

@dadgar @diptanu Do you have suggestions other than tuning FailoverHeartbeatTTL down to something like 30-40 seconds? I can afford at most 1 minute of downtime.

dadgar (Contributor) commented Oct 6, 2016

@capone212 This wouldn't be a problem if you didn't run the Server in Client mode. If you tune that value down you are much more likely to get false positives and kill tasks that didn't need to be killed.

capone212 (Contributor Author) commented

Hi @dadgar, thanks for the reply. I agree, that makes sense.
I am closing this issue, but it would be great to have FailoverHeartbeatTTL and MinHeartbeatTTL configurable.
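
For illustration, a minimal sketch of how such a knob could be plumbed through, assuming a hypothetical failover heartbeat TTL setting in the agent's server configuration; the field and function names here are assumptions for the sketch, not Nomad's actual configuration surface:

    package main

    import (
        "fmt"
        "time"
    )

    // defaultFailoverHeartbeatTTL matches the 5 minute value discussed above.
    const defaultFailoverHeartbeatTTL = 5 * time.Minute

    // AgentServerConfig is a hypothetical stand-in for the user-facing server
    // stanza; FailoverHeartbeatTTL is optional and empty means "use the default".
    type AgentServerConfig struct {
        FailoverHeartbeatTTL string
    }

    // ServerConfig is a stand-in for the internal server configuration that
    // the heartbeat logic would read.
    type ServerConfig struct {
        FailoverHeartbeatTTL time.Duration
    }

    func buildServerConfig(agent AgentServerConfig) (ServerConfig, error) {
        cfg := ServerConfig{FailoverHeartbeatTTL: defaultFailoverHeartbeatTTL}
        if agent.FailoverHeartbeatTTL != "" {
            d, err := time.ParseDuration(agent.FailoverHeartbeatTTL)
            if err != nil {
                return cfg, fmt.Errorf("invalid failover heartbeat TTL: %v", err)
            }
            cfg.FailoverHeartbeatTTL = d
        }
        return cfg, nil
    }

    func main() {
        // A user who can only tolerate ~1 minute of downtime could set this
        // lower, accepting the false-positive risk @dadgar mentions above.
        cfg, err := buildServerConfig(AgentServerConfig{FailoverHeartbeatTTL: "1m"})
        if err != nil {
            panic(err)
        }
        fmt.Println("failover heartbeat TTL:", cfg.FailoverHeartbeatTTL)
    }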

tgross modified the milestones: near-term, unscheduled Jan 9, 2020
github-actions bot commented

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 14, 2022