
Task is restarted on the same host even if it is not necessary #7607

Closed · samo1 opened this issue Apr 3, 2020 · 7 comments

samo1 commented Apr 3, 2020

Nomad version

Nomad v0.10.2 (0d2d6e3)

Operating system and Environment details

CentOS Linux release 7.7.1908 (Core)
on VMware vSphere

Issue

A task is restarted on the same host even though the restart is not necessary.

We have jobs with long startup times, so we try to minimize restarts.
This issue happens quite often in our system tests:

  1. A Nomad server host is powered off hard.
  2. We observe that Nomad triggers an evaluation internally after the server is lost.
  3. We can see the Nomad client kill the task and then start it again on the same host.

Reproduction steps

Topology: 3 Nomad + Consul servers, 4 client hosts.
Nomad jobs: about 10 jobs in total. We observe this issue in two jobs that run on two client hosts (2 + 2 allocations).

  1. Power off hard one of the Nomad server hosts.
  2. One of the client nodes kills the job tasks and starts them again immediately.

Job file

job "matcher0" {
  type = "service"
  group "matcher" {
    reschedule {
      delay          = "30s"
      delay_function = "constant"
      unlimited      = true
    }
    count = 2
    constraint {
      attribute = "10.12.4.201,10.12.4.193"
      operator = "set_contains"
      value = "${attr.unique.network.ip-address}"
    }
    constraint {
      distinct_hosts = true
    }
    task "matcher-numa0" {
      driver = "raw_exec"
      kill_timeout = "30s"
      config {
        command = "/bin/bash"
        args = ["-c", "exec ./out"]
      }
      resources {
        network {
          port "MATCHER" {
            static = 25780
          }
        }
      }
      service {
        name = "${TASK}"
        port = "MATCHER"
        check {
          name = "Matcher Check"
          type = "tcp"
          interval = "5s"
          timeout = "3s"
        }
      }
    }
  }
}

Nomad Client logs

The log shows the task being killed and, later, a new start being performed.

Mar 30 11:57:04 preview-11c nomad[17078]: 2020-03-30T11:57:04.738+0200 [ERROR] client: yamux: keepalive failed: i/o deadline reached
Mar 30 11:57:04 preview-11c nomad[17078]: 2020-03-30T11:57:04.738+0200 [ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.GetClientAllocs server=10.12.4.200:4647
Mar 30 11:57:04 preview-11c nomad[17078]: 2020-03-30T11:57:04.738+0200 [ERROR] client: error querying node allocations: error="rpc error: EOF"
Mar 30 11:57:04 preview-11c nomad[17078]: 2020-03-30T11:57:04.738+0200 [ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.UpdateStatus server=10.12.4.200:4647
Mar 30 11:57:04 preview-11c nomad[17078]: 2020-03-30T11:57:04.738+0200 [ERROR] client: error heartbeating. retrying: error="failed to update status: rpc error: EOF" period=1.682371739s
Mar 30 11:57:04 preview-11c nomad[17078]: 2020-03-30T11:57:04.738+0200 [DEBUG] client.server_mgr: error pinging server: error="rpc error: EOF" server=10.12.4.200:4647
Mar 30 11:57:04 preview-11c nomad[17078]: 2020-03-30T11:57:04.742+0200 [DEBUG] client.consul: bootstrap contacting Consul DCs: consul_dcs=[dc1]
Mar 30 11:57:04 preview-11c nomad[17078]: 2020-03-30T11:57:04.745+0200 [INFO]  client.consul: discovered following servers: servers=[10.12.4.200:4647, 10.12.4.193:4647, 10.12.4.201:4647]
Mar 30 11:57:06 preview-11c nomad[17078]: 2020-03-30T11:57:06.277+0200 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=179.019µs
Mar 30 11:57:06 preview-11c nomad[17078]: 2020-03-30T11:57:06.499+0200 [DEBUG] client: evaluations triggered by node update: num_evals=2
Mar 30 11:57:06 preview-11c nomad[17078]: 2020-03-30T11:57:06.499+0200 [TRACE] client: next heartbeat: period=19.440951493s
Mar 30 11:57:06 preview-11c nomad[17078]: 2020-03-30T11:57:06.499+0200 [DEBUG] client: state updated: node_status=ready
Mar 30 11:57:06 preview-11c nomad[17078]: 2020-03-30T11:57:06.499+0200 [WARN]  client: missed heartbeat: req_latency=78.853454ms heartbeat_ttl=13.554540118s since_last_heartbeat=43.86834076s
Mar 30 11:57:06 preview-11c nomad[17078]: 2020-03-30T11:57:06.500+0200 [DEBUG] client.server_mgr: new server list: new_servers=[10.12.4.193:4647, 10.12.4.201:4647] old_servers=[10.12.4.201:4647, 10.12.4.193:4647, 10.12.4.200:4647]
Mar 30 11:57:06 preview-11c nomad[17078]: 2020-03-30T11:57:06.503+0200 [TRACE] client: next heartbeat: period=13.782266138s
Mar 30 11:57:06 preview-11c nomad[17078]: 2020-03-30T11:57:06.503+0200 [DEBUG] client: updated allocations: index=189 total=2 pulled=2 filtered=0
Mar 30 11:57:06 preview-11c nomad[17078]: 2020-03-30T11:57:06.503+0200 [DEBUG] client: allocation updates: added=0 removed=0 updated=2 ignored=0
Mar 30 11:57:06 preview-11c nomad[17078]: 2020-03-30T11:57:06.503+0200 [TRACE] client: updating alloc: alloc_id=5d995c34-4a2e-a136-1724-145ecc3b76cc index=189
Mar 30 11:57:06 preview-11c nomad[17078]: 2020-03-30T11:57:06.505+0200 [TRACE] client: next heartbeat: period=18.81660875s
Mar 30 11:57:06 preview-11c nomad[17078]: 2020-03-30T11:57:06.511+0200 [TRACE] client: updating alloc: alloc_id=3234395e-e51d-1b26-832c-dd7d3a857844 index=186
Mar 30 11:57:06 preview-11c nomad[17078]: 2020-03-30T11:57:06.511+0200 [TRACE] client.alloc_runner.task_runner: Kill requested: alloc_id=5d995c34-4a2e-a136-1724-145ecc3b76cc task=matcher-numa1 event_type=Killing event_reason=
Mar 30 11:57:06 preview-11c nomad[17078]: 2020-03-30T11:57:06.511+0200 [TRACE] client.alloc_runner.task_runner: running pre kill hooks: alloc_id=5d995c34-4a2e-a136-1724-145ecc3b76cc task=matcher-numa1 start="2020-03-30 11:57:06.511719484 +0200 CEST m=+316.851034908"
Mar 30 11:57:06 preview-11c nomad[17078]: 2020-03-30T11:57:06.511+0200 [TRACE] client.alloc_runner.task_runner: running prekill hook: alloc_id=5d995c34-4a2e-a136-1724-145ecc3b76cc task=matcher-numa1 name=consul_services start="2020-03-30 11:57:06.511759221 +0200 CEST m=+316.851074642"
Mar 30 11:57:06 preview-11c nomad[17078]: 2020-03-30T11:57:06.511+0200 [TRACE] client.alloc_runner.task_runner: finished prekill hook: alloc_id=5d995c34-4a2e-a136-1724-145ecc3b76cc task=matcher-numa1 name=consul_services end="2020-03-30 11:57:06.511839901 +0200 CEST m=+316.851155318" duration=80.676µs
Mar 30 11:57:06 preview-11c nomad[17078]: 2020-03-30T11:57:06.511+0200 [TRACE] client.alloc_runner.task_runner: finished pre kill hooks: alloc_id=5d995c34-4a2e-a136-1724-145ecc3b76cc task=matcher-numa1 end="2020-03-30 11:57:06.511856966 +0200 CEST m=+316.851172392" duration=137.484µs
Mar 30 11:57:06 preview-11c nomad[17078]: 2020-03-30T11:57:06.515+0200 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=2 ignored=0 errors=0
Mar 30 11:57:06 preview-11c nomad[17078]: 2020-03-30T11:57:06.564+0200 [TRACE] client: next heartbeat: period=17.538887124s
Mar 30 11:57:06 preview-11c nomad[17078]: 2020-03-30T11:57:06.564+0200 [DEBUG] client: updated allocations: index=198 total=4 pulled=1 filtered=3
Mar 30 11:57:06 preview-11c nomad[17078]: 2020-03-30T11:57:06.565+0200 [DEBUG] client: allocation updates: added=1 removed=0 updated=0 ignored=3
Mar 30 11:57:06 preview-11c nomad[17078]: 2020-03-30T11:57:06.565+0200 [TRACE] client: next heartbeat: period=17.843415895s
Mar 30 11:57:06 preview-11c nomad[17078]: 2020-03-30T11:57:06.567+0200 [TRACE] client.alloc_runner.task_runner: finished prestart hook: alloc_id=e228f582-27df-1308-7cfd-73450462951d task=matcher-numa1 name=validate end="2020-03-30 11:57:06.567131009 +0200 CEST m=+316.906446450" duration=7.399973ms
Mar 30 11:57:06 preview-11c nomad[17078]: 2020-03-30T11:57:06.567+0200 [TRACE] client.alloc_runner.task_runner: running prestart hook: alloc_id=e228f582-27df-1308-7cfd-73450462951d task=matcher-numa1 name=task_dir start="2020-03-30 11:57:06.567175574 +0200 CEST m=+316.906490998"
Mar 30 11:57:06 preview-11c nomad[17078]: 2020-03-30T11:57:06.573+0200 [TRACE] client.alloc_runner: handling task state update: alloc_id=e228f582-27df-1308-7cfd-73450462951d done=false
Mar 30 11:57:06 preview-11c nomad[17078]: 2020-03-30T11:57:06.577+0200 [DEBUG] client: allocation updates applied: added=1 removed=0 updated=0 ignored=3 errors=0
Mar 30 11:57:06 preview-11c nomad[17078]: 2020-03-30T11:57:06.579+0200 [TRACE] client.alloc_runner: running pre-run hooks: alloc_id=5140d3db-32a7-996f-015c-bd62d8ed3fce start="2020-03-30 11:57:06.579728851 +0200 CEST m=+316.919044301"

Nomad Server logs

The parts of the server logs that look most relevant:

Mar 30 11:56:32 preview-6c nomad[17090]: 2020-03-30T11:56:32.404+0200 [DEBUG] nomad: memberlist: Failed ping: preview-5c.global (timeout reached)
Mar 30 11:56:32 preview-6c nomad[17090]: 2020-03-30T11:56:32.735+0200 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=591.471µs
Mar 30 11:56:34 preview-6c nomad[17090]: 2020-03-30T11:56:34.404+0200 [INFO]  nomad: memberlist: Suspect preview-5c.global has failed, no acks received

Mar 30 11:56:46 preview-6c nomad[17090]: 2020-03-30T11:56:46.217+0200 [DEBUG] worker.service_sched: reconciled current state with desired state: eval_id=1f749461-dd62-b42b-ac2a-7fe109eecd66 job_id=matcher1 namespace=default results="Total changes: (place 1) (destructive 0) (inplace 0) (stop 1)
Mar 30 11:56:46 preview-6c nomad[17090]: Desired Changes for "matcher": (place 1) (inplace 0) (destructive 0) (stop 1) (migrate 0) (ignore 1) (canary 0)"
Mar 30 11:56:46 preview-6c nomad[17090]: 2020-03-30T11:56:46.232+0200 [DEBUG] worker: created evaluation: eval="<Eval "c1619b41-a7fb-b8cc-a7f7-249883423cba" JobID: "matcher1" Namespace: "default">"
Mar 30 11:56:46 preview-6c nomad[17090]: 2020-03-30T11:56:46.232+0200 [DEBUG] worker.service_sched: failed to place all allocations, blocked eval created: eval_id=1f749461-dd62-b42b-ac2a-7fe109eecd66 job_id=matcher1 namespace=default blocked_eval_id=c1619b41-a7fb-b8cc-a7f7-249883423cba
Mar 30 11:56:46 preview-6c nomad[17090]: 2020-03-30T11:56:46.237+0200 [DEBUG] worker: submitted plan for evaluation: eval_id=1f749461-dd62-b42b-ac2a-7fe109eecd66
Mar 30 11:56:46 preview-6c nomad[17090]: 2020-03-30T11:56:46.237+0200 [DEBUG] worker.service_sched: setting eval status: eval_id=1f749461-dd62-b42b-ac2a-7fe109eecd66 job_id=matcher1 namespace=default status=complete
Mar 30 11:56:46 preview-6c nomad[17090]: 2020-03-30T11:56:46.246+0200 [DEBUG] worker: updated evaluation: eval="<Eval "1f749461-dd62-b42b-ac2a-7fe109eecd66" JobID: "matcher1" Namespace: "default">"
Mar 30 11:56:46 preview-6c nomad[17090]: 2020-03-30T11:56:46.246+0200 [DEBUG] client: updated allocations: index=189 total=0 pulled=0 filtered=0
Mar 30 11:56:46 preview-7c nomad[17058]: 2020-03-30T11:56:46.196+0200 [DEBUG] worker: dequeued evaluation: eval_id=76faedb7-78a0-1044-1799-812878196617
Mar 30 11:56:46 preview-7c nomad[17058]: 2020-03-30T11:56:46.196+0200 [DEBUG] worker.service_sched: reconciled current state with desired state: eval_id=76faedb7-78a0-1044-1799-812878196617 job_id=matcher0 namespace=default results="Total changes: (place 1) (destructive 0) (inplace 0) (stop 1)
Mar 30 11:56:46 preview-7c nomad[17058]: Desired Changes for "matcher": (place 1) (inplace 0) (destructive 0) (stop 1) (migrate 0) (ignore 1) (canary 0)"
Mar 30 11:56:46 preview-7c nomad[17058]: 2020-03-30T11:56:46.200+0200 [DEBUG] worker: created evaluation: eval="<Eval "c9e5d345-ec56-7c2b-b512-ccbb142dd44f" JobID: "matcher0" Namespace: "default">"
Mar 30 11:56:46 preview-7c nomad[17058]: 2020-03-30T11:56:46.200+0200 [DEBUG] worker.service_sched: failed to place all allocations, blocked eval created: eval_id=76faedb7-78a0-1044-1799-812878196617 job_id=matcher0 namespace=default blocked_eval_id=c9e5d345-ec56-7c2b-b512-ccbb142dd44f
Mar 30 11:56:46 preview-7c nomad[17058]: 2020-03-30T11:56:46.204+0200 [DEBUG] worker: submitted plan for evaluation: eval_id=76faedb7-78a0-1044-1799-812878196617
Mar 30 11:56:46 preview-7c nomad[17058]: 2020-03-30T11:56:46.204+0200 [DEBUG] worker.service_sched: setting eval status: eval_id=76faedb7-78a0-1044-1799-812878196617 job_id=matcher0 namespace=default status=complete
Mar 30 11:56:46 preview-7c nomad[17058]: 2020-03-30T11:56:46.208+0200 [DEBUG] client: updated allocations: index=186 total=0 pulled=0 filtered=0
tgross (Member) commented Apr 7, 2020

Hi @samo1 and thanks for reporting this! I want to make sure I understand the scenario correctly...

  1. Power off hard one of the Nomad server hosts.
  2. One of the client nodes kills the job tasks and starts them again immediately.

One of the servers is powering off and this is causing one of the clients to restart jobs? Or are you saying that while one of the servers is powered off, one of the client nodes happens to kill a job task (for some other reason, like the task failed)? Or do you mean one of the client nodes is getting restarted?

If it were the client node restarting, I would expect that jobs would be migrated off it once the client is declared dead by the server. If there's no place for the tasks to go, they'd hang out waiting for a placement. Once a client is declared dead, the Nomad server can't know whether it's running the correct workloads anymore (which is why we're working on xxx). So in that case the Nomad server tells the Nomad client to restart from scratch.

samo1 (Author) commented Apr 8, 2020

Hi,
thanks for looking at this issue.

To clarify: that sentence describes the scenario perfectly. One of the servers is powering off, and this is causing one of the clients to restart jobs.

During this scenario, the client node is still running. The node was not restarted or affected in any way. The tasks running on that client node did not fail either; they were restarted by Nomad.

tgross (Member) commented Apr 8, 2020

One of the servers is powering off and this is causing one of the clients to restart jobs.

Oh, that shouldn't be happening at all! We'll look into this.

tgross (Member) commented Apr 8, 2020

Ok, so after a bit more digging into your logs, I think this is another example of #6212. (See below for a walkthrough.)

I've been discussing with @langmartin recently how we want to implement stopping tasks when client nodes are lost (#2185). While this isn't exactly the same case, there's no real difference from the client's perspective as to whether it simply can't communicate with the server or the server is dead. I'm going to link this issue and #6212 in to that issue so that he can make sure we're considering this case as well there.


Let's dig through the logs. On the server, we see one of the server members get marked as down:

preview-6c nomad[17090]: 2020-03-30T11:56:32.404+0200 [DEBUG] nomad: memberlist: Failed ping: preview-5c.global (timeout reached)
preview-6c nomad[17090]: 2020-03-30T11:56:32.735+0200 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=591.471µs
preview-6c nomad[17090]: 2020-03-30T11:56:34.404+0200 [INFO]  nomad: memberlist: Suspect preview-5c.global has failed, no acks received

The client eventually times out its connection to this server:

preview-11c nomad[17078] 2020-03-30T11:57:04.738+0200 [ERROR] client: yamux: keepalive failed: i/o deadline reached
preview-11c nomad[17078] 2020-03-30T11:57:04.738+0200 [ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.GetClientAllocs server=10.12.4.200:4647
preview-11c nomad[17078] 2020-03-30T11:57:04.738+0200 [ERROR] client: error querying node allocations: error="rpc error: EOF"
preview-11c nomad[17078] 2020-03-30T11:57:04.738+0200 [ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.UpdateStatus server=10.12.4.200:4647
preview-11c nomad[17078] 2020-03-30T11:57:04.738+0200 [ERROR] client: error heartbeating. retrying: error="failed to update status: rpc error: EOF" period=1.682371739s
preview-11c nomad[17078] 2020-03-30T11:57:04.738+0200 [DEBUG] client.server_mgr: error pinging server: error="rpc error: EOF" server=10.12.4.200:4647

The client immediately tries to bootstrap again:

preview-11c nomad[17078] 2020-03-30T11:57:04.742+0200 [DEBUG] client.consul: bootstrap contacting Consul DCs: consul_dcs=[dc1]
preview-11c nomad[17078] 2020-03-30T11:57:04.745+0200 [INFO]  client.consul: discovered following servers: servers=[10.12.4.200:4647, 10.12.4.193:4647, 10.12.4.201:4647]
preview-11c nomad[17078] 2020-03-30T11:57:06.277+0200 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=179.019µs
preview-11c nomad[17078] 2020-03-30T11:57:06.499+0200 [DEBUG] client: evaluations triggered by node update: num_evals=2

The client tells the servers "I'm still alive" and gets a new server list, but note that we've logged that we think we probably missed a heartbeat:

preview-11c nomad[17078] 2020-03-30T11:57:06.499+0200 [TRACE] client: next heartbeat: period=19.440951493s
preview-11c nomad[17078] 2020-03-30T11:57:06.499+0200 [DEBUG] client: state updated: node_status=ready
preview-11c nomad[17078] 2020-03-30T11:57:06.499+0200 [WARN]  client: missed heartbeat: req_latency=78.853454ms heartbeat_ttl=13.554540118s since_last_heartbeat=43.86834076s
preview-11c nomad[17078] 2020-03-30T11:57:06.500+0200 [DEBUG] client.server_mgr: new server list: new_servers=[10.12.4.193:4647, 10.12.4.201:4647] old_servers=[10.12.4.201:4647, 10.12.4.193:4647, 10.12.4.200:4647]
preview-11c nomad[17078] 2020-03-30T11:57:06.503+0200 [TRACE] client: next heartbeat: period=13.782266138s

The server sends an allocation update because it thought the node was lost:

preview-11c nomad[17078] 2020-03-30T11:57:06.503+0200 [DEBUG] client: updated allocations: index=189 total=2 pulled=2 filtered=0
preview-11c nomad[17078] 2020-03-30T11:57:06.503+0200 [DEBUG] client: allocation updates: added=0 removed=0 updated=2 ignored=0
preview-11c nomad[17078] 2020-03-30T11:57:06.503+0200 [TRACE] client: updating alloc: alloc_id=5d995c34-4a2e-a136-1724-145ecc3b76cc index=189
preview-11c nomad[17078] 2020-03-30T11:57:06.505+0200 [TRACE] client: next heartbeat: period=18.81660875s
preview-11c nomad[17078] 2020-03-30T11:57:06.511+0200 [TRACE] client: updating alloc: alloc_id=3234395e-e51d-1b26-832c-dd7d3a857844 index=186
preview-11c nomad[17078] 2020-03-30T11:57:06.511+0200 [TRACE] client.alloc_runner.task_runner: Kill requested: alloc_id=5d995c34-4a2e-a136-1724-145ecc3b76cc task=matcher-numa1 event_type=Killing event_reason=
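
For context, how long the client has to check in before the server marks the node lost comes down to the server-side heartbeat settings. Here is a minimal sketch of the relevant server stanza (the values are assumptions close to the defaults, not something pulled from your config):

server {
  enabled = true

  # Extra grace added on top of each client's heartbeat TTL before the
  # node is marked failed; absorbs network and processing delays.
  heartbeat_grace = "10s"

  # Floor on the heartbeat TTL handed out to clients; raising it gives a
  # client more time to fail its RPCs over to another server.
  min_heartbeat_ttl = "10s"
}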

dadgar (Contributor) commented Apr 15, 2020

@tgross This could potentially be a bug in the server list management.

Mar 30 11:57:06 preview-11c nomad[17078]: 2020-03-30T11:57:06.499+0200 [WARN]  client: missed heartbeat: req_latency=78.853454ms heartbeat_ttl=13.554540118s since_last_heartbeat=43.86834076s
Mar 30 11:57:06 preview-11c nomad[17078]: 2020-03-30T11:57:06.500+0200 [DEBUG] client.server_mgr: new server list: new_servers=[10.12.4.193:4647, 10.12.4.201:4647] old_servers=[10.12.4.201:4647, 10.12.4.193:4647, 10.12.4.200:4647]

The client knew about 2 other servers that it should have switched over to for RPCs. It looks like it hadn't heartbeated in 43s; it should have been retrying during that time!

https://github.com/hashicorp/nomad/blob/master/client/client.go#L1522

After a failed RPC we should rotate to the next server: https://github.com/hashicorp/nomad/blob/master/client/rpc.go#L85

tgross modified the milestones: 0.11.1, 0.11.2 Apr 22, 2020
tgross modified the milestones: 0.11.2, 0.11.3 May 13, 2020
tgross (Member) commented May 13, 2020

We're going to be shipping #2185 shortly as part of 0.11.2, which will resolve this issue.
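
For anyone finding this later: that work surfaces as a group-level setting. A minimal sketch, assuming it ships under the name stop_after_client_disconnect (the duration here is a placeholder, not a recommendation), for groups that would rather have a disconnected client stop its tasks than leave them running:

job "matcher0" {
  type = "service"

  group "matcher" {
    count = 2

    # Sketch only: if the client cannot heartbeat for this long, stop the
    # group's allocations locally rather than leaving them running blind.
    stop_after_client_disconnect = "5m"

    # ... existing constraints, reschedule, and task stanzas unchanged ...
  }
}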

tgross closed this as completed May 13, 2020
tgross modified the milestones: 0.11.3, 0.11.2 May 13, 2020
github-actions bot commented Nov 7, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 7, 2022