Task is restarted on the same host even if it is not necessary #7607
Comments
Hi @samo1 and thanks for reporting this! I want to make sure I understand the scenario correctly...
One of the servers is powering off and this is causing one of the clients to restart jobs? Or are you saying that while one of the servers is powered off, one of the client nodes happens to kill a job task (for some other reason, like the task failed)? Or do you mean one of the client nodes is getting restarted?

If it were the client node restarting, I would expect that jobs would be migrated off it once the client is declared dead by the server. If there's no place for the tasks to go, they'd hang out waiting for a placement. Once a client is declared dead, the Nomad server can't know whether it's running the correct workloads anymore (which is why we're working on xxx). So in that case the Nomad server tells the Nomad client to restart from scratch.
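For context, the point at which a server declares a client node dead is governed by heartbeat settings in the agent's server stanza. A minimal sketch with illustrative values (assumptions for illustration only, not the reporter's configuration):

```hcl
# Illustrative Nomad agent config (server side); values are examples,
# not taken from this cluster.
server {
  enabled = true

  # Extra grace period beyond a client's heartbeat TTL before the
  # server marks that node as down.
  heartbeat_grace = "30s"

  # Lower bound on the heartbeat TTL handed out to clients.
  min_heartbeat_ttl = "10s"
}
```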
Hi, to clarify: this sentence describes the scenario perfectly: "One of the servers is powering off and this is causing one of the clients to restart jobs." During this scenario the client node is still running; it was not restarted or affected in any way. The tasks running on that client node did not fail either. The tasks were restarted by Nomad.
Oh, that shouldn't be happening at all! We'll look into this.
Ok, so after a bit more digging into your logs, I think this is another example of #6212. (See below for a walkthrough.) I've been discussing with @langmartin recently how we want to implement stopping tasks when client nodes are lost (#2185). While this isn't exactly the same case, from the client's perspective there's no real difference between not being able to communicate with the server and the server being dead. I'm going to link this issue and #6212 into that issue so that he can make sure we're considering this case there as well.

Let's dig through the logs. On the server, we see one of the server members get marked as down:
The client eventually times out its connection to this server:
The client immediately tries to bootstrap again:
The client tells the servers "I'm still alive" and gets a new server list, but note that we've logged that we think we probably missed a heartbeat:
The server sends an allocation update because it thought the node was lost:
@tgross This could potentially be a bug in the server list management.
The client knew about 2 other servers that it should have switched over to for sending RPCs. It looks like it hadn't heartbeated in 43s, and it should have been retrying during that time! https://github.com/hashicorp/nomad/blob/master/client/client.go#L1522 After a failed RPC we should rotate to the next server: https://github.com/hashicorp/nomad/blob/master/client/rpc.go#L85
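To illustrate the server list in question: a Nomad client is typically configured with (or discovers) several servers and is expected to fail over between them when an RPC fails. A minimal client stanza sketch (the addresses are placeholders, not this cluster's hosts):

```hcl
# Illustrative Nomad agent config (client side); server addresses are
# placeholders.
client {
  enabled = true

  # The client keeps a list of known servers; after a failed RPC it is
  # expected to rotate to the next entry rather than keep waiting on
  # the unreachable one.
  servers = [
    "nomad-server-1:4647",
    "nomad-server-2:4647",
    "nomad-server-3:4647",
  ]
}
```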
We're going to be shipping #2185 shortly as part of 0.11.2, which will resolve this issue.
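For readers arriving later: the change tracked in #2185 is about controlling what happens to allocations on a client that loses contact with its servers. A minimal sketch of the group-level setting documented around 0.11.2, `stop_after_client_disconnect` (treat the duration as an arbitrary example and the exact semantics as per the Nomad docs):

```hcl
# Illustrative job fragment; the duration is an arbitrary example.
group "example" {
  # If the client running this group cannot heartbeat to the servers
  # for longer than this duration, it stops the group's tasks locally.
  stop_after_client_disconnect = "3m"

  task "app" {
    driver = "docker"

    config {
      image = "example/app:latest" # placeholder image
    }
  }
}
```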
Nomad version
Nomad v0.10.2 (0d2d6e3)
Operating system and Environment details
CentOS Linux release 7.7.1908 (Core)
on VMware vSphere
Issue
Task is restarted on the same host even if it is not necessary.
We have jobs with long startup time, so we try to minimize restarts.
This issue happens quite often in our system tests:
Reproduction steps
Topology: 3 Nomad + Consul servers, 4 client hosts.
Nomad Jobs: about 10 jobs total. We observe this issue in two jobs which run on two client hosts (2 + 2 allocations).
Job file
Nomad Client logs
The log shows the task being killed; later, a new start is performed.
Nomad Server logs
The parts of the server logs that look most relevant: