Failed node's allocations not rescheduled #2958
Hey @ebdb, would you mind sharing the server logs from when this happens and, if you can, the corresponding status output?
For example, when I kill nodes, I see the server detect the TTL and replace the node's allocations elsewhere (I killed 3 out of 6 nodes and see 3 TTLs):
Further, I wonder if you simply do not have enough resources left on the remaining clients for Nomad to be able to reschedule the work.
The status for the service job:
And the logs from the cluster leader:
The scheduler logs information about the two system jobs (fabio, authentication), but does not even mention the service job. As far as resources are concerned, all of the hosts have capacity to spare, such as this node:
Or this node:
The task is requesting:
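The actual figures did not survive the page extraction; for reference, a task-level resource request in a Nomad 0.6 job file looks roughly like the sketch below (values are illustrative, not the reporter's):

```hcl
# Illustrative values only -- not the reporter's actual request
resources {
  cpu    = 500 # MHz
  memory = 256 # MB

  network {
    mbits = 10
    port "http" {}
  }
}
```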
@ebdb Interesting. Thanks for following up. Does this happen reliably for you, as in every time a node running a service is drained?
@ebdb If this happens again, can you paste the corresponding status output?
@ebdb Is it also possible to get more of the logs and the corresponding status output?
@ebdb Feel free to ignore all those requests! I figured it out.
This PR allows the scheduler to replace lost allocations even if the job has a failed or paused deployment. The prior behavior was confusing to users. Fixes #2958
Hey, just to update this issue: Nomad was handling the node failures for service jobs, but it was not replacing allocations when the job's most recent deployment was "Failed".
I noticed something that seems strange to me. When I have a job with all of its allocations running and passing Consul health checks, I see this as the deployment's status:
Does that seem right to you? Should I open a new issue for that?
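(For context: in Nomad 0.6, deployment health is driven by the job's update stanza. A minimal sketch with illustrative values, not the reporter's actual configuration:)

```hcl
# Illustrative update stanza; all values are examples
update {
  max_parallel     = 1
  health_check     = "checks"   # deployment health follows Consul checks
  min_healthy_time = "10s"
  healthy_deadline = "5m"
  auto_revert      = false
}
```

Roughly speaking, with health_check = "checks" each new allocation must be running with passing Consul checks for min_healthy_time before healthy_deadline expires; otherwise the deployment is marked failed.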
@ebdb The |
The health checks are passing on their first run and stay that way, so I am not sure why the deployment wouldn't pass right away. Just to rule it out, is there a minimum Consul version this works with?
@ebdb You can set your log level to trace and look at the client logs in the meantime to see why it is failing. All even remotely recent versions of Consul should be fine!
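For the trace-level logging suggestion above, a minimal sketch of the agent configuration change (the file path is illustrative, not taken from this report):

```hcl
# e.g. /etc/nomad.d/client.hcl -- example path
log_level = "TRACE"
```

The same level can also be set at startup via the agent's -log-level flag.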
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Nomad v0.6.0
Operating system and Environment details
Ubuntu 16.04
2 nodes running Consul/Nomad Server (this is a dev environment and we want to easily test loss of quorum scenarios, hence the even number)
4 nodes running Consul/Nomad Client
Issue
Allocations on failed clients are not rescheduled on other clients. When a node reconnects to the cluster, the Jobs running on that node are stopped. Only system Jobs are restarted on that node. These last two behaviors are expected. However, in v0.5.6, allocations on failed clients were rescheduled on other clients.
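To make the scenario concrete, a hypothetical minimal service job of the kind affected (all names and values are illustrative, not taken from this environment):

```hcl
job "example" {
  datacenters = ["dc1"]
  type        = "service"

  group "app" {
    count = 2

    task "web" {
      driver = "docker"

      config {
        image = "nginx:1.13"
      }

      service {
        name = "web"
        port = "http"

        check {
          type     = "http"
          path     = "/"
          interval = "10s"
          timeout  = "2s"
        }
      }

      resources {
        cpu    = 100
        memory = 128

        network {
          mbits = 1
          port "http" {}
        }
      }
    }
  }
}
```

The expectation is that when the node running one of these allocations fails, the allocation is replaced on another client with spare capacity, as it was in v0.5.6.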
Reproduction steps
Other Notes
Initially, this environment was upgraded from 0.5.6 to 0.6.0. To rule out an upgrade issue, I stopped all of the Nomad services, cleaned out their data folders and started a new cluster. The issue persists on the new cluster.