Nomad not cleaning consul services/healthcheck after failed deploy [0.9.x] #5770
Comments
Hi, this happened to me too. I was able to reproduce it consistently by doing the following:
Use a job (similar to the one provided by @MagicMicky).
In the first run it won't get inserted in Consul, so it works fine, but if we change something in the job (the memory, for example) and run it again, it will be registered and remain there. If we keep modifying something that makes the task run again, new checks appear.
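For illustration only (this snippet is not from the original report), any edit that forces the task to run again is enough to trigger another registration, for example bumping the memory limit in the resources stanza:

```hcl
resources {
  cpu    = 100
  memory = 128  # changing this (e.g. to 256) and re-running the job starts the
                # task again and leaves the previously registered check behind
}
```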
Hi again, I launched nomad with TRACE logging and saw the following:
In the end the service is still registered with 2 checks. I guess this has to do with the service update hook being called.
We are seeing this as well.
We were able to reproduce this behavior as well using this simple job:
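The job file itself is not preserved in this extract; a minimal sketch of the kind of job described below (a nonexistent Docker image and a service with no check, all names being placeholders) would look roughly like this:

```hcl
job "example" {
  datacenters = ["dc1"]

  group "app" {
    task "app" {
      driver = "docker"

      config {
        # intentionally nonexistent image so the task keeps failing and restarting
        image = "registry.example.com/does-not-exist:latest"
      }

      service {
        name = "example-app"
        port = "http"
        # no check block: every stale registration shows up as a healthy service
      }

      resources {
        cpu    = 100
        memory = 64

        network {
          port "http" {}
        }
      }
    }
  }
}
```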
The docker image does not exist, so the job continues to restart. With each restart, we get another service entry in Consul. (Because there is no defined check, they all show up as healthy services in Consul.)
The problem does not show up if we remove the …; service entries in Consul are cleaned up if the Nomad client is restarted on the node where the job was attempting to run.
We currently only run cleanup service hooks when a task is either Killed or Exited. However, due to the implementation of the task runner, tasks are only Exited if they ever correctly started running, which is not true when an error occurs early in the task start flow, such as not being able to pull secrets from Vault. This updates the service hook to also call the Consul deregistration routines during a task Stop lifecycle event, to ensure that any registered checks and services are cleared in such cases. fixes #5770
@endocrimes -- thank you so much!
Is this merged into the 0.9.4 RC?
This was released in 0.9.3 (https://github.com/hashicorp/nomad/blob/master/CHANGELOG.md#093-june-12-2019).
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad version
Noticed on 0.9.0; the problem is present on 0.9.2-rc1 and 0.9.1 as well.
Operating system and Environment details
Running on Ubuntu
Nomad single node, both client and server
Consul single node, both client and server (v1.4.4) - tested with v1.5.1 as well
Vault single node (v1.1.0)
Tried with and without ACLs enabled (both Nomad and Consul).
Issue
When provisioning a job that contains a template referencing a Vault secret, the deploy fails, but the services and health checks that were registered in Consul are not cleaned up.
Reproduction steps
nomad run my-job.nomad (see file given below).
Job file (if appropriate)
This has been tested and reproduced with the following job file:
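The original job file is not included in this extract. A rough sketch of the shape of job being described (a Docker task, a template rendering a Vault secret, and a service with a health check; all names, paths, and the secret are placeholder assumptions) could look like this:

```hcl
job "web" {
  datacenters = ["dc1"]

  group "web" {
    task "web" {
      driver = "docker"

      config {
        image = "nginx:1.15"
        port_map {
          http = 80
        }
      }

      vault {
        policies = ["my-app"]  # placeholder Vault policy
      }

      # Template rendering a Vault secret; if the secret cannot be fetched,
      # the task fails before it ever starts running.
      template {
        destination = "secrets/app.env"
        env         = true
        data        = <<EOT
{{ with secret "secret/my-app" }}PASSWORD={{ .Data.password }}{{ end }}
EOT
      }

      service {
        name = "web"
        port = "http"

        check {
          type     = "http"
          path     = "/"
          interval = "10s"
          timeout  = "2s"
        }
      }

      resources {
        cpu    = 100
        memory = 128

        network {
          port "http" {}
        }
      }
    }
  }
}
```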
Also reproduced with a custom upgrade/reschedule policy, which led to two allocations in total and to two failing health checks registered in Consul. (We could match the Consul health checks with the Nomad allocations via the registered ports.)
Logs
Not much information was found in the Nomad logs. The same goes for Consul; its logs are clean.
nomad.log
Consul configured:
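The actual agent configuration is not preserved here; a minimal single-node Consul setup of the kind described (server and client on one host, ACLs optional) might look roughly like this, with the data_dir and addresses as placeholders:

```hcl
# consul.hcl -- minimal single-node agent (server + client on the same host)
datacenter       = "dc1"
data_dir         = "/opt/consul"
server           = true
bootstrap_expect = 1
bind_addr        = "127.0.0.1"
client_addr      = "127.0.0.1"

# The issue reproduces with and without ACLs enabled; an acl block would go here.
```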
Nomad configured:
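Likewise, a minimal single-node Nomad agent configuration consistent with the setup described (server and client enabled on the same node, Consul and Vault integration pointed at local agents; addresses and paths are placeholders) could be sketched as:

```hcl
# nomad.hcl -- minimal single-node agent (server + client on the same host)
data_dir  = "/opt/nomad"
bind_addr = "127.0.0.1"

server {
  enabled          = true
  bootstrap_expect = 1
}

client {
  enabled = true
}

consul {
  address = "127.0.0.1:8500"
}

vault {
  enabled = true
  address = "http://127.0.0.1:8200"
}
```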