Allocations not stopping after modifying a job #6877
Comments
Hi @cjonesy and thanks for reporting this! I was able to reproduce using these job files in a local development environment. After running the job update:
After running the job stop:
Relevant window of the logs:
This log entry looks promising for figuring out what happened here:
With a bit of debugging I've determined at this point that the task runner loop itself seems to be OK, and the scheduler is sending the update with a desired status of stop to the allocation as we'd expect. The "discarding allocation update" log message is coming from

What I've found is that when we submit the update for

With that information in hand and some trace-level debugging I've narrowed it down to when we call into
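(Not part of the original comment — a hedged sketch of how one can check, from the CLI or the HTTP API, which desired status the scheduler has recorded for a stuck allocation; the allocation ID is a placeholder.)

```shell
# After `nomad stop example`, the "Desired" column for the job's allocations
# should read "stop" even while the task keeps running on the client.
nomad job status example

# Drill into a single allocation
nomad alloc status <alloc-id>

# Or read the raw DesiredStatus field from the HTTP API
curl -s "${NOMAD_ADDR:-http://127.0.0.1:4646}/v1/allocation/<alloc-id>" | jq .DesiredStatus
```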
Just chiming in here to second this issue. It happens even with the Consul Connect example with count-api and count-dashboard, and I've seen it with other jobs as well. Not only do allocations not always stop, but the alloc directory in Nomad's data directory remains, which seems to make Nomad re-register the services in Consul as well, making it impossible to get rid of them. I also tried stopping Nomad and forcefully rm -rf'ing that alloc folder on the host, but the /secrets folder in that directory is busy and the command exits with an error "Device or resource busy: .../secrets".
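(A hedged sketch, not the commenter's exact commands: assuming the busy secrets directory is the per-task tmpfs mount Nomad creates on Linux, it has to be unmounted before the allocation directory can be removed; the data_dir, allocation ID, and task name below are placeholders.)

```shell
ALLOC_DIR=/opt/nomad/data/alloc/<alloc-id>   # adjust to your data_dir

# List any mounts still held under the allocation directory
mount | grep "$ALLOC_DIR"

# Unmount the busy secrets tmpfs for each task, then remove the directory
sudo umount "$ALLOC_DIR/<task>/secrets"
sudo rm -rf "$ALLOC_DIR"
```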
Fix a bug where Consul service definitions would not be updated if changes were made to the service in the Nomad job. Currently this only fixes the bug for cases where the fix is a matter of updating the Consul agent's service registration. There is a related bug where destructive changes are required (see #6877), which will be fixed in another PR. The enable_tag_override configuration setting for the parent service is applied to the sidecar service. Fixes #6459
We just ran into this, and it's really bad. The declared/reported state should always match the runtime state; isn't that the point of Nomad? I had a short look at the implementation, and it seems there are several sources of truth for whether an allocation should exist (bad!), and these leaking allocations are not cleaned up from the BoltDB state properly. So far I've been a fan of Nomad, but leaking processes... We don't have many agents at the moment, so we can clean up the mess manually, but what will happen when we scale to hundreds of nodes?
Small update: clearing the BoltDB entries (while the agent is stopped, of course) also works, which is less pesky than trying to remove the allocation directories with busy mounted volumes (those will be picked up by GC).
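(A blunter variant of the workaround above, not the commenter's exact steps — moving the whole client state database aside drops all client-side state, not just the leaked allocation entries; the data_dir path is an assumption based on a default install.)

```shell
sudo systemctl stop nomad

# Nomad's client-side BoltDB state lives under <data_dir>/client/
sudo mv /opt/nomad/data/client/state.db /opt/nomad/data/client/state.db.bak

sudo systemctl start nomad
# Leftover allocation directories should then be eligible for normal GC.
```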
I have the same problem, but no access to the nodes themselves. Is there any way to solve this without restarting the client?
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Nomad v0.10.2 (0d2d6e3)
Operating system and Environment details
CentOS Linux 7
Issue
When making certain changes to a jobfile and re-running the job, allocations can get stuck in a running state. I believe all allocations should be stopped when you run nomad stop example, but that does not happen after some jobfile changes.

Reproduction steps
1. nomad run example.nomad
2. nomad run example_modified.nomad
3. nomad stop example
4. nomad job status example (allocations from the job are still reported as running)
5. sudo systemctl restart nomad on the affected client
6. nomad job status example
Job file (if appropriate)
Initial jobfile:
Modified jobfile (added a port to the redis1 network, then referenced it in the service stanza):
Nomad Client logs (if appropriate)
Nomad Client Node 1
Nomad Client Node 3
Nomad Server logs (if appropriate)
Nothing of interest was logged on the servers during this time.