Nomad has running deployments for non-existent jobs #4520
Comments
Thanks for reporting this. Can you please include further steps on how to reproduce this issue? For example, a sample job config, how the job was stopped/garbage collected, and the amount of time before you started to see this issue occur would all be helpful.
Thanks for taking a look at this so quickly. Unfortunately, I do not have reliable steps to reproduce this issue, as it was only recently found in our infrastructure. In the next week or so I will hopefully have time to investigate this as well as #4299, which I suspect might be related, as both result in components of a job persisting after the job has ended and been garbage collected. In both cases there are 'orphaned' components of a job present that seem to reference the job that they belong to, but might not be referenced by the job that owns them. It should be easy enough to automatically detect when this has happened; I'll get a script posted here to do that when I get to it. The job was most likely stopped via the nomad CLI, but I can't verify that 100% because it happened quite a while ago. The job was stopped on March 7th; however, the oldest error like this I can find in my logs is from yesterday at 16:47 UTC. I don't believe that this is a log retention issue, so I will try to determine what could have triggered these errors to start occurring. They have been frequent since they started. This is also happening on several other jobs, all with almost identical configs except for the customer/region. Here is the final version of that job file before it was stopped:
Hey, in some earlier releases of Nomad we had a data race where multiple deployments could be created. This has since been remedied, and 0.8.4 includes a fix to remove the leaked deployments: #4329. So please upgrade to 0.8.4. If the issue is still there, let us know and we can re-open.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad has running deployments listed for jobs which were stopped several months ago and have long since been garbage collected. It should not be possible to have running deployments for non-existent jobs; they should have been canceled when the job was stopped and garbage collected.
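The condition described here can be checked for mechanically. The following is a minimal sketch, not the script mentioned elsewhere in this thread: it assumes the Nomad HTTP API list endpoints `/v1/deployments` and `/v1/jobs`, a server reachable at `NOMAD_ADDR` (defaulting to `http://127.0.0.1:4646`), no ACL token, and the default namespace only. The function names are made up for illustration.

```python
import json
import os
import urllib.request


def find_orphaned_deployments(deployments, job_ids):
    """Return running deployments whose JobID is not among the known job IDs."""
    known = set(job_ids)
    return [
        d for d in deployments
        if d.get("Status") == "running" and d.get("JobID") not in known
    ]


def fetch_json(base_url, path):
    """GET a Nomad HTTP API path and decode the JSON response body."""
    with urllib.request.urlopen(base_url.rstrip("/") + path) as resp:
        return json.load(resp)


def main():
    # Assumption: the server address comes from NOMAD_ADDR, with no ACLs enabled.
    base_url = os.environ.get("NOMAD_ADDR", "http://127.0.0.1:4646")
    deployments = fetch_json(base_url, "/v1/deployments")
    jobs = fetch_json(base_url, "/v1/jobs")
    orphans = find_orphaned_deployments(deployments, [j["ID"] for j in jobs])
    for d in orphans:
        # Print the deployment ID alongside the job ID it still references.
        print(d["ID"], d["JobID"])


if __name__ == "__main__":
    main()
```

Keeping the comparison in `find_orphaned_deployments` separate from the HTTP calls means it can be exercised against saved API output without a live cluster.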
Nomad Version:
Nomad v0.8.3 (c85483da3471f4bd3a7c3de112e95f551071769f)
This causes error messages in the nomad server logs such as: