Nomad incorrectly marking unhealthy allocs as healthy during rolling upgrade #7320
Comments
We have also seen this problem in our testing with v0.10.4, and don't fully understand the situation but it's definitely a problem. Basically, no deployment will ever fail due to health checks currently. We have a job whose task will never get healthy, yet for some reason, Nomad always passes the deployment. Consul properly reports the task as unhealthy. Another big problem related to this is that it seems the
We can mitigate this by setting the
During testing, we've set the task's check's

Solid report @dpn, this regression seems pretty dire. Thank you @kainoaseto and @tydomitrovich for finding and helping to troubleshoot.
Hi guys, apologies if I'm inflating the priority for this issue, but it seems pretty serious that we cannot depend on health checks of allocations during deployments. Could we get confirmation that this issue has been acknowledged and is being prioritized (hopefully on the higher side)?
@dpn @djenriquez This seems very bad indeed. I'll be investigating this now and will post updates when I get an understanding of the underlying issue and if there are any mitigating factors. Thank you very much for the detailed and clear reproducibility steps.
Thanks again for the issue. It's indeed very serious - it affects virtually all deployments, and affects Nomad versions as old as 0.8.0, and I believe earlier ones as well. It affects deployments where
One workaround is to increase min_healthy_time to be higher than possible restart delays. I'm working on the fix and aim to have it ready later this week.
Thanks @notnoop, really appreciate you digging into this. Do you think this will be backported to the 0.9 and 0.10 series of releases? I know we're lagging behind by being on 0.9 but we'll be finishing up our 0.10 validation soon and plan to migrate over once that's complete. |
Thank you @notnoop for looking into this and for the workaround in the meantime! I will look at implementing that workaround in our jobs for our 0.10 clusters to mitigate this bug and will watch for the fix later this week.
Hi @notnoop and anyone else that runs into this before the release of the fix in 0.11.0. I was able to test the mitigation by changing the
Thanks for the workaround! Below is some sample configuration in case anyone else runs into the same thing:
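A minimal sketch of what that workaround can look like in a job spec, assuming the mitigation is the min_healthy_time bump described in the earlier comment; the values here are illustrative, not the exact configuration used by this poster:

```hcl
group "app" {
  # Workaround sketch: make min_healthy_time longer than any delay the
  # restart stanza can introduce, so an alloc that keeps being restarted by
  # check_restart never accumulates enough continuous healthy time to be
  # counted as healthy by the deployment.
  update {
    max_parallel     = 1
    health_check     = "checks"
    min_healthy_time = "30s" # must exceed the restart delay below
    healthy_deadline = "5m"
  }

  restart {
    attempts = 2
    interval = "5m"
    delay    = "15s"
    mode     = "delay"
  }

  # tasks omitted
}
```

The idea is simply that min_healthy_time exceeds the longest delay the restart stanza can impose, per the suggestion above.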
Thanks @notnoop for the quick fix! |
I experience the same behavior with 0.11.3. Nomad does not wait until the current allocs become healthy before restarting the next ones.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad version
Found on
Also repros on these versions in our test clusters
Operating system and Environment details
Originally found in AWS:
Reproduced on colocated hardware:
Issue
The issue was discovered when one of our engineers pushed out a deployment where the replacement allocs were failing their healthchecks due to improperly configured Security Groups in AWS, yet Nomad continued to replace the healthy allocs with unhealthy ones until the entire service was down.
In the repro steps it seems that Nomad thinks these replacement allocations are healthy when they're not, and this seems to be triggered when the replacement alloc is restarted by the service CheckRestart stanza. Another thing to note is that this doesn't reproduce with a single-task job; multiple tasks are required for this behavior. I'm not seeing any issues with the config that would lead to this behavior, but it's entirely possible I've overlooked something.
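For reference, a minimal sketch of the kind of check plus CheckRestart configuration being described, assuming an HTTP check on one of the tasks; the names and values are illustrative, not the actual job's:

```hcl
task "web" {
  driver = "docker"

  service {
    name = "web"
    port = "http"

    check {
      type     = "http"
      path     = "/health"
      interval = "10s"
      timeout  = "2s"

      # If the check keeps failing, Nomad restarts the task in place.
      # The behavior reported in this issue is that an alloc restarted this
      # way is still treated as healthy by the deployment.
      check_restart {
        limit = 3
        grace = "10s"
      }
    }
  }
}
```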
Reproduction steps
Submit stable job
Wait for initial deployment to succeed:
Modify the job file to tweak the cpu allocation (this forces a new deployment, simulating a Docker image version bump) and break the healthcheck on one of the allocations by tweaking the healthcheck path
Submit the updated job
Deployment begins by creating a new alloc
The CheckRestart stanza takes effect, restarting the new alloc:
Nomad schedules a new allocation with the new job spec and tears down one of the old allocations, essentially continuing the deployment even though the healthchecks on the new allocs are still unhealthy. This is the behavior we're confused about:
This continues until all healthy allocs are gone, replaced by unhealthy ones (although Nomad incorrectly thinks they're healthy):
State of new allocs after deployment completes:
Final state of deployment
Job file (if appropriate)
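The original job file isn't reproduced here; as a stand-in, below is an illustrative two-task job consistent with the repro description above (multiple tasks per group, an HTTP check with check_restart, and a cpu value that can be tweaked to force a new deployment). All names, images, and values are assumptions, and the port wiring uses the pre-0.12 task-level network syntax current for the Nomad versions mentioned:

```hcl
job "repro" {
  datacenters = ["dc1"]
  type        = "service"

  update {
    max_parallel     = 1
    health_check     = "checks"
    min_healthy_time = "10s"
    healthy_deadline = "5m"
  }

  group "app" {
    count = 3

    # The issue notes the behavior only reproduces with more than one task
    # in the group.
    task "web" {
      driver = "docker"

      config {
        image = "nginx:1.17"
        port_map {
          http = 80
        }
      }

      resources {
        cpu    = 100 # bumping this forces a new deployment in the repro
        memory = 64

        network {
          port "http" {}
        }
      }

      service {
        name = "repro-web"
        port = "http"

        check {
          type     = "http"
          path     = "/" # breaking this path makes the check fail
          interval = "10s"
          timeout  = "2s"

          check_restart {
            limit = 3
            grace = "10s"
          }
        }
      }
    }

    task "sidecar" {
      driver = "docker"

      config {
        image = "nginx:1.17"
      }

      resources {
        cpu    = 100
        memory = 64
      }
    }
  }
}
```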
I've left off other logs as I think the repro steps are sufficient and this reproduces 100% of the time in our setup, but happy to gather some if necessary.