-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Old script health checks come back when upgrading a cluster #6815
Comments
Thanks for reporting this @cwalsh! There were some changes to the script check registration process in 0.10.0 to support task group services and Consul Connect, so that seems like a likely candidate to start our investigation into this bug. |
Just as a quick update here (I work with @cwalsh), this doesn't appear to be specific to an upgrade. It appears that under some conditions that we're yet to identify, a new duplicate check is registered in Consul by Nomad, with a different check ID. Nomad stops checking the existing check, which causes the TTL to expire, and the service go critical. I believe at this point, we've ruled out the upgrade because we've performed the usual node drain, data directory purge, dance. The only thing I've been able to think of is that under some conditions a new check ID is generated, maybe based on a hash of the check which is missing a couple of properties/has different values to what was used to determine the check ID earlier. We haven't been able to trace it back to a specific check configuration. |
This is still an issue in 0.10.4 |
@tgross (sorry for the direct @), Jimmy is also part of my team and has been the one that's spent the most time on this of late, even with Nomad v0.10.4. I don't think this bug is directly related to upgrades anymore, as new containers scheduled exhibit the same problem, even if you reset a node. Unfortunately we don't have any great reproduction instructions yet -- do you have any suggestions on where we might go from here or what might be useful? |
Can you provide an example jobspec where this happens? |
Thanks Tim - I emailed an example Jobspec to the nomad-oss-debug email address, also added a little more context. Many thanks in advance! |
Thanks @jimmybigcommerce, that's a big help. I've had a look at the jobspec JSON you sent and I've pulled out a redacted version and turned it into HCL which I think illustrates the issue: service {
port = "myport"
check {
type = "script"
command = "/usr/lib/check_http"
args = [
"-I",
"localhost",
"-p",
"${NOMAD_PORT_myport}",
"container",
]
interval = "30s"
timeout = "15s"
}
} As @chrisboulton pointed out above, we use a hash of the service and check to register the IDs:
In 0.10.0 we made changes to support group-level services, which included a rework of script checks so that they could be defined at the group level but executed inside a task container. But there was a bug in this where services/checks that used interpolation would be registered before interpolation, and then executed after interpolation. In 0.10.2 I fixed this bug ( #6586). But I missed adding the interpolation to script checks, and so this was fixed in 0.10.4 (#6916). However, there's still a lurking gotcha. If we do interpolation before registering and interpolation, then checks that have interpolation that varies on a per allocation basis will end up with different checks to register for the same service. Which is exactly the case here we have with using the We definitely should fix this before and make it more intuitive... I think I want to pull in some colleagues (maybe @nickethier and @schmichael?) for their thoughts as this section of code seems to be a bit of a bug farm. In the meantime as a workaround, note that the script check is being executed inside the task container. So you should be able to make the script check directly to the port that the application is listening on, rather than to the |
@tgross Thanks a bunch for tracking this down! The explanation makes a bunch of sense especially after a couple of scours over the code in this area I've made of late. @cwalsh: Tim's workaround here is probably relatively trivial for us to make on our end, and it's a good callout given we know the actual value of the port we need to check against. A change there on our would let us forge ahead with the Nomad upgrade. Did we want to go down that path for now? I have a feeling the reason we're checking against the mapped port is purely a leftover from days gone by. |
Thank you @tgross greatly appreciated! |
When upgrading a Nomad client in a cluster from 0.9.4 to 0.10.1, we see allocations end up with two script checks registered in Consul against them - the new one which passes, and an old one which fails. Allocations running or started on the 0.9.4 nodes in the cluster continue to function normally.
Nomad version
All server nodes: Nomad v0.10.1 (829f9af35c77d564b3dab74454eeba9bf25e2df8)
Some client nodes: Nomad v0.10.1 (829f9af35c77d564b3dab74454eeba9bf25e2df8)
Remaining client nodes: Nomad v0.9.4 (a81aa84)
Operating system and Environment details
Debian Jessie
Issue
Old script-based health checks come back, a problem similar to #5824. It seems to only happen with script checks.
Reproduction steps
Start with a cluster running 0.9.4, with a number of jobs with health check scripts.
Upgrade the server nodes to 0.10.1
Drain a client node
Upgrade nomad on the drained client node to 0.10.1
Restart the nomad job to allow it to run on the newly upgraded node
Wait a few minutes and watch the old script health checks get re-registered against the new allocations.
I can't replicate this in development mode with a single combined server/client, but it reliably happens on our clusters.
The text was updated successfully, but these errors were encountered: