Orphaned Health Check #1524
Comments
Hmm, sorry, I am having a hard time following. What you are saying is that Consul has a health check registered that does not correspond to any Nomad service running on that machine?
@dadgar Correct, I'm seeing the above health check still firing even after the Nomad job/task group has gone away. Here is another example from another Nomad client in the same region:
@jshaw86 Is the check registered in Consul? And are those logs from Nomad or Consul?
What type of check is this? Is it a script check?
@jshaw86 Is this reproducible? When you say you "tore the jobs down serially", how did you do that? If you let the host run for a while, does the check deregister eventually after 4 hours?
@sean- I am able to reproduce it consistently on our staging environment by scheduling and tearing down a Nomad job every 10s, one at a time (serially), while also running nomad gc once a minute from system cron on a Nomad server.
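For context, a minimal sketch of that reproduction loop, assuming a placeholder job file example.nomad with a Consul-registered HTTP check (the exact job and commands are not shown in the thread):

```bash
#!/usr/bin/env bash
# Rough sketch of the reproduction described above: register and then stop a
# job roughly every 10 seconds, one at a time. "example.nomad" and the job
# name "example" are placeholders, not taken from the thread.
while true; do
  nomad run example.nomad
  sleep 10
  nomad stop example
  sleep 10
done
```

The once-a-minute garbage collection from cron could be done against the HTTP API, e.g. `* * * * * curl -s -X PUT http://localhost:4646/v1/system/gc` (the port and invocation are assumptions; the thread only says "nomad gc" on system cron).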
@jshaw86 When this happens next time, can you share the result of curl localhost:8500/v1/agent/checks?
So I just ran: …
@jshaw86 It would be great, next time you get into this condition, to grab the Consul agent logs, the Nomad client logs, and the checks diptanu mentioned. If they are sensitive you can email them to me.
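A sketch of what that capture might look like, using the endpoints and commands named elsewhere in the thread (the default Consul port and the output file names are assumptions):

```bash
# Snapshot the checks and services the local Consul agent currently has registered.
curl -s localhost:8500/v1/agent/checks   > agent-checks.json
curl -s localhost:8500/v1/agent/services > agent-services.json

# Record whether any Nomad executor processes are still hanging around.
ps aux | grep -i '[e]xecutor' > executors.txt
```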
We are still running into this intermittently; here are the client logs.
We just had another one; these are the most concise logs I can provide:
I have seen similar behavior on 0.3.x and 0.4.x. I am not sure if it is a Consul or Nomad issue, but something like …
Getting the output of curl localhost:8500/v1/agent/checks would be useful here.
@ketzacoatl Did you by chance change the tags on the service?
@dadgar It might be possible, but it's highly unlikely (I rarely change them once set). Also, if I ask Consul to deregister the service, Consul should remove the entry, no?
If anyone is hitting this, could you report whether the executor process is still running (e.g. via ps aux) when the orphaned check appears?
@dadgar I did check this a while back; the process was cleaned up, so it seems like it was just on the Consul end, but I will try to repro. It's possible that I raced something, because after 5 minutes it does seem to get cleaned up.
@jshaw86 Hmm, we run a periodic reaping, every 30 seconds, of all checks/services that shouldn't be there. There was an issue with the executor crashing, fixed in master, that may have been adding an artificial delay.
@jshaw86 It honestly may just be Consul's own reconcile with the servers. It may be gone from the agent, but the remote servers still have it because of a short outage, etc.
@dadgar OK, I have some updated information. Another engineer has seen the executor stick around even when the scheduler says the job has been killed, and the health check still exists because we can still route to the job in fabio, so you'll have to take our word for it until I'm able to replicate this again. According to other engineers, we have only seen this in production lately. These are the logs from our latest experience with the issue, from last week (no ps aux or /v1/agent/checks though), sorry.
@dadgar, I wasn't able to get to a CLI in time to get a ps or curl the agent checks, but I can confirm the executor and process are still running, as the transformation I was doing in the request was still being returned; then after 5 minutes it goes away. Will try to catch it again when I see it; it's just very hard to react in time. :)
Just an update: we are going to upgrade to Consul 0.7, as there seemed to be some health check fixes in there, and monitor to see if that fixes the issue. I'm not 100% sure this will do it, as some of the Nomad jobs were still running after being told to stop.
Changed the milestone, not to indicate that we won't continue to work on this, but because we have larger plans for refactoring the Consul code that should make it more robust. This is a larger piece of work, though, so I want the milestone to reflect that.
@dadgar Does this mean y'all have root-caused this down to the Nomad implementation of the Consul integration, or is it just speculation?
The root cause hasn't been determined, but I wouldn't chalk it up to speculation either. The current implementation is complex and hard to debug, as Consul syncing occurs in various processes, and if they ever get out of sync they are prone to flap. The new design will be centralized, simpler, and much easier to debug.
@dadgar Oh, OK. So when we upgrade to 0.7 you wouldn't expect much difference in behavior? Is there anything we should look for in the logs after we upgrade, besides the ps and v1 health check call you recommended?
@jshaw86 Yeah, I would try to capture the logs, ps aux, and the agent checks output.
@dadgar Once this is delivered in 0.6.0, is the idea that it will be fixed, or that we will be able to gather meaningful logs to then address the issue? I noticed some changelog entries on the Consul side that may have helped stability as well, so were you thinking it could be a Consul issue, or a combination of Consul and Nomad?
@jshaw86 I am pretty sure it is a Nomad issue, but as mentioned, the lack of debuggability makes it difficult to be certain. It is likely that 0.5.4/0.5.5 will bring the fix, and it will be much easier to provide useful logs, since the Nomad client itself will be managing registration/deregistration rather than the executor.
Just an update: I created a generic Nomad service with a count of 1000 and an HTTP health check, on Consul 0.7.2 with Nomad 0.4.2, and I am seeing the same connection-refused orphans in syslog:
Will repeat with 0.5.3, and hopefully 0.6.0 fixes this.
@jshaw86 Can you break down the steps: one Nomad agent? Bring a single job with a health check on a dynamic port up to a count of 1000, then stop the job, and then there are orphans?
In production it's hundreds of jobs with a count < 3, three Nomad servers, many Nomad agents, federated across six regions. In this particular case it was one job with a count > 1000, a single Nomad server, and four Nomad clients. Both should reproduce the issue. Steps: …
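The steps themselves are not captured above; as a rough sketch, a job matching that description (high count, dynamic port, HTTP health check registered in Consul) might look like the following, where the driver, command, and paths are placeholders rather than anything taken from this thread:

```hcl
# Hypothetical job: many instances of a small HTTP service, each registering
# a Consul service with an HTTP health check on a dynamic port.
job "example" {
  datacenters = ["dc1"]

  group "web" {
    count = 1000

    task "server" {
      driver = "exec"                    # placeholder driver
      config {
        command = "/usr/local/bin/web"   # placeholder binary
      }

      resources {
        cpu    = 100
        memory = 64
        network {
          mbits = 1
          port "http" {}                 # dynamic port
        }
      }

      service {
        name = "example"
        port = "http"
        check {
          type     = "http"
          path     = "/health"           # placeholder path
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}
```

Stopping such a job and then listing the agent's checks should show whether any check outlives its allocation.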
(Should have closed with my last comment; sorry for the delay)
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
nomad 0.4.0
Operating system and Environment details
14.04
Issue
This may be a Consul issue that is just surfacing through the Nomad agent, but it seems that a health check can get into a state where the health check endpoint is checked indefinitely even though the check is no longer in the registered set.
Reproduction steps
Not completely sure, but we scheduled many jobs (~100) in serial, tore them down serially, and eventually got to this state.
Nomad Client logs (if appropriate)
I can share the result of
curl localhost:8500/v1/health/state/any
offline, to show there is no health check registered under this IP/port, as there is some sensitive information in there.
UPDATE: I got it out of this state by restarting Consul.
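For anyone wanting a quick view of the orphans, a small sketch that lists the checks the local Consul agent still reports as critical (which is how the connection-refused orphans show up), assuming jq and the default Consul port:

```bash
# List critical checks known to the local Consul agent, with their service name.
curl -s localhost:8500/v1/agent/checks \
  | jq -r 'to_entries[]
           | select(.value.Status == "critical")
           | "\(.key)\t\(.value.ServiceName)"'
```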