Nomad job using template stanza permanently dead after transient consul errors #2623
Comments
Have you filtered the logs in some way? It is very hard to tell what is happening.
@dadgar I am now monitoring that service to know the exact time it stops working. Next week I can then find out from the logs exactly what happened, as it happens every week.
@dadgar OK, like clockwork it happened again :) The services are down again since 06:48:15 UTC
nginx should be running on nomad-loadbalancer, but it is "dead"; also scorecard is not running. From what I see there could have been transient network problems for a few minutes at that time. Probably consul returned 500s because it could not reach any server. Some logs on nomad-loadbalancer0 (times are in UTC):
Nomad did not log anything else on that host after 06:47:46 UTC. Any other ideas where to look for what went wrong?
I can confirm there was a transient network issue at that time and the virtual machines were probably isolated from each other for up to 10 minutes.
nginx and scorecard services have in common that they are using consul_template service discovery and/or key-value store functionality. All other services don't do that.
The consul cluster is now healthy by the way, but it wasn't at the time the service stopped. Expected behaviour is:
Also I'd expect recovery once consul returns to a functioning state, even if 2 is not the case ...
This is a no-go for production IMO.
@dadgar sorry for my last comment, not very helpful of me to complain ... do you have enough info to understand what's going on?
@stefreak Thanks for the report. I understand what is happening now. We have a default retry rate that is not exposed. For 0.6.0 I would like to expose the retry rate so it can be extended and optionally set to retry indefinitely.
Solution is to expose the retry rate.
OK, thanks, makes sense. What I don't understand is why it's not the default to retry indefinitely, maybe with a maximum retry interval like 5 minutes? IMO transient errors should not result in permanent loss of service ... Also, failing to contact consul should not kill the job – it does not if I use consul_template directly.
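For reference, when consul-template runs standalone, its own configuration exposes exactly this kind of retry control. A minimal sketch with illustrative values (this is consul-template's config file, not the Nomad template stanza):

```hcl
# consul-template config file (standalone usage), illustrative values
consul {
  address = "127.0.0.1:8500"

  retry {
    # Keep retrying failed Consul requests instead of giving up.
    enabled     = true
    # 0 retries indefinitely; a positive number caps the attempts.
    attempts    = 0
    # Exponential backoff between attempts, capped at max_backoff.
    backoff     = "250ms"
    max_backoff = "1m"
  }
}
```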
The defaults of consul-template, together with a traditional setup like systemd's restart behaviour, already recover from transient errors. If a nomad job fails for any reason, nomad (at least with restart mode delay) will also try to restart it forever, which is the default and IMO should also be the default for the template stanza.
@dadgar Exposing retry will not be enough to fix all instances of this; I now encountered another instance of this issue with my patched version of nomad (retries forever). For some reason it still stopped retrying... I think I did not see the message. (Background: the VM had stalled CPUs and I/O issues because of a hypervisor problem.)
@dadgar This time restarting nomad resolved the issue. What can I do to get more debug info next time?
@stefreak Your Consul and Nomad cluster lost quorum. The retry behavior would also have covered this:
Restarting Nomad was just a coincidence; the problem with the template was that Consul queries were failing because of lack of quorum.
@dadgar I think this is not covered, because this was using my patched version of nomad with unlimited retry on all servers and clients: https://github.com/stefreak/nomad/tree/v0.5.6-patched
@stefreak At least in the logs provided, it does not appear like the consul-template gave up on retrying. It was just that it could not render because Consul did not have a leader.
@dadgar I captured the logs one day later and there were no more log entries. So somehow nomad locked up and the job was down, I think for another reason. Not sure if I can reproduce this behaviour though; if I have a lot of time some day I will try by simulating a very slow / locked-up storage.
I believe that I've run into this issue as well. Nomad 0.5.6. Consul 0.8.5. I updated my test cluster yesterday to Consul 0.8.5 and noticed my service jobs died around that same time. The previous allocations have all been GCd this morning, so I just have the remote logs to go on. The logs show attempts with retries to the health and KV APIs all reaching maximum retries at/around 5s:
We regularly upgrade Consul to each point release. We've had these service jobs end up dead like this in the past, but haven't correlated it to the consul agent restart. I'll watch for it more closely on future upgrades. This is also a test cluster with smaller consul and nomad instances and the baseline raft performance settings for those smaller systems. The larger production systems may be able to restart the local consul agent within the ~5 seconds' worth of retries currently available.
I just did some tests for this on a nomad worker in my test cluster and can reproduce the issue. With 2 task groups of the same application running on 2 different workers, introduce consul agent availability issues on one of the workers:
Test 1: Restart the local consul agent (using systemd "restart").
Test 2: Stop the local consul agent, wait ~7s, start the local consul agent.
This job is a "service" job and does not currently specify a "restart" stanza. I'm looking at the restart stanza docs and I think the default restart policy should apply here.
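For context, a group-level restart stanza with mode = "delay" looks roughly like the sketch below (illustrative values, hypothetical job, not the actual test job). With mode = "delay" the task keeps being restarted after the attempts are exhausted rather than being marked as failed:

```hcl
# Group-level restart policy, illustrative values
group "app" {
  restart {
    attempts = 2      # restarts allowed within the interval
    interval = "30m"
    delay    = "15s"  # wait between restarts
    # "delay": once attempts are exhausted, wait for the interval to
    # elapse and keep trying instead of marking the task as failed.
    mode     = "delay"
  }
}
```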
I tried adding a "restart" stanza to the task groups in my test job, but that doesn't change the behavior. The "Test 2" scenario still fails tasks every time. I've also noticed that the "Test 1" scenario will sometimes fail tasks depending on how long it takes the consul agent to come back up. @stefreak @dadgar Let me know if there's anything else I can test or repro for this. Having the consul-template retry controls available would be great for tuning certain jobs, but I'd also like to understand whether nomad should be restarting these tasks once the local consul agent comes up again. Should the nomad agent's overall consul retry options be configurable (more than 5 retries / 5 seconds, etc.)? Should the tasks be killed in the first place because the consul agent is down? Mine use the docker driver, if it matters.
Just to update, I would like to get this fixed in 0.6.1 |
@dadgar does that mean you are working on it right now? Just asking so we don't duplicate the effort |
@stefreak Nope, haven't started yet. Busy with polishing 0.6.0, and I have other things I will be personally tackling in 0.6.1.
Hit this issue but with a slightly different behaviour. I have a couple ...
Worth mentioning that I'm on Nomad v0.8.0.
Any updates on this? I had a full Nomad outage because we lost Consul leadership, and all the running jobs entered the failed state because Nomad was unable to read from the Consul K/V store to render the templates. It would be nice to have a way to tell Nomad to do nothing to running allocations if template rendering fails.
@danlsgiga We've managed to pinpoint the code that causes this in nomad. See my comments from a couple of days ago on #11209. In short: there is no way to prevent a full task outage if you depend on the restart or signal change_mode for any of your templates. Changing it to noop should prevent that, but may be undesirable if you need template reloading.
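The noop workaround mentioned above is set per template in the job file; a minimal sketch with hypothetical paths:

```hcl
# Per-template setting inside a task, hypothetical paths
template {
  source      = "local/app.conf.tpl"
  destination = "local/app.conf"
  # "noop": re-renders (and render failures) never restart or signal
  # the task; the trade-off is the task won't pick up config changes.
  change_mode = "noop"
}
```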
Reopening per user request (@jcdyer). This comment in #11209 contains the code in question and this comment contains a suggested change in behavior. We'll look into the suggested change and consider the impact. |
Closed by #13041 |
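Note: recent Nomad releases also expose consul-template's retry settings in the client agent configuration. The keys below reflect that config shape as I understand it, so verify the exact names against the documentation for your Nomad version:

```hcl
# Nomad client agent configuration; key names should be checked
# against the docs for the Nomad version in use
client {
  template {
    consul_retry {
      attempts    = 0       # 0 = retry Consul failures indefinitely
      backoff     = "250ms"
      max_backoff = "1m"
    }
  }
}
```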
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad version
Output from `nomad version`:
Operating system and Environment details
Ubuntu 16.04 LTS
Issue
Every week my job reproducibly turns "dead":
I am not sure yet what the cause is, that's why I provide so much info and logs here.
I suspect it could be connected with the fact that all nodes in the nomad cluster reboot regularly, automatically.
Nomad cluster looks like this:
servicehosts are consul and nomad servers, the rest are consul and nomad clients.
Reproduction steps
Not sure yet
Nomad Server & Client logs (if appropriate)
see attachment
scorecard.log.txt
Job file (if appropriate)
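The original job file was not preserved in this thread. Purely as a stand-in, a minimal hypothetical service job that renders an nginx upstream from Consul via the template stanza (not the reporter's actual job) could look like:

```hcl
# Hypothetical minimal job, not the reporter's actual job file
job "nginx" {
  datacenters = ["dc1"]
  type        = "service"

  group "lb" {
    task "nginx" {
      driver = "docker"

      config {
        image   = "nginx:stable"
        volumes = ["local/nginx.conf:/etc/nginx/conf.d/default.conf"]
      }

      template {
        destination = "local/nginx.conf"
        change_mode = "restart"
        data        = <<EOT
upstream scorecard {
{{- range service "scorecard" }}
  server {{ .Address }}:{{ .Port }};
{{- end }}
}
server {
  listen 80;
  location / {
    proxy_pass http://scorecard;
  }
}
EOT
      }
    }
  }
}
```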