[WIP] Consul template: expose retry configuration, fixes #2623 #2665
Conversation
I've now tested this in my cluster with success: with this patch, a Consul outage (e.g. no cluster leader) does not cause existing Nomad tasks to be stopped. Before this patch, Nomad only tried 5 times and then killed the job permanently, even after Consul recovered.
Because the timeouts have to be counted as well, the actual time until the job is killed is about 7 minutes in my test.
@stefreak What do you think about exposing the retry behavior in the job spec and, instead of having the individual template retry forever, making it a restartable error so that it counts against the restart policy? The reason I think this may be better is that there could be a partition such that you only want to retry on a single node for an hour before the task is restarted and placed on another node. This would mitigate the case where a Consul agent is misbehaving or that node is partitioned from Consul.
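A minimal sketch of what "make it a restartable error" could mean in practice; the types and names below are illustrative assumptions, not Nomad's actual error-handling code:

```go
package main

import (
	"errors"
	"fmt"
)

// RecoverableError is an illustrative wrapper for the idea above: after a
// bounded number of failed renders, the template runner surfaces an error
// that counts against the task's restart policy instead of killing the
// allocation permanently. These names are hypothetical, not Nomad's API.
type RecoverableError struct {
	Err         error
	Recoverable bool
}

func (e *RecoverableError) Error() string { return e.Err.Error() }

// renderTemplates stands in for the template runner giving up after its
// bounded retries.
func renderTemplates() error {
	return &RecoverableError{
		Err:         errors.New("template: consul unreachable after bounded retries"),
		Recoverable: true,
	}
}

func main() {
	err := renderTemplates()
	var rerr *RecoverableError
	if errors.As(err, &rerr) && rerr.Recoverable {
		// Non-fail kill: restart under the restart policy so the scheduler
		// can eventually place the task on a healthier node.
		fmt.Println("restarting task (counts against restart policy):", rerr)
		return
	}
	fmt.Println("killing task permanently:", err)
}
```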
@dadgar I agree; this would also make hashicorp/consul-template#939 obsolete in this context. I'm not sure yet exactly how to implement it, though; I need to dig further into the code.
@stefreak I think we would still want that consul-template change, to allow a high but non-infinite retry count with a max time. I'll paste some links here for you to get a sense of the code base. Adding the names to the job spec, starting with the internal structs: https://github.com/hashicorp/nomad/blob/master/nomad/structs/structs.go#L2916 With that done, you will more or less have access to it in the client; there we need to parse the error and change the kill to be a non-fail kill!
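For orientation, a hedged sketch of how retry tunables might be surfaced on the internal Template struct in the linked structs.go; the existing fields are abbreviated and the Retry* names are assumptions, not the final API:

```go
package structs

import "time"

// Template is abbreviated here; only a few of the real fields are shown.
// The Retry* fields are a proposed, hypothetical addition.
type Template struct {
	SourcePath string // existing: template source path
	DestPath   string // existing: render destination
	ChangeMode string // existing: action taken on re-render

	RetryAttempts   int           // proposed: 0 could mean retry forever
	RetryBackoff    time.Duration // proposed: initial backoff between attempts
	RetryMaxBackoff time.Duration // proposed: cap for exponential backoff
}
```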
@dadgar I will have a look, thanks for the help, you're awesome :) Do you think it would still be worth deviating from the consul-template retry defaults? For me it was pretty counter-intuitive that a Consul outage has cascading effects on the availability of tasks using the template stanza, and even with a non-fail kill the task would sit pending after it was killed until Consul is available again. On the other hand, I do understand you are considering the possibility of a partition or just the local Consul agent failing. The question is what the default trade-off should be:
- availability (keep the task running, even if its template data may go stale), or
- consistency (kill the task rather than let it keep running with possibly outdated data).

I think I'd go for availability by default, because I get exactly that behaviour with a traditional consul-template + systemd setup (but I can live with either of them).
@stefreak I think once we expose the retry tunables, the job submitter can control that trade-off. If you are okay with stale data, you pick a high or infinite retry count; if not, you bound the retries and let Nomad restart the task to wait for fresh configs.
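As a hedged illustration of that choice (reusing the hypothetical Retry* fields from the sketch above, not a shipped API), a job submitter could lean either way:

```go
package main

import (
	"fmt"
	"time"
)

// TemplateRetry mirrors the hypothetical tunables from the earlier sketch.
type TemplateRetry struct {
	Attempts   int // 0 = retry forever
	Backoff    time.Duration
	MaxBackoff time.Duration
}

func main() {
	// Availability first: tolerate stale data and never give up on Consul
	// coming back; the task keeps running through an outage.
	availability := TemplateRetry{Attempts: 0, Backoff: 250 * time.Millisecond, MaxBackoff: time.Minute}

	// Freshness first: bound the retries so the failure counts against the
	// restart policy and the task can be restarted or placed elsewhere.
	freshness := TemplateRetry{Attempts: 12, Backoff: 250 * time.Millisecond, MaxBackoff: time.Minute}

	fmt.Printf("availability-first: %+v\nfreshness-first: %+v\n", availability, freshness)
}
```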
Hey @stefreak, CT is released and tagged with those new options. Thank you again for the work and the PR. I think we'll need to update the vendored library here and consume/expose those new options.
@sethvargo Cool, I will start working on it as soon as I find some time.
@stefreak Should we close this in the meantime?
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
Retry forever, fixes #2623
I was thinking about getting hashicorp/consul-template#939 merged first, but actually there is no real need to couple the two.
This is WIP because I am first testing the patch in the Nomad cluster where I experience the issue regularly.