
[WIP] Consul template: expose retry configuration, fixes #2623 (#2665)

Closed


@stefreak (Author)

Retry forever, fixes #2623

I was thinking about first getting hashicorp/consul-template#939 merged, but actually there is no real need to couple the two.

This is WIP because I am first testing the patch in the Nomad cluster where I hit this issue regularly.

@stefreak (Author) commented May 22, 2017

I've now tested this in my cluster with success: with this patch, a Consul outage (e.g. no cluster leader) does not cause existing Nomad tasks to be stopped.

Before this patch, Nomad only retried 5 times and then killed the job permanently, even after Consul recovered.

stefreak changed the title from "[WIP] Consul template: retry forever, fixes #2623" to "Consul template: retry forever, fixes #2623" on May 22, 2017
@stefreak (Author) commented May 23, 2017

Because the timeouts need to be counted as well, the actual time until the job is killed was about 7 minutes in my test...

@dadgar (Contributor) commented May 23, 2017

@stefreak What do you think about exposing the retry behavior in the job spec and, instead of having the individual template retry forever, making it a restartable error so that it counts against the restart policy?

The reason I think this may be better is that there could be a partition such that you only want to retry on a single node for an hour and then want the task restarted and placed on another node. This would mitigate the case where a Consul agent was misbehaving or the node was partitioned from Consul.
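
For concreteness, the restart policy such an error would count against already exists in the code base. A minimal sketch: the `structs.RestartPolicy` field names below are real, while the values merely illustrate the "retry on one node for an hour, then stop restarting locally" idea.

```go
package sketch

import (
	"time"

	"github.com/hashicorp/nomad/nomad/structs"
)

// exampleRestartPolicy shows the knobs a restartable template error would
// count against. Field names are real; the values are illustrative only.
func exampleRestartPolicy() *structs.RestartPolicy {
	return &structs.RestartPolicy{
		Attempts: 3,                // restarts allowed within Interval
		Interval: 1 * time.Hour,    // window over which Attempts is counted
		Delay:    15 * time.Second, // wait between restarts
		// After Attempts is exhausted, stop restarting on this node.
		Mode: structs.RestartPolicyModeFail,
	}
}
```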

@stefreak (Author) commented May 23, 2017

@dadgar I agree; this would also make hashicorp/consul-template#939 obsolete in this context, but I'm not sure yet exactly how to implement it. I need to dig further into the code.

@dadgar (Contributor) commented May 24, 2017

@stefreak I think we would still want that work, so we can allow a high but non-infinite retry count with a maximum time.

Gonna paste some links here for you to get a sense of the code base:

Adding the names to the job spec:

Internal structs: https://github.com/hashicorp/nomad/blob/master/nomad/structs/structs.go#L2916
API structs: https://github.com/hashicorp/nomad/blob/master/api/tasks.go#L329
Parsing code: https://github.com/hashicorp/nomad/blob/master/jobspec/parse.go#L851
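
As a rough sketch of that first step, the additions to the internal struct might look like the following. The `RetryAttempts` and `RetryBackoff` names are placeholders invented for illustration; no field names were settled in this thread.

```go
package sketch

import "time"

// Template mirrors the shape of the struct at the first link; existing
// fields are elided. The two retry fields are hypothetical additions.
type Template struct {
	SourcePath string // existing field
	DestPath   string // existing field
	ChangeMode string // existing field

	// RetryAttempts bounds how many times consul-template retries a failed
	// Consul request before the error is surfaced to the task. A value of
	// 0 could mean "retry forever"; the semantics were still open here.
	RetryAttempts int

	// RetryBackoff is the base wait between retry attempts.
	RetryBackoff time.Duration
}
```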

With that done you will more or less have access to it in the client:
Setting the retries: https://github.com/hashicorp/nomad/blob/master/client/consul_template.go#L393
Killing the task without making it a failure: https://github.com/hashicorp/nomad/blob/master/client/consul_template.go#L216

There we need to parse the error and change the kill to be a non-fail kill!
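
Putting the last two links together, the client-side change might take roughly the following shape. This is a hedged sketch, not the real diff: `taskHooks` stands in for the actual hook type, and `isRetryExhausted` is a hypothetical helper for the error parsing mentioned above.

```go
package sketch

import "strings"

// taskHooks is a stand-in for the client's task hooks; the final argument
// to Kill says whether the kill counts as a task failure (assumed shape).
type taskHooks interface {
	Kill(source, reason string, fail bool)
}

// handleRunnerError sketches the proposal: when consul-template gives up
// after exhausting its retries, kill the task WITHOUT marking it failed,
// so the restart policy decides whether and where it runs next.
func handleRunnerError(hooks taskHooks, err error) {
	if isRetryExhausted(err) {
		hooks.Kill("consul-template", err.Error(), false) // non-fail kill
		return
	}
	hooks.Kill("consul-template", err.Error(), true) // still a failure
}

// isRetryExhausted is hypothetical; the real check would parse the error
// returned by the consul-template runner.
func isRetryExhausted(err error) bool {
	return strings.Contains(err.Error(), "retry")
}
```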

@stefreak (Author) commented May 24, 2017

@dadgar I will have a look, thanks for the help, you're awesome :)

Do you think it would still be worth deviating from the consul-template retry defaults? For me it was pretty counterintuitive that a Consul outage has cascading effects on the availability of tasks using the template stanza, and even with a non-fail kill, the task would be pending after it was killed until Consul becomes available again.

On the other hand, I do understand your point about the possibility of a partition, or just the local Consul agent failing.

The question is what the default trade-off should be:

  • higher availability + stale templates during a partition or Consul outage

or

  • always-fresh templates + no availability during a full Consul outage

I think I'd go for availability by default, because that is exactly the behaviour I get with a traditional consul-template + systemd setup (but I can live with either).

@dadgar (Contributor) commented May 24, 2017

@stefreak I think once we expose the retry tunables, the job submitter can control that trade-off. If you are okay with stale templates, you pick a high or infinite number of retry attempts; if not, you bound the retries and let Nomad restart the task while it waits for fresh configs.

stefreak changed the title from "Consul template: retry forever, fixes #2623" to "[WIP] Consul template: expose retry configuration, fixes #2623" on May 25, 2017
@sethvargo (Contributor)

Hey @stefreak

CT is released and tagged with those new options. Thank you again for the work and the PR. I think we'll need to update the vendored library here and consume/expose those new options.
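
Once the vendored library is updated, consuming those options could look roughly like this — a sketch assuming the `config` package shipped with that consul-template release (helpers such as `config.Int` and the `RetryConfig` fields are taken from it, but the Nomad-side wiring is invented here):

```go
package sketch

import (
	"time"

	ctconf "github.com/hashicorp/consul-template/config"
)

// runnerConfig passes retry tunables through to consul-template. Per the
// consul-template docs, attempts == 0 means retry indefinitely; a bounded
// value surfaces an error once the attempts are exhausted.
func runnerConfig(attempts int, backoff time.Duration) *ctconf.Config {
	conf := ctconf.DefaultConfig()
	conf.Consul.Retry = &ctconf.RetryConfig{
		Enabled:  ctconf.Bool(true),
		Attempts: ctconf.Int(attempts),
		Backoff:  ctconf.TimeDuration(backoff),
	}
	conf.Finalize() // fill remaining defaults
	return conf
}
```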

@stefreak (Author)

@sethvargo Cool, I will start working on it as soon as I find some time.

@dadgar (Contributor) commented Jun 23, 2017

@stefreak Should we close this in the meantime?

@github-actions (bot)

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked this pull request as resolved and limited conversation to collaborators on Mar 31, 2023

Successfully merging this pull request may close these issues.

Nomad job using template stanza permanently dead after transient consul errors
3 participants