[WIP] Consul template: expose retry configuration, fixes #2623 #2665
Conversation
I've now tested this in my cluster with success: with this patch, a Consul outage (e.g. no cluster leader) does not cause existing Nomad tasks to be stopped. Before this patch, Nomad only tried 5 times and then killed the job permanently, even after Consul recovered.
Because the timeouts have to be counted as well, the actual time until the job is killed is about 7 minutes in my test.
@stefreak What do you think about exposing the retry behavior in the job spec and, instead of having the individual template retry forever, making it a restartable error so that it counts against the restart policy? The reason I think this may be better is that there could be a partition such that you only want to retry on a single node for an hour before the task is restarted and placed on another node. This would mitigate the case where a Consul agent is misbehaving or that node is partitioned from Consul.
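A minimal sketch of what "make it a restartable error" could mean in practice; the types and names below are illustrative assumptions, not Nomad's actual error-handling code:

```go
package main

import (
	"errors"
	"fmt"
)

// RecoverableError is an illustrative wrapper for the idea above: after a
// bounded number of failed renders, the template runner surfaces an error
// that counts against the task's restart policy instead of killing the
// allocation permanently. These names are hypothetical, not Nomad's API.
type RecoverableError struct {
	Err         error
	Recoverable bool
}

func (e *RecoverableError) Error() string { return e.Err.Error() }

// renderTemplates stands in for the template runner giving up after its
// bounded retries.
func renderTemplates() error {
	return &RecoverableError{
		Err:         errors.New("template: consul unreachable after bounded retries"),
		Recoverable: true,
	}
}

func main() {
	err := renderTemplates()
	var rerr *RecoverableError
	if errors.As(err, &rerr) && rerr.Recoverable {
		// Non-fail kill: restart under the restart policy so the scheduler
		// can eventually place the task on a healthier node.
		fmt.Println("restarting task (counts against restart policy):", rerr)
		return
	}
	fmt.Println("killing task permanently:", err)
}
```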
@dadgar I agree; this would also make hashicorp/consul-template#939 obsolete in this context. I'm not sure yet exactly how to implement it, though; I need to dig further into the code.
@stefreak I think we would still want that consul-template change, to allow a high but non-infinite retry count with a max time. I'll paste some links here for you to get a sense of the code base. Adding the names to the job spec, starting with the internal structs: https://github.com/hashicorp/nomad/blob/master/nomad/structs/structs.go#L2916 With that done, you will more or less have access to it in the client; there we need to parse the error and change the kill to be a non-fail kill!
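For orientation, a hedged sketch of how retry tunables might be surfaced on the internal Template struct in the linked structs.go; the existing fields are abbreviated and the Retry* names are assumptions, not the final API:

```go
package structs

import "time"

// Template is abbreviated here; only a few of the real fields are shown.
// The Retry* fields are a proposed, hypothetical addition.
type Template struct {
	SourcePath string // existing: template source path
	DestPath   string // existing: render destination
	ChangeMode string // existing: action taken on re-render

	RetryAttempts   int           // proposed: 0 could mean retry forever
	RetryBackoff    time.Duration // proposed: initial backoff between attempts
	RetryMaxBackoff time.Duration // proposed: cap for exponential backoff
}
```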
@dadgar I will have a look, thanks for the help, you're awesome :) Do you think it would still be worth deviating from the consul-template retry defaults? For me it was pretty counter-intuitive that a Consul outage has cascading effects on the availability of tasks using the template stanza, and even with a non-fail kill the task would sit pending after it was killed until Consul is available again. On the other hand, I do understand you are considering the possibility of a partition or just the local Consul agent failing. The question is what the default trade-off should be:
- availability (keep the task running, even if its template data may go stale), or
- consistency (kill the task rather than let it keep running with possibly outdated data).

I think I'd go for availability by default, because I get exactly that behaviour with a traditional consul-template + systemd setup (but I can live with either of them).
@stefreak I think once we expose the retry tunables, the job submitter can control that trade-off. If you are okay with stale data, you pick a high or infinite retry count; if not, you bound the retries and let Nomad restart the task to wait for fresh configs.
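As a hedged illustration of that choice (reusing the hypothetical Retry* fields from the sketch above, not a shipped API), a job submitter could lean either way:

```go
package main

import (
	"fmt"
	"time"
)

// TemplateRetry mirrors the hypothetical tunables from the earlier sketch.
type TemplateRetry struct {
	Attempts   int // 0 = retry forever
	Backoff    time.Duration
	MaxBackoff time.Duration
}

func main() {
	// Availability first: tolerate stale data and never give up on Consul
	// coming back; the task keeps running through an outage.
	availability := TemplateRetry{Attempts: 0, Backoff: 250 * time.Millisecond, MaxBackoff: time.Minute}

	// Freshness first: bound the retries so the failure counts against the
	// restart policy and the task can be restarted or placed elsewhere.
	freshness := TemplateRetry{Attempts: 12, Backoff: 250 * time.Millisecond, MaxBackoff: time.Minute}

	fmt.Printf("availability-first: %+v\nfreshness-first: %+v\n", availability, freshness)
}
```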
Hey @stefreak, CT is released and tagged with those new options. Thank you again for the work and the PR. I think we'll need to update the vendored library here and consume/expose those new options.
@sethvargo Cool, I will start working on it as soon as I find some time.
@stefreak Should we close this in the meantime?
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
Retry forever, fixes #2623
I was thinking about getting hashicorp/consul-template#939 merged first, but actually there is no real need to couple the two.
This is WIP because I am first testing the patch in the Nomad cluster where I experience the issue regularly.