Nomad jobs should stay up when vault outage occurs #11209
Hi @mbrezovsky! Thanks for using Nomad, and thank you for filing this issue. I'll do some research and see if there is an existing solution for this use case. In the meantime, do you think you could provide both your Nomad and Vault configuration files for both servers and agents after removing secrets? It would be really helpful in reproducing your exact use case, and then troubleshooting from there.
Of course, here are the configuration files.
Nomad agent
Vault
I use a separate Consul cluster for Nomad and Vault as storage. I would really appreciate it if you could find a solution for this issue. If these configuration files aren't sufficient, I can provide a more detailed setup for better reproduction. It is based on a Terraform/Ansible setup tied to the Hetzner cloud provider.
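The attached Nomad agent and Vault configuration files themselves are not reproduced in the thread. For orientation only, a minimal sketch of the Vault integration block in a Nomad agent configuration is shown below; the address and role name are placeholder values, not the reporter's actual settings.

```hcl
# Nomad agent configuration (server/client) - illustrative values only
vault {
  enabled          = true
  address          = "https://vault.service.consul:8200"
  create_from_role = "nomad-cluster"  # Vault role the servers use to derive task tokens
}
```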
I too have been running into this problem recently. We are currently in the process of migrating Nomad, Consul, and Vault over to a new platform, which has led to a couple of short Vault service outages. Each time Vault has become unavailable to Nomad, all of the jobs that were configured to pull from Vault would wind up being rescheduled and unable to be allocated until Vault access was restored. I do see value in this behavior (and personally think it should remain the default), but there are scenarios in which leaving the job running, with the last obtained secrets remaining in use, would be preferable. My personal preference for addressing this would be to create an additional option for it; a hypothetical example is sketched below.
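A minimal sketch of what such a hypothetical option could look like in a job's template stanza; the option name on_vault_unreachable and its value are purely illustrative and do not exist in Nomad, while the rest is an ordinary template stanza.

```hcl
template {
  data        = <<EOH
{{ with secret "secret/data/myapp" }}
DB_PASSWORD={{ .Data.data.password }}
{{ end }}
EOH
  destination = "secrets/app.env"
  change_mode = "restart"

  # Hypothetical option (not part of Nomad): keep the task running on the last
  # successfully rendered secrets while Vault is unreachable, instead of failing it.
  # on_vault_unreachable = "keep-running"
}
```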
Unless I'm not following something, it looks like if the renewal fails it still marks the token as updated:
According to the comment on this bool it should only be set to true if the token is updated:
That looks to only happen when using the … It seems like it would make sense to separate actual change events from Vault communication errors. As I proposed in my previous comment, this would allow a job to continue running through a communication issue while still performing the necessary action when an actual change to the secrets is detected.
If it doesn't equal …
Right, meaning it sets updatedToken to true. I'm on mobile right now, so I may just not be following the code paths or missing a key part of it, but it looks like once updatedToken is set to true, the next iteration of the loop will result in the specified change mode being triggered.
@rigrassm The issue that @eightseventhreethree is raising is that the token renewal failed, so it shouldn't be triggering the change_mode. This section: nomad/client/allocrunner/taskrunner/vault_hook.go Lines 273 to 283 in 0af4762
contradicts the comment here: nomad/client/allocrunner/taskrunner/vault_hook.go Lines 181 to 182 in 0af4762
It also states in the documentation that …
Ahhh, ok, I see what I was missing yesterday. It's clearing the token, which should put it in an indefinite loop until the token is renewed, and then trigger the change mode. That is peculiar, though, as I have observed this behavior multiple times (even recently). If I get some time this week I may try to set up a test environment to reproduce this, and also test whether the behavior occurs when using noop, to see if this is actually what's leading to the job restart. Beginning to think it's possible that something outside of this code path is leading to the job being restarted, and this behavior is just a side effect.
I've mentioned that change_mode has no impact on this issue. I've already tested it on jobs with mixed change_mode values (noop, restart). All jobs were restarted when the Vault outage occurred. That's the reason why I created this issue; I don't know how to solve it with the available configuration options.
@mbrezovsky Quick question: do any of your jobs' templates use Consul lookups? I just realized while working on one of my job specs that I am using a Consul lookup in one of the job's templates, and a change to the service catalog entry it's looking up resulted in the job being restarted (the default behavior of the template's change_mode). I'm thinking it's possible these restarts could have been caused by a Consul catalog change which, if it occurred when Vault access was down, would result in that job not getting placed. The message shown in the alloc activity could be misconstrued as there being a change in the Vault template when, in reality, it was a Consul template that triggered the restart. If this is in fact the reason for the job being restarted, it may be worth exploring options to avoid this when using both Vault and Consul look-ups in templates. A couple of possible solutions that could be implemented into Nomad off the top of my head for this:
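For clarity on the question above about Consul lookups: a template that mixes a Consul service lookup with a Vault secret, where a catalog change alone can re-render the template and restart the task, might look like the following minimal sketch (the service name and secret path are assumptions).

```hcl
template {
  data        = <<EOH
{{ range service "postgres" }}
DB_ADDR={{ .Address }}:{{ .Port }}
{{ end }}
{{ with secret "secret/data/myapp" }}
DB_PASSWORD={{ .Data.data.password }}
{{ end }}
EOH
  destination = "secrets/db.env"
  change_mode = "restart"  # the default; a Consul catalog change alone triggers it
}
```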
No, actually I have no jobs configured with Consul lookups, just templates with Vault secrets.
I ran into similar issues rendering Vault and Consul KV templates in Nomad jobs, and I would also desperately need a good fix for that problem. From my research so far, it seems that Nomad, or the integrated consul-template implementation, polls the templates' data from Consul and Vault on a regular basis. If for some reason the Consul KV store or Vault is not available, Nomad will start a number of retries with time backoffs. The default number of retries is set to 12 (consul, vault); after the maximum number of retries is exceeded, the job often stops running, which makes my setup much less resilient than I would have expected. This appears to me to be exactly the same scenario of the pending restart @mbrezovsky mentioned.
Or maybe an even more general option could be to expose the full consul-template config and let the user add that custom override file into the nomad agent configuration directory. |
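As an illustration of that suggestion, consul-template's own configuration already supports unbounded retries; an override file of the kind being proposed (which Nomad does not currently consume) might look roughly like the sketch below. The values are assumptions for illustration, not recommended defaults.

```hcl
# consul-template style retry settings - illustrative of the proposed override file
vault {
  retry {
    enabled     = true
    attempts    = 0        # 0 = retry indefinitely in consul-template
    backoff     = "250ms"
    max_backoff = "1m"
  }
}

consul {
  retry {
    enabled     = true
    attempts    = 0
    backoff     = "250ms"
    max_backoff = "1m"
  }
}
```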
We have managed to track down at least one (and hopefully the only) cause of this behavior: in the allocrunner/taskrunner code for template tasks, there is a rerenderTemplate function that loops forever selecting on channels. One of the channels reports errors from the task runner. If an error occurs, Nomad marks all tasks as failed. The impact of this is that at a time when Nomad is completely unable to successfully start allocs (because the template renderer cannot contact Vault), it brings them all down, essentially (in our case) converting a template render error into a complete site outage. The reason this doesn't occur when templates are configured with change_mode=noop is that there is a previous code segment that checks if all templates are noops and, if so, returns before starting the loop. When I am back at my work computer, I will link the relevant lines of code. In my opinion, the proper behavior when unable to render a template for any reason is to log the error and stop the current attempt to update the alloc in question. I cannot think of a situation in which failing healthy running services when they cannot be restarted is reasonable behavior, and if there are such situations, one should not be subjected to site outages as the default behavior.
The code block which marks the services as failed is at L378 here: nomad/client/allocrunner/taskrunner/template/template.go Lines 365 to 386 in 0af4762
The reason this doesn't happen when (all) templates have a change_mode of noop: nomad/client/allocrunner/taskrunner/template/template.go Lines 229 to 234 in 0af4762
As a hacky workaround, we had hoped to be able to extend the connection retries on the consul template runner to multiple days to give us time to get vault healthy, but that parameter is not currently exposed in nomad. (It would be the
But those values will not be configurable until #11606 is deployed.
As @jcdyer mentioned, this PR should expose the Vault retry configuration parameters and allow for a more fault-tolerant approach.
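Assuming that change exposes the retry options in roughly the same shape as consul-template's, the Nomad client agent configuration would gain settings along these lines; this is a sketch, not confirmed syntax for any particular Nomad release.

```hcl
# Nomad client agent configuration - assumes a release that includes the
# retry options referenced above
client {
  template {
    vault_retry {
      attempts    = 0        # assumed to mirror consul-template: 0 = retry indefinitely
      backoff     = "250ms"
      max_backoff = "1m"
    }

    consul_retry {
      attempts    = 0
      backoff     = "250ms"
      max_backoff = "1m"
    }
  }
}
```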
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Proposal
This issue is about the integration with Vault. Running Nomad jobs should stay alive as long as possible after a Vault outage.
Use-cases
We have a Nomad cluster and Vault in HA. Even though the clusters are scaled to cover minor server outages, sometimes an outage is unpredictable. After network issues on the cloud provider's side, our Vault cluster was out of service.
First, Nomad received 503 responses from Vault, and after a short period it restarted all jobs with a template stanza. All jobs ended up in the pending state until Vault was back online. This did not depend on the change_mode in the template stanza.
Attempted Solutions
Nomad could switch to an emergency mode when a Vault outage is detected. In this case, common processes (TTL renewals, etc.) could be disabled and enabled again once Vault becomes available.