Sometimes template stanza does not reload allocations #22224
Comments
@OfficerKoo your logs are mangled such that timestamps don't line up (and in reverse). But it looks like they're showing the template runner restarted the task at Then I see a gap of 2 hours before there's a user-initiated restart. Most of what I can tell here from the logs looks right... Nomad can't restart a task that hasn't yet started. If I'm missing something, it might help if you could show this with debug-level logs that have consistent timestamps. |
@tgross sorry about that, it looks like the logs got all twisted up when I was exporting them. You are right, it looks like the second re-render happens a second after the Docker image download. What can we do to circumvent this? I already tried adding a 1 minute splay and a minimum 1 minute wait. We will try to enable debug logs on one client, but this will probably happen next week. |
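For readers following along, the splay and wait settings mentioned above would look roughly like the fragment below in a job file. This is only a sketch and not the reporter's actual job: the Vault path `database/creds/app-role`, the destination file name, and the `max` value are assumptions made up for illustration. Only the 1 minute splay and 1 minute minimum wait come from the comment.

```hcl
# Hypothetical template block; role name, destination and max wait are assumed.
template {
  # Render dynamic database credentials from Vault's database secrets engine.
  data = <<EOT
{{ with secret "database/creds/app-role" }}
DB_USER={{ .Data.username }}
DB_PASS={{ .Data.password }}
{{ end }}
EOT

  destination = "secrets/db.env"
  env         = true      # expose the rendered keys as environment variables
  change_mode = "restart" # restart the task when the rendered content changes

  splay = "1m"            # random delay of up to 1 minute before the restart

  wait {
    min = "1m"            # wait at least 1 minute for data to settle before rendering
    max = "2m"            # assumed upper bound, not stated in the thread
  }
}
```

Note that both settings only delay rendering or the restart; as the discussion above suggests, they would not necessarily help if the re-render lands while the task has not yet started.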
Ok, just wanted to make sure I was reading that correctly. The Taking a quick pass over the code it looks like we send the But fixing the log wouldn't fix the underlying problem... we could drop re-render events such that the task start could race with any environment set by ``template.env
Thanks! If you look at the logs and see anything you feel is too sensitive to share publicly, you can email them to [email protected], which is only visible to members of the Nomad engineering team. |
Hate to hijack the issue but I've seen an issue with workloads not receiving a signal after updates occasionally as well. In my case what happens is that I have a workload with dynamic certs. When Nomad agent is restarted templates with dynamic certs are always refreshed, but in some cases the signal event isn't triggered. I see in the log:
As you can see I'm missing the expected re-rendered msg:
It doesn't happen often but often enough to occasionally break production workloads due to expired certificates. I also haven't been able to reproduce by restarting the same agent that had this issue before. The job template |
@tgross Hi, we enabled debug logs but didn't catch anything suspicious. It does look like the error happens when the container restarts, but the only suspicious part in the logs is
which happens right after the final config is rendered. |
Nomad version
nomad_1.6.1
Operating system and Environment details
AWS, Amazon Linux 2, Production
Issue
We use the Vault Nomad integration for database secrets in our application. Sometimes Nomad re-renders the secrets in the file but fails to trigger a restart, so the env is not provisioned.
Usually 1 allocation of the task is affected, and all jobs that use the database secrets engine are affected. A restart fixes the problem.
The issue occurs only in production; we failed to reproduce it on our staging cluster.
We checked some other issues, such as #6112, but they do not fit our profile: our jobs do not depend on other jobs, they just use the Vault database engine directly.
Reproduction steps
Expected Result
Actual Result
Job file (if appropriate)
Sample job, but all the jobs that use database secrets are affected
Nomad Server logs (if appropriate)
Nomad Client logs (if appropriate)