[unsure, still investigating] significant rise in vault leader CPU usage during and after upgrade to nomad 0.6.2 #3133
Comments
log-level trace gave nothing.
also, as a result of several step-downs while trying to force active on a node with TRACE logging, one of the old leaders is still hovering at 50-60% CPU even though it's not on active duty.
[screenshots: old leader, new leader]
is this something we should discuss in vault repo instead? |
@wuub I would move this to Vault. I just went through the change history of the files that touch Vault and nothing has really changed since 0.5.6. Do you use the template stanza with Vault? I am going to close this but we can keep discussing. |
I'll drop by Vault issues tomorrow. And yes, we use template stanza a lot. With and without values from Vault. |
Hey @wuub, I was following the Vault issue. It would be great if you could share what you can now about the job causing this. Would like a 0.6.3 to include any fix that comes from this and would like to start investigating asap. |
@dadgar I'll try to distill our jobspec to something that triggers this behavior later today. |
@dadgar more investigation will have to wait until tomorrow, but this is the obfuscated jobspec
I don't know what might be causing constant queries about the secret yet, but one thing that this file has different is change_mode="noop" + mixing |
@wuub are those allocations restarting a lot because of the template re-rendering? Could we see the task events? |
@dadgar does not look like it:
EDIT: we also have several different alerting rules that detect flapping services (by metric restarts, consul checks flapping, etc.), and none of them said anything about this job restarting, so I'm pretty confident it stays on once started. |
@wuub Would it be possible to share the relevant audit logs and dump the telemetry by sending SIGUSR1 while Vault is under high load? These will help pin down what is happening. |
We will try to track this down today. |
I'm able to reliably reproduce this with empty |
@dadgar I set it up with tmux/tmuxp -> https://github.com/wuub/nomad-vault-repro/blob/master/tinycloud.yaml Consul v0.9.2 Vault audit logs are full of (a LOT of bytes/sec).
Not sure if this is related, but I'm having issues with 0.6.2 and the templates constantly renewing credentials from vault and causing a SIGHUP. There are some logs and background here. |
@wuub Thanks for reproduction steps! Reproduced it now |
@wuub Just an update, this is a bug in the latest consul-template which we vendor in. Can reproduce with Nomad out of the picture. |
@adamlc would you mind running your template using consul-template on 0.19.0 and 0.19.1 and see if the issue only appears with 0.19.1? |
This fixes an issue introduced in e2155b4 where renewals against the generic secret backend with a TTL close to the VaultGrace would cause a renewal hotloop. This fixes hashicorp/nomad#3133
Hey all, this is fixed and vendored in. |
@wuub We are going to do an 0.6.3 RC1 on Monday. Would appreciate if you could test for this! Thanks and hope you have a good weekend! |
* Vendor consul-template Fixes #3133 * changelog
@adamlc we're rolling out 0.6.3-rc1 right now. we'll see in an hour or so. |
Fair enough, clearly I've got some sort of other issue. I'll do some digging and open a ticket when I figure out the issue |
@adamlc Want to post your template/job file? How sure are you that the data is not actually changing in Consul? Also, could you share the DEBUG logs of the client? |
Sorry I haven't had time to debug this yet. I'll give you a shout when I've got all the required information :) |
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad version
0.6.2
Operating system and Environment details
Linux
Vault: 0.7.3
Issue
At some point during our upgrade from nomad 0.5.6 to nomad 0.6.2 (correlation, !causation) we got an alert about high CPU usage on our vault leader. We've since forced different leaders to step down several times, and the high CPU load follows the leader even after a restart.
vault logs are identical to the ones we were seeing when cpu usage was 10x-20x lower. my weak strace-fu and gdb-fu did not point to anything obvious.
next step: we'll try to run one vault server with log-level=trace, and force it to assume leader role using a series of step-downs.
Do you have any ideas about other things we can check that could have such effect on the vault leader?
Reproduction steps
unsure
Nomad Server logs (if appropriate)
silent
Nomad Client logs (if appropriate)
nothing vault related
Job file (if appropriate)