Unexplained, persistent CPU usage increase on Vault cluster #3265
Comments
I don't see how, given that Consul and Vault stayed the same and only Nomad was upgraded, there is reason to think this is Vault related. What else is running on the node?
We colocate nomad/consul/vault servers on the same set of machines, and historically the load on them was low, with something like this being the standard load we're accustomed to on non-leader nodes.
EDIT: ^ yes, no vault in sight, because its CPU usage has historically been so low that we didn't see it in top at all. The leader consul node can sometimes spike to the mid 20-30 %CPU range due to a significant number of watches. During the nomad update the vault leader started to experience loads like this:
with consul and nomad servers staying below 10 %CPU. What I'm really looking for is an idea of how to check what vault is doing, because strace/gdb/log-level did not get us anywhere. Interesting factoids (noticed just now).
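One way to get more visibility than strace/gdb/log-level is Vault's telemetry stanza in the server config, which exports runtime and request metrics. A minimal sketch, with a hypothetical statsd address (not taken from this cluster's config):

```hcl
# Minimal sketch, assuming a statsd agent listening locally (address is hypothetical).
# The telemetry stanza makes Vault emit runtime/request metrics, which is one way to
# see what the process is busy doing when strace/gdb don't help.
telemetry {
  statsd_address   = "127.0.0.1:8125"  # forward metrics to a local statsd agent
  disable_hostname = true              # keep metric names stable across hosts
}
```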
EDIT"
At the risk of stating the obvious: please note that you originally said that the node experienced high CPU utilization, not the Vault process, and did not break the information down further, and the graph has no labels. As you stated, the same node is hosting Vault, Nomad, and Consul, and in this kind of situation we would generally expect Consul CPU utilization to be high, not Vault's (during an active-node changeover, Vault loads a potentially large number of leases from Consul).
Is Vault active (responding to requests) during this period of high CPU utilization? What kind of workloads are being run against Vault? Do you experience this with 0.8.1?
@jefferai please look at the original issue reported in the nomad repo that I linked above: hashicorp/nomad#3133. High vault CPU use is clearly stated there; I assumed you had taken a look there as well. Sorry for not repeating it here more explicitly, I'll try to be more careful with this kind of thing in the future. As for the unlabeled graph and other obscured information: we're still in the process of investigating this, and I'm trying to share as much information as possible with you without exposing too much of our internal topology in public. I hope you understand my concern.

Back to the issue at hand: we're getting closer to pinning this down. The amount of TCP traffic between a few of the nomad client processes and the vault servers is insane (the traffic is between the main nomad processes and vault, not between nomad executors and vault), on the order of 20 MB/sec, while other nomad clients in different classes generate only KB/sec. This also seems to be why high CPU persists on a stepped-down vault server: the heavy traffic is being forwarded through it to the currently active vault leader.
Yes, Vault is active and correctly responding to requests the whole time. It's just that we're seeing CPU usage 40x-80x higher than expected, and network traffic far beyond what Vault should need.
As for the workloads: consul-templates, either directly (a couple) or indirectly using nomad's template stanza.
We did not update yet, but after the recent development of noticing the super-high traffic described above, our current working theory is that there is no problem in vault itself; most likely it's some combination of our jobspecs, the new nomad version, and consul-template. We're concerned about the interaction of explicit_max_ttl="0s" (set so low for periodic tokens) and the new vault_grace="5m" and its description here (https://www.nomadproject.io/docs/job-specification/template.html#vault_grace). I'll share more information as we gather it.
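For reference, a minimal, hypothetical sketch of how vault_grace sits inside a Nomad template stanza; the task name, driver, policy, secret path and template body are made up, not taken from the jobspecs in question:

```hcl
task "web" {
  driver = "exec"
  config {
    command = "/usr/local/bin/web"   # hypothetical binary
  }

  vault {
    policies = ["web-secrets"]       # task token derived from Nomad's Vault token role
  }

  template {
    destination = "secrets/app.env"
    env         = true
    data        = "API_KEY={{ with secret \"secret/web\" }}{{ .Data.api_key }}{{ end }}"

    # Roughly: consul-template re-fetches a secret once its remaining lease drops
    # below this window; the Nomad docs linked above list "5m" as the default.
    vault_grace = "5m"
  }
}
```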
Hi there, it seems very reasonable that if Nomad is making huge numbers of requests to Vault you'd experience high CPU usage; for each request Vault has to do a bunch of work. So given the information, I would guess your hunch is right. An explicit max TTL of 0 simply means that there isn't an explicit max TTL set. Generally this is what you want with periodic tokens, since the point of periodic tokens is for them to be able to live as long as the process they are attached to (as in, whatever is renewing them), with a relatively low TTL. What is your period set to? If it's set super low, and a particular Nomad agent has spun out a significant number of tokens with a low period and with templates, I can imagine that you're just seeing tons of renewal requests. As you suspected, an audit log would help quite a lot :-)
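For reference on where that period is configured: a hypothetical sketch of the Nomad agent's vault stanza, with an illustrative address and role name. The period and explicit_max_ttl themselves live on the Vault token role that create_from_role points at, not in this file:

```hcl
# Hypothetical Nomad agent config (HCL); values are illustrative only.
vault {
  enabled          = true
  address          = "https://vault.service.consul:8200"
  create_from_role = "nomad-cluster"  # Vault token role; its period governs how often
                                      # derived periodic tokens must be renewed, and
                                      # explicit_max_ttl = 0 on that role means
                                      # "no explicit max TTL"
}
```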
@jefferai we've found it.
We enabled the audit log and instantly found the job responsible.
^ I can confirm that this is mostly true and that the tight loop originated from nomad, as we initially suspected.
No problem! Glad you found the culprit.
I first reported this in the nomad repository (link in References) as it seems to be correlated with the nomad cluster upgrade, but @dadgar advised continuing the investigation here.
Environment:
Consul v0.8.4
Nomad v0.6.2
consul-template v0.19.0 (33b34b3) (mostly; this is unfortunately not homogeneous)
Vault Version:
Vault v0.7.3 ('0b20ae0b9b7a748d607082b1add3663a28e31b68')
Operating System/Architecture:
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.1 LTS"
Linux 4.4.0-70-generic #91-Ubuntu SMP Wed Mar 22 12:47:43 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Vault Config File:
Startup Log Output:
https://gist.github.com/wuub/cd9f9000bc977279c36dbe3adc338fd2
Expected Behavior:
Semantically, vault continues to work nominally, but its CPU usage significantly exceeds previously seen averages (by an order of magnitude) without an obvious reason or increased workload.
Actual Behavior:
You can see a series of step-downs in the attached load_1 graph; it shows some of the behaviors we've experienced.
Steps to Reproduce:
n/a
Important Factoids:
Vault is accessed mostly via consul-templates, either directly (a couple) or indirectly using nomad's template stanza.
References:
Initially reported in nomad repo: hashicorp/nomad#3133