
Unexplained, persistent CPU usage increase on Vault cluster #3265

Closed
wuub opened this issue Aug 31, 2017 · 8 comments

Comments

@wuub

wuub commented Aug 31, 2017

I first reported this in the nomad repository (link in References) as it seems to be correlated with a nomad cluster upgrade, but @dadgar advised to continue the investigation here.

Environment:
Consul v0.8.4
Nomad v0.6.2
consul-template v0.19.0 (33b34b3) (mostly; this is unfortunately not homogeneous)

  • Vault Version:
    Vault v0.7.3 ('0b20ae0b9b7a748d607082b1add3663a28e31b68')

  • Operating System/Architecture:
    DISTRIB_ID=Ubuntu
    DISTRIB_RELEASE=16.04
    DISTRIB_CODENAME=xenial
    DISTRIB_DESCRIPTION="Ubuntu 16.04.1 LTS"
    Linux 4.4.0-70-generic #91-Ubuntu SMP Wed Mar 22 12:47:43 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Vault Config File:

backend "consul" {
  address = "localhost:8500"
  path = "vault/"
  disable_clustering = "false" 
}

listener "tcp" {
  address = "0.0.0.0:8200"

  tls_cert_file = "/etc/vault.d/cert/[srv].cert.pem"
  tls_key_file  = "/etc/vault.d/cert/[srv].key.pem"
}

#telemetry {
#  statsite_address = "127.0.0.1:8125"
#  disable_hostname = true
#}

Startup Log Output:
https://gist.github.com/wuub/cd9f9000bc977279c36dbe3adc338fd2

Expected Behavior:
Semantically vault continues to work nominally, but its CPU usage significantly exceeds previously seen averages (by an order of magnitude) without an obvious reason or increased workload.

Actual Behavior:
You can see a series of step-downs in the attached load_1 graph. It shows some behaviors we've experienced:

  • unexplained initial spike of CPU load approximately 1h into the nomad cluster upgrade (non-homogeneous nomad cluster at this point) (#red)
  • CPU load following the active leader through a series of step-downs (in general)
  • CPU load not falling on a node that gave up active duty (blue/green)

[screenshot: load_1 graph showing the step-downs described above]

Steps to Reproduce:
n/a

Important Factoids:

  • nomad jobs are the main consumer of vault secrets through the template stanza.
  • neither nomad nor vault logs point to any work done by the vault leader that would explain the higher CPU usage
  • we're running with the audit log (and telemetry) disabled right now due to panics in Vault 0.7.3; this significantly lowers visibility into vault internals. We are upgrading Vault right now and will enable audit logs asap
  • this can only be correlated with the nomad upgrade, not caused by it. We did not stop normal operation during the upgrade, so new jobs were being deployed and old jobs were being stopped by our team.

References:
Initially reported in nomad repo: hashicorp/nomad#3133

@jefferai
Member

I don't see how, given Consul and Vault staying the same and only a Nomad upgrade, there is a reason to think that this is Vault related.

What else is running on the node?

@wuub
Author

wuub commented Aug 31, 2017

We colocate the nomad/consul/vault servers on the same set of machines, and historically load on those was low.

Something like this is the standard load that we're accustomed to for non-leader nodes:

VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                                                                                                                                                   
169420 124632  50612 S  7.6  3.3   3571:43 /usr/local/bin/consul agent --config-dir /etc/consul.d -advertise=...                                                                                             
18860  14148   3092 S  5.6  0.4 286:13.03 /usr/local/bin/consul_exporter -consul.server=localhost:8500                                                                                                              
708724 252140 124356 S  2.3  6.7  64:50.50 /usr/local/bin/nomad agent --config /etc/nomad.d  

EDIT: ^ yes, no vault in sight, because its CPU usage was historically so low that we didn't see it in top at all.

The leader consul node can sometimes spike to 20-30 %CPU due to a significant number of watches.

During the nomad update the vault leader started to experience loads like this:

PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                                                                                                                                                   
20   0   52508  52140  35136 S 74.0  1.4   1090:02 /usr/local/bin/vault server -config=/etc/vault.d/vault.hcl -log-level=trace 

with consul and nomad servers staying sub 10 %CPU.

What I'm really looking for is an idea of how to check what vault is doing, because strace/gdb/log-level did not get us anywhere.

Interesting factoids (noticed just now):

  • while disk operations, disk bytes and network bytes stayed the same, the number of network packets that the vault leader (EDIT: obviously this is at the node level) is receiving and sending back is at an all-time high. This sometimes stays high even if we instruct the active vault to step down.

[screenshot: network packets graph for the vault leader node]

EDIT:

I don't see how, (...), there is a reason to think that this is Vault related.

At the risk of stating the obvious:

  • high CPU utilization is caused by the vault process
  • high CPU utilization follows the active vault server after forced step-downs
  • high CPU utilization of the vault process often does not stop even after it gives up active duty; it only stops after we restart it and unseal again

@jefferai
Member

At the risk of stating the obvious:

high CPU utilization is caused by the vault process
high CPU utilization follows the active vault server after forced step-downs
high CPU utilization of the vault process often does not stop even after it gives up active duty; it only stops after we restart it and unseal again

Please note that you originally said that the node experienced high CPU utilization, not the Vault process, and did not break down the information further. The graph has no labels. As you stated, the same node is hosting Vault, Nomad, and Consul, and generally in this kind of situation we would expect Consul CPU utilization to be high, not Vault's (since during an active node changeover Vault loads a potentially large number of leases from Consul).

@jefferai
Member

Is Vault active (responding to requests) during this period of high CPU utilization?

What kind of workloads are being run against Vault?

Do you experience this with 0.8.1?

@wuub
Author

wuub commented Aug 31, 2017

@jefferai please look at the original issue reported in the nomad repo that I linked above: hashicorp/nomad#3133. High vault CPU use is clearly stated there; I assumed you had taken a look there as well. Sorry for not repeating it here more explicitly, I'll try to be more careful with stuff like this in the future.

As to the unlabeled graph and other obscured information: we're still in the process of investigating this and I'm trying to share as much information as possible with you without exposing too much of our internal topology in public. I hope you understand my concern.

Back to the issue at hand:

We're getting closer to pinning this down. The amount of TCP traffic between a few of the nomad client processes (traffic between the main nomad processes and vault, not between nomad-executors and vault) and the vault servers is insane, on the order of 20MB/sec, while other nomad clients in different classes generate only KBs/sec.

This also seems to be the reason why high CPU stays on a stepped-down vault server: the high traffic is being forwarded through it to the currently active vault leader.
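For context, this matches the consul backend configuration posted above. If I understand Vault's request forwarding correctly, with clustering left enabled a standby node forwards client requests to the active node over the cluster port instead of redirecting clients, so a node that has just stepped down keeps relaying all of that traffic:

backend "consul" {
  address            = "localhost:8500"
  path               = "vault/"
  # with clustering enabled, standbys forward requests to the active node
  # rather than redirecting clients, which would explain CPU staying high
  # on a node that has just stepped down
  disable_clustering = "false"
}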

Is Vault active (responding to requests) during this period of high CPU utilization?

Yes, Vault is active and correctly responding to requests the whole time. It's just that we're seeing CPU usage 40x-80x higher than expected and network traffic that should not be needed for Vault.

What kind of workloads are being run against Vault?

consul-templates, either directly (a couple) or indirectly via the template {} stanza in nomad (more, but nothing to write home about, maybe 50-100 total instances)
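For the direct consumers the setup is roughly the sketch below (the address and paths here are made up, not our real ones); consul-template renders the secret to a file and keeps renewing its Vault token in the background:

vault {
  # made-up address; consul-template talks to the active vault
  address     = "https://active.vault.service.consul:8200"
  renew_token = true
}

template {
  source      = "/etc/consul-template.d/app.ctmpl"   # made-up paths
  destination = "/etc/app/secrets.env"
}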

Do you experience this with 0.8.1?

We did not update yet, but after the recent development of noticing super high traffic from nomad to vault I'm pretty certain an upgrade would not do much. But we will upgrade to enable audit logs.

Right now our working theory is that there is no problem at all in vault itself. If anything, it's performing admirably under heavy traffic.

Most likely some combination of our jobspecs, the new nomad version and vault policies is causing a tight loop of renews or something similar that's not being logged by anyone.

We're concerned about the interaction of explicit_max_ttl="0s" (set so low for periodic tokens) with the new vault_grace="5m" and its description here (https://www.nomadproject.io/docs/job-specification/template.html#vault_grace).
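To illustrate, the affected jobs use templates along these lines (job, image, policy and secret names here are made up); the part we're suspicious about is the combination of the task's periodic Vault token and vault_grace:

job "example-app" {
  datacenters = ["dc1"]

  group "app" {
    task "web" {
      driver = "docker"

      config {
        image = "example/web:latest"   # made-up image
      }

      vault {
        policies = ["example-app"]     # made-up policy name
      }

      template {
        destination = "secrets/app.env"
        env         = true
        change_mode = "restart"
        vault_grace = "5m"             # the new grace setting referenced above
        data        = <<EOH
{{ with secret "secret/example-app" }}API_KEY={{ .Data.api_key }}{{ end }}
EOH
      }
    }
  }
}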

I'll share more information as we gather it.

@jefferai
Member

Hi there,

It seems very reasonable that if Nomad is making huge numbers of requests to Vault you'd experience high CPU usage; for each request it has to do a bunch of work. So given the information, I would guess your hunch is right.

Explicit max TTL set to 0 simply means that there isn't an explicit max TTL set. Generally this is what you want with periodic tokens, since the point of periodic tokens is for them to be able to live as long as the process they are attached to (as in, whatever is renewing them), with a relatively low TTL.

What is your period set to? If it's set super low, and a particular Nomad agent has spun out a significant number of tokens with a low period and with templates, I can imagine that you're just seeing tons of renewal requests.
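To be clear about where that period lives: it is not part of the template stanza; if Nomad is deriving task tokens from a Vault token role, the period comes from that role on the Vault side. Roughly (the role name and address below are just placeholders):

# nomad server config fragment (placeholder values); the actual period is a
# property of the Vault token role the tokens are created from, e.g.
# auth/token/roles/nomad-cluster, not of this stanza
vault {
  enabled          = true
  address          = "https://vault.service.consul:8200"
  create_from_role = "nomad-cluster"
}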

As you suspected, an audit log would help quite a lot :-)

@wuub
Author

wuub commented Aug 31, 2017

@jefferai we've found it.

But we will upgrade to enable audit logs.

we enabled the audit log and instantly found the job responsible

Most likely some combination of our jobspecs, the new nomad version and vault policies is causing a tight loop of renews or something similar that's not being logged by anyone.

^ I can confirm that this is mostly true and that the tight loop originated from nomad, as we initially suspected.
Sorry for taking this here and wasting your time; we'll write up reproduction steps tomorrow and take it back to the nomad repo :)

@wuub wuub closed this as completed Aug 31, 2017
@jefferai
Member

No problem! Glad you found the culprit.
