Nomad doesn’t retain watcher state on restarts which causes secret engine lease issues #9491
Comments
@gulavanir - I was able to get this to work by setting an aggressive renewal time. Basically I configured the consul secrets engine NOT to use the default of 32 days by running `vault secrets tune -default-lease-ttl=30s -max-lease-ttl=1m consul/`. With that in place I was able to see Nomad/consul-template refreshing the token every few seconds (given the aggressive expiry I set). When the lease reached the max-lease-ttl value, the old token was deprecated and a new one issued, which then caused Nomad to re-render the template and start the container. I am wondering if what you're seeing has to do with a watcher failing (since the renewal time is so long by default) and never recovering? |
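For reference, the tuning described above boils down to roughly the following; the `consul/` mount matches the comment, but the role name and the `journalctl` unit are placeholders:

```shell
# Shorten the lease TTLs on the Consul secrets engine so renewal happens quickly.
vault secrets tune -default-lease-ttl=30s -max-lease-ttl=1m consul/

# Request a token and note the lease_duration in the output ("my-role" is a placeholder).
vault read consul/creds/my-role

# Watch the client logs for consul-template renewing the lease and, once
# max-lease-ttl is hit, re-issuing the token and re-rendering the template.
journalctl -u nomad -f | grep -i template
```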
@fredwangwang was able to reproduce this with some pretty catastrophic results. I've asked him to update this thread, but here's his comment on the Hashicorp Discuss forums https://discuss.hashicorp.com/t/nomad-job-doesnt-renew-consul-token-generated-using-nomad-job-templates-and-consul-secrets-engine/18316/9 @tgross - this is a nasty one |
tldr: state loss during a Nomad restart causes a token lease leak in templates. Link to the discussion for context: https://discuss.hashicorp.com/t/nomad-job-doesnt-renew-consul-token-generated-using-nomad-job-templates-and-consul-secrets-engine/18316/9 The issue is that during a Nomad restart, Nomad loses its state. The new Nomad client will:
However, at this point the configuration (hence the token) held by the process running inside the container is still the old one. Once the old token lease expires, the process starts to fail. Logs (after nomad restart):
|
I think this is hitting a case that the long-open PR #4226 is attempting to fix, but which we didn't have enough context to fully understand at the time. I'm not sure why that hasn't had further review yet, but we should make sure it gets onto our radar. Also, folks, please let's avoid editorializing in issue titles. |
Ahh - thanks @tgross - good point on the title, and noted. Seems like the long-open PR is related. We can reliably replicate it at this point, so our team can provide any details you need. |
@tgross I was thinking of something similar to #4226 (comment) last night. It works; however, it triggers a coordinated restart of all the jobs on a particular node whose templates get re-rendered every time Nomad restarts. So it is a quick patch for this urgent problem, but to properly correct the issue it feels like Nomad should persist state across restarts. |
Some "thinking out loud" after some initial investigation:
|
Wanted to leave a status update on the investigation so far. There are a few layers to this problem as presented that make it a little harder to understand:
When Nomad is persisting a token for the task runner in vault_hook.go#L129-L143, that's Nomad's token. The secrets for anything we render to a template have no saved state, and that's the TTL that's expiring between restarts. This is effectively the same thing that would happen if we ran

That also means we should be able to reproduce this with a simpler setup for any dynamic secrets, like an x509 cert. Fortunately we have an E2E test we can borrow for that.

setup and configuration files

Single-node Vault config:

```hcl
listener "tcp" {
address = "0.0.0.0:8200"
tls_disable = 1
}
storage "raft" {
path = "/var/vault/raft"
node_id = "raft_node_1"
}
api_addr = "http://127.0.0.1:8200"
cluster_addr = "https://127.0.0.1:8201"
ui = true
```

Bootstrap Vault with Nomad policy:

```shell
export VAULT_ADDRESS="http://127.0.0.1:8200"
vault operator init -key-shares=1 -key-threshold=1
vault operator unseal
export VAULT_TOKEN=<redacted>
vault policy write nomad-server ./e2e/terraform/scripts/vault-nomad-server-policy.hcl
vault write /auth/token/roles/nomad-cluster @./e2e/terraform/scripts/vault-nomad-cluster-role.json
vault token create -policy nomad-server -period 72h -orphan
# get the resulting token for our Nomad config
```

Single-node Nomad config:

```hcl
name = "standalone"
log_level = "DEBUG"
enable_debug = true
data_dir = "/var/nomad"
server {
enabled = true
bootstrap_expect = 1
}
client {
enabled = true
}
plugin "docker" {
config {
volumes {
enabled = true
}
}
}
vault {
enabled = true
address = "http://localhost:8200"
create_from_role = "nomad-cluster"
token = "<redacted>"
}
```

Configure PKI engine. pki-policy.hcl file:

```hcl
path "pki-test9491/issue/nomad" {
  capabilities = ["create", "update", "read"]
}
```
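The exact commands used to enable and configure the PKI engine aren't shown here; a rough sketch, assuming the mount and role names implied by the policy and jobspec, and a 1-minute certificate TTL to match the expiry seen in the tests below:

```shell
# Enable a PKI engine at the mount path the policy grants access to.
vault secrets enable -path=pki-test9491 pki

# Generate a throwaway internal root CA for the test.
vault write pki-test9491/root/generate/internal \
    common_name="service.consul" ttl=1h

# Role behind the jobspec's "pki-test9491/issue/nomad" path; max_ttl=1m so
# issued certificates expire quickly.
vault write pki-test9491/roles/nomad \
    allowed_domains="service.consul" allow_subdomains=true \
    generate_lease=true max_ttl=1m

# Load the policy above, named to match the jobspec's vault.policies.
vault policy write pki-test9491 ./pki-policy.hcl
```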
jobspec:

```hcl
job "example" {
datacenters = ["dc1"]
group "group" {
task "task" {
driver = "docker"
config {
image = "busybox:1"
command = "httpd"
args = ["-v", "-f", "-p", "8001", "-h", "/www"]
volumes = ["/tmp/www:/www:ro"]
}
vault {
policies = ["pki-test9491"]
}
template {
data = <<EOT
{{ with secret "pki-test9491/issue/nomad" "common_name=nomad.service.consul" "ip_sans=127.0.0.1" }}
{{- .Data.certificate -}}
{{ end }}
EOT
destination = "${NOMAD_SECRETS_DIR}/certificate.crt"
change_mode = "noop"
}
resources {
cpu = 128
memory = 64
}
}
}
}
```

Note in our jobspec that the template's `change_mode` is set to `"noop"`.

```shell
$ nomad job run ./example.nomad
==> Monitoring evaluation "8ff50f77"
Evaluation triggered by job "example"
==> Monitoring evaluation "8ff50f77"
Evaluation within deployment: "0aee964a"
Allocation "d9b1a5c8" created: node "53d42465", group "group"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "8ff50f77" finished with status "complete"
$ nomad alloc status d9b
...
Recent Events:
Time Type Description
2020-12-09T19:19:58Z Started Task started by client
2020-12-09T19:19:57Z Task Setup Building Task Directory
2020-12-09T19:19:57Z Received Task received by client
# fingerprint the cert
$ nomad alloc exec d9b cat /secrets/certificate.crt | md5sum
2e5c01b4bf659e298664a88bf3798663 -
# wait until the TTL expires (plus some room), and verify the cert has changed
$ sleep 90 && nomad alloc exec d9b cat /secrets/certificate.crt | md5sum
7cae1b2510b24b02a0194ca6a231239e -
```

Now let's test restarting Nomad and checking that the certificate is renewed when we expect. test1.sh:

```bash
#!/bin/bash
alloc_id=$1
now=$(date +%s)
echo "now: $now"
expiry=$(nomad alloc exec $alloc_id cat /secrets/certificate.crt |
openssl x509 -noout -enddate |
cut -d= -f2 |
xargs -I^ date -d '^' +%s)
echo "expires at: $expiry"
echo "restarting Nomad..."
sudo systemctl restart nomad
now=$(date +%s)
sleep_for=$(( $expiry - $now ))
echo "waiting ${sleep_for}s for cert to expire..."
sleep $sleep_for
new_expiry=$(nomad alloc exec $alloc_id cat /secrets/certificate.crt |
openssl x509 -noout -enddate |
cut -d= -f2 |
xargs -I^ date -d '^' +%s)
echo "new expiry: $new_expiry"
if [[ "$new_expiry" != "$expiry" ]]; then
echo "certificate was renewed!"
fi
```

Results look good:
Now we'll test stopping Nomad, waiting for expiry, and then restarting Nomad. In this test we wait to stop Nomad so that we're not risking the allocation being marked as lost and confounding the problem. test2.sh:

```bash
#!/bin/bash
alloc_id=$1
start=$(date +%s)
echo "now: $start"
expiry=$(nomad alloc exec $alloc_id cat /secrets/certificate.crt |
openssl x509 -noout -enddate |
cut -d= -f2 |
xargs -I^ date -d '^' +%s)
echo "expires at: $expiry"
sleep_for=$(( $expiry - $start ))
if [[ "$sleep_for" -gt "30" ]]; then
echo "waiting 30s to stop nomad..."
sleep 30
else
echo "only ${sleep_for}s left till expiry, stopping now..."
fi
echo "stopping Nomad..."
sudo systemctl stop nomad
start=$(date +%s)
sleep_for=$(( $expiry - $start ))
if [[ "$sleep_for" -gt "0" ]]; then
echo "waiting ${sleep_for}s for cert to expire, +5s..."
sleep $sleep_for
sleep 5
fi
echo "restarting nomad..."
sudo systemctl start nomad
# wait for nomad to be live again before querying it
while :
do
nomad status 2> /dev/null && break
sleep 1
done
new_expiry=$(nomad alloc exec $alloc_id cat /secrets/certificate.crt |
openssl x509 -noout -enddate |
cut -d= -f2 |
xargs -I^ date -d '^' +%s)
echo "new expiry: $new_expiry"
if [[ "$new_expiry" != "$expiry" ]]; then
echo "certificate was renewed!"
else
echo "certificate was not renewed!"
fi
```

The results look good there too!
Logs, for posterity.
So we see the secret being renewed as expected. But what about the allocation itself? Let's change out the template's `change_mode` from `"noop"` to `"restart"` and run the job again:
But the task isn't being restarted:
If we wait a minute for that new cert to expire, we're back in business:
So it appears that the problem is not that Nomad is losing access to the secret engine, or that consul-template isn't trying to renew the secret when it's supposed to. It's that when we restart the Nomad client we don't treat a change to the secret as a trigger for a task restart (or signal). We have a new secret, but haven't told the workload that there is one. I feel reasonably certain about this conclusion but don't yet have an answer as to how to fix it. That'll be what I dig into tomorrow morning. |
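A rough way to check that conclusion on a running allocation (the alloc ID is a placeholder, and this assumes the 1-minute cert TTL from the setup above):

```shell
# Fingerprint the rendered secret, restart the client, wait out the TTL,
# and compare: the secret changes, but no restart/signal reaches the task.
before=$(nomad alloc exec <alloc_id> cat /secrets/certificate.crt | md5sum)
sudo systemctl restart nomad
sleep 90
after=$(nomad alloc exec <alloc_id> cat /secrets/certificate.crt | md5sum)
echo "before: $before"
echo "after:  $after"

# With change_mode = "restart" you'd expect a restart event here if the
# change had been propagated to the workload.
nomad alloc status <alloc_id> | grep -A5 'Recent Events'
```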
This is fantastic stuff @tgross - thanks for taking the time to write such a thorough analysis. I need to read it a few times to absorb it all, but the first pass makes sense to me and I understand what you're saying and it clearly lines up with the behavior we have seen |
After doing a bit of smoke testing and printf debugging, it looks like the key is in these two fields on consul-template's render events:

```go
// WouldRender determines if the template would have been rendered. A template
// would have been rendered if all the dependencies are satisfied, but may
// not have actually rendered if the file was already present or if an error
// occurred when trying to write the file.
WouldRender bool
// DidRender determines if the Template was actually written to disk. In dry
// mode, this will always be false, since templates are not written to disk
// in dry mode. A template is only rendered to disk if all dependencies are
// satisfied and the template is not already in place with the same contents.
DidRender bool
```

The first time a template is rendered, these fields will both be true. After a client restart, if the secret hasn't changed |
"That does mean that tasks with dynamic secrets would always be restarted if the client restarts." FWIW that's exactly the hack we are putting in place. We using consul template task in jobs with secrets to watch for a KV entry for a node (that we change every time a nomad agent restarts) then using that to restart the container. Effectively the same thing but it impacts all our jobs as each one needs to implement this hack. I can say for sure that we would rather have nomad restart our containers with secrets than having to handle that ourselves in our jobs :) |
Hi @tgross Thanks for this great analysis! Regarding the potential solution in this comment |
Yes. But that patch is against an older version of Nomad that has seen a good bit of change since. #6324 from the same author is more recent (although it doesn't merge cleanly with the current HEAD). There are some problems in the details of that approach:
So rather than trying to inject state into the hooks that bridges restarts, I'm working up a patch that detects whether there's an existing driver handle present (which covers a wider range of edge cases). Then I'm factoring out the on-change handling so that both the |
💯 agree, thanks for the update! |
My working branch for the patch is https://github.com/hashicorp/nomad/compare/b-template-first-render. I've only smoke-tested this locally and haven't written automated testing to explore the edge cases (so definitely don't run this build anywhere you care about! 😀 ). Hoping to wrap that up early next week. |
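For anyone wanting to try it, building the branch looks roughly like this (the install path is a placeholder, and again: test environments only):

```shell
git clone https://github.com/hashicorp/nomad.git
cd nomad
git checkout b-template-first-render
go build -o ./nomad-dev .                        # or use the repo's own make targets
sudo install ./nomad-dev /usr/local/bin/nomad    # placeholder install path
sudo systemctl restart nomad
```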
That's great. Thanks @tgross for a detailed write up and quick turnaround. |
That branch works fine in automated unit testing. And when I ran it through end-to-end testing, the change signal is triggered as we wanted. This works perfectly with … but with … the alloc status for the failed alloc looks like:
Edit: |
"Edit: Better now than later! Appreciate the updates. We are tracking this one really closely |
After a long soak test it looks like the logmon error is a little misleading and the problem really is in the task restart, so I've got to rework the solution I've got in #9636. I haven't quite pinned down the problem yet. (The fun bit of this is that turning up logging to trace changes the timing and therefore the symptoms!) |
Ok, have that conclusively figured out now with a whole lot of log grubbing. It's mostly an artifact of the test rig: the PKI cert expires after 1 minute, so my test rig has to line up the restart such that it lands before the cert expires. But that timing requirement can cause the client to be shut down long enough that when it comes back up the server thinks the allocation is lost and tries to reschedule it. (I'd missed the single log line from the server asking for the alloc to stop among the ~30k entries. 🤦 ) There's a kind of gross but mostly unavoidable scenario here: because I'm testing with a single client, the scheduler will always place the alloc on the same host, and when that happens right on top of a client restart we can be in the middle of trying to start the alloc when the shutdown notice from the server arrives, which results in a taskrunner hook state that isn't viable and forces us to reschedule again. It looks like there's a narrow possibility of running into this any time the client is shut down long enough to be lost while the task is actually still running. But the reschedule is everything working as intended from that state. |
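To see whether this is what happened in a given run, checking the allocation history from the server's point of view is enough (the alloc ID is a placeholder):

```shell
# Look for the server marking the alloc lost / rescheduling it during the
# client outage.
nomad alloc status -verbose <alloc_id> | grep -iE 'lost|reschedul|stop'

# The window before a down client's allocs are marked lost is governed by the
# server's heartbeat_grace setting, so a stop that outlasts it risks exactly
# the reschedule described above.
```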
Fixed in #9636, which will ship in the upcoming Nomad 1.0.2. |
This is great stuff @tgross, thanks for all the work on it - we will definitely be upgrading to 1.0.2 and will be using this :) |
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad version
v0.12.5
Operating system and Environment details
Linux
Issue
The Consul token is not auto-renewed by Nomad when dynamic secrets are used in a Nomad job template with the Consul secrets engine.
Reproduction steps
Followed steps as specified in https://learn.hashicorp.com/tutorials/nomad/vault-postgres.
The default TTL for this token is 30 days, after which the lease expires and Nomad doesn't auto-renew the token.
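To confirm the lease TTL in play, reading the credentials and the mount's tuning works; the `consul/` mount path and the `service-fabio` role come from the tutorial and the job template below:

```shell
# lease_duration here is what consul-template/Nomad has to renew against.
vault read consul/creds/service-fabio

# Shows default_lease_ttl / max_lease_ttl for the mount.
vault read sys/mounts/consul/tune
```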
Job file (if appropriate)
```hcl
template {
  data = <<EOH
{{ with secret "consul/creds/service-fabio" }}
registry.consul.token = {{ .Data.token }}
{{ end }}
registry.consul.kvpath = /path-to-config
registry.consul.noroutehtmlpath = /fabio/noroute.html
proxy.addr =
ui.addr =
registry.consul.register.addr =
metrics.target =
metrics.statsd.addr =
metrics.prefix =
metrics.names =
EOH
  destination = "secrets/fabio.properties"
}
```
Nomad logs
```
agent.http: Request error: method=GET url=/v1/kv/fabio/noroute.html?consistent=&index=1143000&recurse= from=127.0.0.1:60596 error="rpc error making call
```