0.6.3 Issue with templates being re-rendered every few seconds. #3197
Comments
Just a quick note, nomad is currently using root tokens to ensure it isn't a token permission issue.
Root tokens are not renewable, so you'll need to disable that or use a periodic root token. That's one of the errors you're seeing.
@sethvargo The Nomad server has the root token; the instance of consul-template doesn't have root. It has a renewable periodic token.
@adamlc This is likely a consul-template bug. Can you reproduce this behavior by running an instance of consul-template with that template?
@adamlc Any update? Can you also give the Consul and Vault versions?
I found this issue in the google group and dadgar pointed me here. For me, it looked like the old mysql database backend was all that was affected, but I think it is probably more than that. I am not using a root token, but a token from the aws-ec2 auth backend. I just upgraded my test cluster from 0.6.0 to 0.6.3 and hit this. The update rate on the renewer is so fast that I had to shut everything down.
@stevenscg Can you attempt to run your config with just the newest consul-template? I am almost certain it is a bug in that codebase.
@dadgar I might be able to do that in a dev environment. It won't be the exact same setup, but the template, vault database backend, vault roles, and policies will be. Might be tomorrow at the earliest. Are you looking for anything in particular? I just finished rolling back the test environment to 0.6.0 and the issue appears to be gone.
@stevenscg That would be awesome. Mainly looking for as simple a reproduction step as possible with just consul-template. If you could grab the debug-level logs as well when you invoke it, that would be 👍
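For anyone following along, a minimal way to capture those debug logs with a standalone consul-template run might look like this (addresses and template paths are placeholders for your own setup):

```shell
# Hypothetical reproduction run with debug logging enabled;
# substitute your own Consul/Vault addresses and template paths.
consul-template \
  -log-level=debug \
  -consul-addr "127.0.0.1:8500" \
  -vault-addr "https://127.0.0.1:8200" \
  -template "template.tpl:rendered.json" \
  2>&1 | tee consul-template-debug.log
```

Leaving off `-once` lets it run long enough for the renewal loop to show up in the log.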
Will do. FWIW, this is fairly representative of what I was seeing:
That doesn't tell me much, but it might come in handy. It was fast too. The local syslog daemons couldn't keep up with it.
@stevenscg What is the lease duration on secrets from database/creds/billing?
Max TTL is 24h. Defaults for the backend, IIRC.
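For reference, the legacy mysql backend's lease values can be pinned explicitly rather than relying on the defaults; a sketch, assuming the backend is mounted at `mysql/` (the values here are illustrative, not my actual config):

```shell
# Hypothetical: set explicit lease values on the legacy mysql backend
# so the renewer's behavior is predictable rather than inherited.
vault write mysql/config/lease lease=1h lease_max=24h
```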
Sorry I haven't had time to look at this yet. I'll give it a go later if I have time :) |
@dadgar I just tried rendering the same template from before using the newest consul-template.
It seems to cycle through this every 1-3 minutes. If I downgrade C/T to 0.19.2 and run the same setup in-place, the pattern is different.
For reference, this particular template has 1 secret from the (deprecated) mysql vault backend, 3 keys from consul, and 1 service from consul. Versions:
@dadgar Another datapoint.... I went back to C/T v0.19.3 and dropped the mysql backend TTL from 1h to 5m and was able to trigger the renewer problem I saw in our deployed environment yesterday. In fact, any TTL smaller than about 18 minutes on this particular system would trigger the issue. 30 minute TTL was reliably stable. This system is very lightly loaded FWIW. The C/T output when the issue is triggered is roughly this:
@dadgar I've just tried using consul-template and it seems to work as expected; it doesn't constantly cycle the credentials like nomad does:

$ consul-template -v
consul-template v0.19.3 (e08c904)

Using the same template as my nomad file. Also using a root token here. (Just thought I'd mention I've tried using a root token in Nomad and one supplied by ec2-auth; both have the same results.)

$ consul-template -consul-addr "54.229.122.21:8500" -vault-renew-token=false -template "template.tpl:template.json"
I can confirm that I'm hitting this issue as well on Nomad 0.6.3, though I never ran Nomad 0.6.0 for comparison. These error messages cause enormous raft churn, and the consul servers go crazy trying to keep up with creating and reaping snapshots. It seems like this is affecting the
Don't think we're doing anything special wrt TTLs of these mountpoints, but not sure where all to check:
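On the "where all to check" question: one place to look is the tuned TTLs on each mount, which can be read back from the sys/mounts tune endpoint. A sketch, using the `database/` mount as a stand-in for whichever mounts you're auditing:

```shell
# Hypothetical check: read the tuned TTLs on a given mount (here, database/).
# Empty or zero values mean the server-wide defaults apply.
vault read sys/mounts/database/tune
```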
Hey all, we will be looking into this before cutting 0.7, so the next release should resolve this. |
Hey all, to mitigate this until 0.7 is out you can set the vault_grace to be much smaller than the lease of your secret. Something like 5-15 seconds. |
@dadgar unfortunately I'm still getting this issue on 0.7.0, even after setting a low vault_grace period :( |
@adamlc What is the lease duration of your secret? Your vault_grace should be set to be less than your secrets lease duration. If this isn't the case, the template will re-render the template immediately since the secret will expire within the grace period. https://www.nomadproject.io/docs/job-specification/template.html#vault_grace |
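To make that concrete, here's a sketch of the relevant template stanza; the source/destination paths are placeholders, and the grace value assumes your shortest secret lease is comfortably above 15s:

```hcl
template {
  source      = "local/template.tpl"
  destination = "local/template.json"

  # Must be smaller than the shortest lease among the template's
  # secrets, or Nomad re-renders immediately because the secret
  # already falls inside the grace window.
  vault_grace = "15s"
}
```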
@dadgar I set the grace to 10s and all the leases are >24 hours. I'll have to have a dive into the logs again. I'll see if I can come up with something reproducible for you to work with :)
I have a suspicion this is something to do with the old mysql backend. I have 14k+ leases for mysql which probably got created when the tasks were restarting. I'm currently decommissioning the backend and revoking the leases. Hopefully this is what's causing it!
@stevenscg it's interesting that you also use the old mysql backend. I've not actually had time to take a look at this in more detail yet. I'm not sure what the default grace is, but if you do a nomad plan beforehand you should be able to see what the default / current one is.
@adamlc Ah. Good point about nomad plan. I'll try that. Did you just move to the new database secret backend instead of messing with it further? |
I'll add that my database secret leases are 4h / 24h max:
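For completeness, a role definition matching those 4h / 24h leases would look roughly like this; the role name, db_name, and creation statement are placeholders, not my actual config:

```shell
# Hypothetical database-backend role with 4h default / 24h max leases.
vault write database/roles/billing \
    db_name=mysql \
    creation_statements="CREATE USER '{{name}}'@'%' IDENTIFIED BY '{{password}}';" \
    default_ttl=4h \
    max_ttl=24h
```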
I'm still rocking 0.6.3 at the moment. I have migrated to the database backend but haven't had time to test the setup in 0.7.0 yet. |
Got it. FYI, the default vault_grace does seem to be 15s as documented. I'm testing with an arbitrary 10s, but don't know why it would be any better. |
@dadgar I'm still seeing this. Any thoughts before I roll back to 0.6.0? Here's where I'm at:
I'll add that the re-render rate with this new database backend seems higher than with the legacy mysql backend: roughly 4-5 seconds between re-renders vs 20 or 30s.
@stevenscg I've got a similar setup, with the exception of Vault, which is 0.8.3. No problems on 0.7.
@jippi That's the crazy part to me. I'd think that my setup is really a very common one. My vault was at 0.8.3 earlier today and the issue was happening both before and after the upgrade to 0.9.0. |
@jippi How do you upgrade your nomad server and worker nodes? I typically upgrade them in-place carefully one at a time servers then workers. |
I've posted some additional details in the mailing list in case anybody else is experiencing this problem, though I'm worried that it is something very specific with my environment. |
I ran the templates through consul-template versions 0.19.3 and 0.19.4. Both gave the same sys/leases/renew error. It looks like this is going to be a vault default policy issue in my environment. The default policy for my deployed environments has been around since Vault 0.6 or something. On our development systems that were not experiencing this issue, vault is loaded fresh and thus the default policy is also fresh. I vaguely remembered a comment or a discussion about the vault policy changing and the task token needing access to sys/leases/renew in vault. The piece I needed is here: But not on the nomad website just yet.
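For anyone comparing policies, the renewal-related stanzas in the newer default policy look roughly like this; check the default policy shipped with your own Vault version for the authoritative text:

```hcl
# Allow a token to renew leases via the current endpoint
path "sys/leases/renew" {
  capabilities = ["update"]
}

# ...and via the legacy endpoint used by older clients
path "sys/renew" {
  capabilities = ["update"]
}
```

A default policy created under Vault 0.6 predates the sys/leases/* paths, which would explain the sys/leases/renew denials only showing up on long-lived clusters.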
I'm just looking at this issue again and I'm still having issues with 0.7.1. At first I thought it may be my database backend; I didn't have TTLs set for the database roles, so I thought that may be the issue. Unfortunately not. I'm going to go through each secret referenced to make sure they have TTLs. Other than the above I'm stumped! 😢 I'll post here if anything comes to light!
Turns out that my vault default policy was completely different to newer versions; once I updated my default policy with the one from a newer version, it worked perfectly! Finally glad to have this sorted! Thanks @stevenscg for pointing me in the right direction! For anyone else with the same issue, simply grab the latest default vault policy, overwrite your current default policy in vault, and you should be good to go!
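In case it saves someone a step, overwriting the default policy is a one-liner once you have the newer policy text in a file (the filename here is a placeholder; older Vault CLIs spell the subcommand `vault policy-write` instead):

```shell
# Hypothetical: replace the in-cluster default policy with the newer text.
# default-policy.hcl stands in for the policy file you grabbed.
vault policy write default default-policy.hcl
```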
@adamlc Glad it helped you! So much grey hair over that one. |
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad version
Nomad v0.6.3
Operating system and Environment details
centos-release-7-3.1611.el7.centos.x86_64
Issue
We're currently on Nomad 0.6.0 in production with no issues. After trying 0.6.3 we're getting problems with our templates on jobs. Every few seconds the template is re-rendered and a SIGHUP is sent to the containers. This happens continuously even though the data in consul / vault isn't changing. I think it's vault-related because templates without vault secrets seem OK.
Reproduction steps
I have a simple node AMQP consumer that connects to RabbitMQ and our internal API. The only dynamic part of the template is the AMQP settings.
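For context, the template in question is along these lines; all names here are hypothetical, only the shape matches my real template (one Vault secret plus Consul lookups):

```hcl
{{ with secret "rabbitmq/creds/consumer" }}
AMQP_USERNAME={{ .Data.username }}
AMQP_PASSWORD={{ .Data.password }}
{{ end }}
AMQP_HOST={{ key "service/amqp/host" }}
```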
Nomad Server logs (if appropriate)
Nothing seems out of the ordinary here. Just the server joining the new cluster.
Nomad Client logs (if appropriate)
The failure to renew the lease then just repeats in the log. It's odd this happens because the setup is exactly the same as the current production cluster.
I've checked my leases and such in vault and all seems good there.
Job file (if appropriate)