[feature] Support graceful reloading of TLS configuration #3247
Comments
While Nomad currently supports SIGHUP-based config reloading, it's unfortunately limited to log level and Vault. Thanks for the good writeup and use case. This seems like something Nomad should handle.
I think this might be it :) Will be paying attention to that pull request in the coming days.
@SoMuchToGrok This PR actually adds the ability to SIGHUP and reload certs: #3479. The PR you linked will be a bit more comprehensive and will allow adding or removing TLS altogether with a SIGHUP. I am closing as this has been merged into master and will land in 0.7.1 👍
This feature will make my life so much easier. Keep up the great work Hashi team!
There may have been a regression with this logic sometime after v0.7.1, specifically around reloading the RPC TLS config. After upgrading from v0.7.1 to v0.8.3, it appears that all RPC communication fails after sending a SIGHUP to a process when the contents of the TLS certificates have changed (just the data itself, not the location of the certs on the filesystem). I'll open up a new ticket if I can confirm this.
Nomad client logs
Task state logs (retrieved via Nomad API)
@SoMuchToGrok I tested certificate reloading in 0.8.4 and ran into issues that I mentioned in #4408. I didn't see RPC issues, but I was only running a single dev agent. Feel free to add comments to that issue or open up a new one. Sorry for the trouble!
@SoMuchToGrok The team looked into this, and I wanted to mention two things we found:
@schmichael I appreciate you and the team taking the time to look into it. I tried to reproduce #4408 and I can confirm that I'm not experiencing that. I'm still trying to make sense of the issue I'm seeing. From what I've seen so far, the EOF errors never stop. Additionally, once the EOF errors start, all Vault interactions fail with an RPC EOF error. I've been able to reproduce this in 2 different environments now, and the only resolution so far has been restarting the nomad service.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Within the TLS configuration stanza for Nomad servers and clients, changes to cert_file, ca_file, or key_file require a full service restart. This can be cumbersome in an environment that frequently rotates its PKI.
Is it feasible to implement on-the-fly changes by sending a SIGHUP? For context, this is the pattern that Vault server uses; see PR here. This would be an invaluable feature for us and would drastically reduce the complexity of maintaining highly-available systems.
Currently, we're using triggers from consul-template combined with consul locks to maintain server quorum. Adapting this pattern to the Nomad clients is a little more challenging, as consul lock timeouts (as well as consul-template timeouts) need to scale proportionally to the number of clients in the cluster. On the client side, we've accepted the fact that running jobs will occasionally have momentary blips due to PKI rotations.
It would be fantastic to avoid these workarounds and limitations. How complex would this be? I'm trying to understand whether this is something we can help implement.
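For illustration, here is a minimal sketch of the general SIGHUP reload pattern being asked for, written against Go's standard library only. It is not Nomad's or Vault's actual implementation: the `certReloader` type, its methods, and the `agent.pem` / `agent-key.pem` file names are hypothetical. The idea is that the listener's `tls.Config` looks up the certificate through a callback on every handshake, so swapping the stored certificate on SIGHUP takes effect for new connections without a restart.

```go
package main

import (
	"crypto/tls"
	"log"
	"os"
	"os/signal"
	"sync"
	"syscall"
)

// certReloader is a hypothetical helper that holds the current key pair
// and re-reads it from disk on demand.
type certReloader struct {
	mu       sync.RWMutex
	certPath string
	keyPath  string
	cert     *tls.Certificate
}

// reload reads the key pair from disk. On failure the previously loaded
// certificate (if any) is left in place.
func (r *certReloader) reload() error {
	cert, err := tls.LoadX509KeyPair(r.certPath, r.keyPath)
	if err != nil {
		return err
	}
	r.mu.Lock()
	r.cert = &cert
	r.mu.Unlock()
	return nil
}

// getCertificate matches the tls.Config.GetCertificate callback signature,
// so each new handshake sees the most recently loaded certificate.
func (r *certReloader) getCertificate(*tls.ClientHelloInfo) (*tls.Certificate, error) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.cert, nil
}

// watchSIGHUP reloads the key pair each time the process receives SIGHUP.
func (r *certReloader) watchSIGHUP() {
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, syscall.SIGHUP)
	go func() {
		for range ch {
			if err := r.reload(); err != nil {
				log.Printf("TLS reload failed, keeping old certificate: %v", err)
			}
		}
	}()
}

func main() {
	// File names are placeholders for whatever cert_file and key_file point at.
	r := &certReloader{certPath: "agent.pem", keyPath: "agent-key.pem"}
	if err := r.reload(); err != nil {
		log.Printf("initial TLS load failed: %v", err)
		return
	}
	r.watchSIGHUP()

	cfg := &tls.Config{GetCertificate: r.getCertificate, MinVersion: tls.VersionTLS12}
	ln, err := tls.Listen("tcp", "127.0.0.1:0", cfg)
	if err != nil {
		log.Printf("listen failed: %v", err)
		return
	}
	defer ln.Close()
	log.Printf("listening on %s; a real agent would accept connections here", ln.Addr())
}
```

This only covers the serving side. Outgoing connections would need the analogous `GetClientCertificate` callback, and a rotated CA would need the certificate pool rebuilt as well, which this sketch does not attempt.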