Feature Request: Detect credentials changes without application restarts #985
Comments
Thanks! TBH I would rather go for the latter option. Something that might be helpful is this: cert-manager/cert-manager#1168, as it is exactly the same problem. Particularly https://github.com/pusher/wave. Essentially on k8s you would have such cases all the time, so this might be the way to go for now. Not sure if it's worth adding complexity to the code, as a restart might be just enough. We might want to enumerate all components and see if a restart is safe for each of them; e.g. all components, especially Prometheus + sidecar, require HA for a safe rolling restart, etc.
While Wave seems interesting, to be honest I'm not too enthusiastic about depending on yet another service just to handle the rotation of the config file. It means I need to worry about the integrations with this extra service (does it support StatefulSets and DaemonSets? Or just Deployments, and the others are left out in the cold?), whether the service is highly available, any physical resources required to run this additional service, etc. It also makes it more difficult to integrate with other services. As an example, the prometheus-operator completely manages the StatefulSets it creates. If it didn't provide a way to set custom annotations on those StatefulSets, how would I opt those pods into using Wave? (Thankfully the prometheus CRD has a …)

K8s already provides updates to ConfigMap and Secret volumes on the kubelet's resync interval, so IMO any application that expects to run on K8s should also expect any config file mounted from a ConfigMap or Secret to be updated during execution and should handle that accordingly. However, your concerns about maintaining availability during restarts triggered by a config rotation are definitely valid; applications that need to coordinate restarts between highly-available components would need to do so through a ConfigMap or something similar, which definitely adds complexity on the Thanos side.

As another option, it should be pretty simple to propagate the credential errors that are thrown when the credentials are invalid (…)
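To illustrate the "handle updated mounts in-process" point, here is a minimal sketch (not Thanos code) that polls a mounted config file and invokes a hypothetical reload callback when its contents change; the path and interval are illustrative assumptions:

```go
package main

import (
	"bytes"
	"log"
	"os"
	"time"
)

// watchConfig polls path and calls onChange whenever the file contents differ
// from the previously seen contents. Kubernetes swaps ConfigMap/Secret mounts
// atomically, so comparing contents is enough for this sketch.
func watchConfig(path string, interval time.Duration, onChange func([]byte)) {
	var last []byte
	for {
		data, err := os.ReadFile(path)
		if err != nil {
			log.Printf("read config: %v", err)
		} else if !bytes.Equal(data, last) {
			last = data
			onChange(data)
		}
		time.Sleep(interval)
	}
}

func main() {
	watchConfig("/etc/thanos/objstore.yml", time.Minute, func(cfg []byte) {
		log.Printf("config changed (%d bytes); rebuild the bucket client here", len(cfg))
	})
}
```

Comparing file contents rather than modification times sidesteps the symlink swaps Kubernetes performs when it updates a mounted ConfigMap or Secret.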
I agree with your points. Let's see what others think; cc @GiedriusS @domgreen. Current options.
Maybe we should look at how other projects like Envoy or Prometheus (e.g. credentials for the Kubernetes API) do it? I think because the Prometheus configuration is reloadable, it might be doable.
I'm not sure I follow you here. My understanding is that the signature of S3 upload requests is calculated up-front using the provided access key/secret, the region, the date, and a few other parameters. Just changing the credentials on-disk wouldn't invalidate requests in-flight, as their signatures would have been calculated beforehand using the previous set of credentials. Can you clarify why swapping the credentials for ongoing requests is an issue?
Sure, I meant that you just need to swap "HTTP clients" without affecting the application. To do so you need to block new requests, close all pending connections, swap the client, and unblock again. In terms of the current implementation, that's not trivial.
Sorry, I'm still not understanding which part is non-trivial. I'm also confused as to why we would need to close pending requests and block while swapping clients, as both clients should be able to process requests during the cutover; it seems like you might be confusing rotation, which deploys a new set of credentials, and revocation, which deletes the old set of credentials (sorry again if this is not the case). Here's the flow I have in mind:
The old client's credentials are still valid after rotation; it's only after revocation that they become invalid. Shouldn't this be as simple as checking / updating the credentials at the start of every …? Any synchronization around a hard switch of credentials affecting in-flight requests would, I think, definitely be out of scope for this feature.
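For what it's worth, a rough sketch of that check-and-swap flow, assuming a hypothetical newBucketClient constructor and a hash check of the on-disk config (none of these names come from the Thanos codebase). In-flight requests keep the client they started with, so only new work picks up the rotated keys:

```go
package main

import (
	"crypto/sha256"
	"log"
	"os"
	"time"
)

// bucket is a stand-in for whatever object-storage client interface is in use.
type bucket interface{ Upload(path string) error }

// newBucketClient is a hypothetical constructor building a client from raw config bytes.
func newBucketClient(cfg []byte) (bucket, error) { return nil, nil /* placeholder */ }

func syncLoop(configPath string) {
	var (
		lastHash [32]byte
		client   bucket
	)
	for {
		// At the start of every iteration, rebuild the client if the config changed.
		if cfg, err := os.ReadFile(configPath); err == nil {
			if h := sha256.Sum256(cfg); h != lastHash {
				if c, err := newBucketClient(cfg); err == nil {
					client, lastHash = c, h
					log.Println("credentials changed, swapped bucket client")
				} else {
					log.Printf("keeping old client, new config invalid: %v", err)
				}
			}
		}
		if client != nil {
			_ = client.Upload("some-block") // ongoing work keeps whichever client it started with
		}
		time.Sleep(30 * time.Second)
	}
}

func main() { syncLoop("/etc/thanos/objstore.yml") }
```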
@bwplotka After digging into the code a bit more, it appears that (at least for the S3/Minio clients) this is only an issue when passing the credentials in either as environment variables, or (as we're currently doing) by setting the …

The solution for those who want dynamically-provided credentials from an external service is to format them as a standard config file and mount it where the client expects it (i.e. …).

Closing since no code updates are required, but expanding the documentation around providing credentials to explain which ones are static and which ones can be refreshed dynamically might be a good idea. 😉 Thanks again for taking the time to look into this.
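For reference, a minimal sketch of the "mount a standard credentials file where the client expects it" approach, using minio-go's AWS-style file credentials. This uses the current minio-go v7 API (which may differ from the client version discussed in this thread), and the endpoint, path, and profile are illustrative:

```go
package main

import (
	"log"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
	// NewFileAWSCredentials reads an AWS-style shared credentials file; with an
	// empty path it falls back to AWS_SHARED_CREDENTIALS_FILE or ~/.aws/credentials.
	creds := credentials.NewFileAWSCredentials("/secrets/aws/credentials", "default")

	client, err := minio.New("s3.amazonaws.com", &minio.Options{
		Creds:  creds,
		Secure: true,
	})
	if err != nil {
		log.Fatal(err)
	}
	_ = client // bucket operations (upload, list, ...) would use this client
}
```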
@Capitrium asked: …

So... it's doable? Have you tested it?
I dug into this a little more on Friday and over the weekend; the short answer is no. It would be doable for S3 if Thanos were using the AWS SDK, but the minio-go client doesn't expose the aws-sdk-go function for re-reading credentials from disk, so it will keep using the same set of cached/expired credentials. 😞

Doesn't the sidecar run a single sync loop here? This is probably the most critical one, since without it, metrics stop getting uploaded to the objstore after the old credentials are revoked. The store may be trickier, but there aren't any HA concerns for the store or the compactor that I'm aware of, so those could probably just be restarted by a sidecar. I don't actually run the ruler component, so I have no idea as far as that one goes 😆
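For context on the "doable with the AWS SDK" point: in aws-sdk-go (v1), file-backed credentials can be invalidated and re-read without restarting the process. A sketch under that assumption (the path, region, and once-a-minute expiry are illustrative, and this is not how Thanos wires its S3 client):

```go
package main

import (
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/credentials"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	// Credentials backed by a shared credentials file; Get() caches the value
	// until the credentials are marked expired.
	creds := credentials.NewSharedCredentials("/secrets/aws/credentials", "default")

	sess, err := session.NewSession(&aws.Config{
		Region:      aws.String("us-east-1"),
		Credentials: creds,
	})
	if err != nil {
		log.Fatal(err)
	}
	client := s3.New(sess)
	_ = client // PutObject / ListObjectsV2 calls would go through this client

	// When the mounted file is rotated, Expire() forces the next request to
	// re-read the file instead of reusing the cached keys.
	for range time.Tick(time.Minute) {
		creds.Expire()
	}
}
```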
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

unstale please )

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

unstale please )
Thanks for being patient @george-angel. It adds some complexity, but I think this could be added to Thanos. We are already working on reloading TLS certs for various HTTP and gRPC connections, so we can do this for buckets as well. Help wanted to design and implement 👍

We are a little busy right now, but as soon as we have some time, we will make sure to drop a line here to say we are working on this.
I'm running into this issue as well; we use HashiCorp Vault for secrets management and Vault supplies short-lived AWS credentials, so we rotate the S3 keys regularly. Looks like the only option at present is to restart Thanos when the keys change?
Right now, yes.
Hello all. I just dealt with this for a different Go project, rwynn/monstache#333. The end solution seems to work really well. I have a design/rough draft for Thanos which does something similar: #2135. If this looks like something we want to do then we can polish up the code and get it production-ready. Thoughts? I hope this helps move this forward!
Thanks! Question is, do we have agreement on the implementation details or do we need to go through the design proposal process: https://thanos.io/contributing/#adding-new-features-components

I looked briefly through your PR and I think we need something on top of the object storage interface / client, not a particular implementation like S3.
It's a non-breaking change and relatively small, so as long as there are no objections we probably don't need to create a design doc for it. In order for it to sit on top of the object storage interface/client we would need an interface for Credentials across the different clients and implement that interface for each; then we could call a general Expire function on those Credentials. Right now we don't store credentials at that layer, so I kept it close to where we create the AWS credentials. If we want to do a larger refactor later to generalize credentials, this would not impede that because of its non-breaking nature. Thoughts? I think it would be hard to put it at the interface/client level unless you have an easy idea and I missed something.
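A hedged sketch of what such a provider-agnostic Credentials interface might look like; all names here are hypothetical and not taken from the Thanos code or the linked PR:

```go
package objstore

// Credentials is a hypothetical provider-agnostic handle that each object
// storage client (S3, GCS, Azure, ...) could implement for its own secrets.
type Credentials interface {
	// Expire marks the cached credentials stale so the next request
	// re-reads them from their source (file, env, metadata service, ...).
	Expire()
	// IsExpired reports whether a refresh is needed before the next request.
	IsExpired() bool
}

// ReloadableBucket is a hypothetical extension of a bucket client that exposes
// its credentials, letting a single reload path call Expire on any backend
// without knowing which provider is configured.
type ReloadableBucket interface {
	Credentials() Credentials
}
```

With something like this, the reload trigger (file watch, SIGHUP, API call) only needs the interface, and each backend keeps its own refresh logic.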
Hello 👋 Looks like there was no activity on this issue for the last 30 days.

remove stale

Hello 👋 Looks like there was no activity on this issue for the last 30 days.

remove stale

Hello 👋 Looks like there was no activity on this issue for the last two months.

remove stale

Hello 👋 Looks like there was no activity on this issue for the last two months.

remove stale

Hello 👋 Looks like there was no activity on this issue for the last two months.

remove stale
How about flipping the liveness check if uploads are failing, and then just letting the liveness probe restart the container?
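A minimal sketch of that idea (not Thanos's actual probe wiring; the /-/healthy path, port, and ten-minute threshold are assumptions): the health endpoint starts failing once no upload has succeeded recently, so the kubelet's liveness probe restarts the pod and the process picks up the rotated credentials on startup.

```go
package main

import (
	"log"
	"net/http"
	"sync/atomic"
	"time"
)

// lastSuccessfulUpload holds the unix time of the last successful block upload.
var lastSuccessfulUpload atomic.Int64

// recordUploadResult would be called from the upload path after each attempt.
func recordUploadResult(err error) {
	if err == nil {
		lastSuccessfulUpload.Store(time.Now().Unix())
	}
}

func main() {
	lastSuccessfulUpload.Store(time.Now().Unix())

	// Liveness endpoint: report unhealthy once no upload has succeeded for a
	// while, e.g. because rotated credentials were revoked and uploads now fail.
	http.HandleFunc("/-/healthy", func(w http.ResponseWriter, r *http.Request) {
		if time.Since(time.Unix(lastSuccessfulUpload.Load(), 0)) > 10*time.Minute {
			http.Error(w, "uploads failing, requesting restart", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	log.Fatal(http.ListenAndServe(":10902", nil))
}
```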
One idea: instead of reloading, we could just have an API for applying configuration at some point. cc @smintz
I've been thinking about this for other applications, and when deployed on Kubernetes, for example, I see no reason for applications not to automatically reload configuration that changes on disk, as it is remounted atomically anyway. So I think Thanos, Prometheus, etc. should all have an opt-in flag …
I do not really need this but I can see the need. I think we should have a more straightforward API than metrics to see if the config is valid (and not /-/healthy, as that would kill the pod, which could not restart). I also think that we should use inotify but also throttle the reloads (e.g. once per minute), because a reload is still heavy in Prometheus. Feel free to open an issue on prometheus/prometheus and talk about it on the mailing list and/or dev summit, since we probably want that in sync with all the projects, not just prom/prometheus.
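A sketch of the inotify-plus-throttle idea from the last two comments, using the fsnotify library; the watched path, the one-minute minimum interval, and the reload action are illustrative assumptions, not an agreed design:

```go
package main

import (
	"log"
	"time"

	"github.com/fsnotify/fsnotify"
)

func main() {
	const configPath = "/etc/thanos/objstore.yml"

	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer watcher.Close()

	// Watch the directory rather than the file: Kubernetes updates ConfigMap and
	// Secret mounts by swapping a symlinked directory, which replaces the file.
	if err := watcher.Add("/etc/thanos"); err != nil {
		log.Fatal(err)
	}

	var lastReload time.Time
	for event := range watcher.Events {
		if event.Op&(fsnotify.Write|fsnotify.Create) == 0 {
			continue
		}
		// Throttle: inotify can fire several events per update and reloads are
		// not free, so enforce a minimum interval between reloads.
		if time.Since(lastReload) < time.Minute {
			continue
		}
		lastReload = time.Now()
		log.Printf("%s changed, reloading configuration here", configPath)
	}
}
```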
Hello 👋 Looks like there was no activity on this issue for the last two months.

Closing for now as promised, let us know if you need this to be reopened! 🤗
Currently, Thanos reads the bucket configuration file only at startup. This works fine for most cases, but becomes problematic when the credentials are generated and rotated by an external service. In this case, the credentials that Thanos started with will become invalid and interactions with the configured object storage will fail until Thanos is restarted, which is not ideal.
Some potential solutions to this problem include implementing a hot reloading feature, or adding a sidecar container that triggers a restart of Thanos when the config file is updated.