
Feature Request: Detect credentials changes without application restarts #985

Closed
Capitrium opened this issue Mar 27, 2019 · 57 comments

@Capitrium

Currently, Thanos reads the bucket configuration file only at startup. This works fine for most cases, but becomes problematic when the credentials are generated and rotated by an external service. In this case, the credentials that Thanos started with will become invalid and interactions with the configured object storage will fail until Thanos is restarted, which is not ideal.

Some potential solutions to this problem include implementing a hot reloading feature, or adding a sidecar container that triggers a restart of Thanos when the config file is updated.

@bwplotka
Member

Thanks!

TBH I would rather go for the latter option. Something that might be helpful is this: cert-manager/cert-manager#1168, as it is exactly the same problem.

Particularly https://github.com/pusher/wave

Essentially on k8s you would have such cases all the time, so this might be the way to go for now. Not sure if it's worth adding complexity to the code, as a restart might be just enough. We might want to enumerate all components and see if a restart is safe. E.g. all components, especially Prometheus + sidecar, require HA for a safe rolling restart, etc.

@Capitrium
Author

While Wave seems interesting, to be honest I'm not too enthusiastic about depending on yet another service just to handle the rotation of the config file. It means I need to worry about the integrations with this extra service (does it support StatefulSets and DaemonSets? Or just Deployments, and the others are left out in the cold?), whether the service is highly available, any physical resources required to run this additional service, etc.

It also makes it more difficult to integrate with other services. As an example, the prometheus-operator completely manages the StatefulSets it creates. If it didn't provide a way to set custom annotations on those StatefulSets, how would I opt those pods into using Wave? (Thankfully the prometheus CRD has a podMetadata field for this purpose, but hopefully you get my point. 😄)

K8s already provides updates to ConfigMap and Secret volumes on the kubelet's resync interval, so IMO any applications that expect to run on K8s should also expect any config file mounted from a ConfigMap or Secret to be updated during execution and should handle that accordingly. However, your concerns about maintaining availability during restarts triggered by a config rotation are definitely valid; applications that need to coordinate restarts between highly-available components would need to do so through a ConfigMap or something similar, which definitely adds complexity on the Thanos side.
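
For illustration, here is a minimal Go sketch of how an application could notice such an update, assuming a simple content-hash polling loop (since the kubelet swaps the mounted directory atomically, hashing the file contents is more robust than watching a single inode); the function names are illustrative, not Thanos code:

```go
package main

import (
	"crypto/sha256"
	"log"
	"os"
	"time"
)

// watchFile polls the mounted config file and invokes onChange whenever its
// contents differ from the last observed hash. Kubernetes replaces the whole
// mounted directory on the kubelet's resync interval, so hashing the contents
// catches updates regardless of how the swap happens.
func watchFile(path string, interval time.Duration, onChange func([]byte)) {
	var last [sha256.Size]byte
	for range time.Tick(interval) {
		data, err := os.ReadFile(path)
		if err != nil {
			log.Printf("read %s: %v", path, err)
			continue
		}
		if h := sha256.Sum256(data); h != last {
			last = h
			onChange(data)
		}
	}
}
```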

As another option, it should be pretty simple to propagate the credential errors thrown when the credentials are invalid (err="check exists: stat s3 object: Access Denied."), so that Thanos exits with an error and K8s can simply restart the container. This is the easiest/laziest option, but since Thanos is semi-broken when it's given invalid credentials, I don't see it as a particularly bad one for now. 😆
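
A rough sketch of that fail-fast option; matching on the quoted error string is for illustration only (a real implementation would propagate a typed error), and syncOnce is a hypothetical stand-in for a component's sync iteration:

```go
package main

import (
	"fmt"
	"log"
	"strings"
	"time"
)

// runSyncLoop wraps a component's sync iteration and treats a credential
// error from the object store as fatal, so the process exits and the
// orchestrator restarts it with the freshly mounted credentials.
func runSyncLoop(syncOnce func() error) error {
	for {
		if err := syncOnce(); err != nil {
			// Illustrative string match; see the note above.
			if strings.Contains(err.Error(), "Access Denied") {
				return fmt.Errorf("credentials invalid, exiting: %w", err)
			}
			log.Printf("sync failed, retrying: %v", err)
		}
		time.Sleep(30 * time.Second)
	}
}
```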

@bwplotka
Member

I agree with your points. Let's see what others think, cc @GiedriusS @domgreen

Current options:

  1. Add hot-reload. I find the implementation quite tricky, as you need to swap the credentials used by ongoing requests to the object store, so it must be synchronized. But doable and maybe worth it.
  2. The err="check exists: stat s3 object: Access Denied." option is not great, as essentially you get a random k8s restart no matter what. For some components that is kind of painful (store gateway, Prometheus + sidecar). Also, does this work for non-k8s setups as well? And how would we even get an access-denied error from every provider? (:

Maybe we should look at how other projects like Envoy and Prometheus (credentials to the Kubernetes API, e.g.) do it? Since the Prometheus configuration is reloadable, this might be doable.

@Capitrium
Author

Add hot-reload. I find the implementation quite tricky, as you need to swap the credentials used by ongoing requests to the object store, so it must be synchronized.

I'm not sure I follow you here. My understanding is that the signature of S3 upload requests is calculated up-front using the provided access key/secret, the region, the date, and a few other parameters. Just changing the credentials on-disk wouldn't invalidate requests in-flight, as their signatures would have been calculated beforehand using the previous set of credentials. Can you clarify why swapping the credentials for ongoing requests is an issue?

@bwplotka
Member

Sure, I meant just that you need to swap "HTTP clients" without affecting the application. To do so you need to close all pending connections, hold off starting new ones, swap, and then unblock again. In terms of the current implementation it's not trivial.

@Capitrium
Author

Sorry, I'm still not understanding which part is non-trivial. I'm also confused as to why we would need to close pending requests and block while swapping clients, as both clients should be able to process requests during the cutover; it seems like you might be confusing rotation, which deploys a new set of credentials, and revocation, which deletes the old set of credentials (sorry again if this is not the case).

Here's the flow I have in mind:

  1. Thanos starts with credentials set A
  2. Later, the credentials are rotated, resulting in 2 sets of credentials, A and B
  3. Thanos receives credentials set B on the Secret volume mount on the kubelet's resync interval, replacing set A
  4. Thanos swaps the http clients / bucket config / whatever, so that new requests use set B; in-flight requests are unaffected and are still using set A
  5. Later, the old credentials are revoked. At this point, Thanos is only using set B, so there's nothing else to do application-wise

The old client's credentials are still valid after rotation, it's only after revocation that they become invalid. Shouldn't this be as simple as checking / updating the credentials at the start of every Sync call? Or maybe I'm just missing something 😅

Any synchronization around a hard switch of credentials that affects in-flight requests would, I'd think, definitely be out of scope for this feature.
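
To make step 4 concrete, here's a minimal Go sketch of checking and swapping the client at the start of every Sync call. Everything here (reloadingBucket, the newBkt factory) is hypothetical glue, not existing Thanos code; it only illustrates that in-flight requests keep the old client while new requests get the new one:

```go
package main

import (
	"bytes"
	"os"
	"sync/atomic"
)

// Bucket stands in for the object-storage client interface.
type Bucket interface{}

type reloadingBucket struct {
	credPath string
	lastCred []byte
	cur      atomic.Value                       // holds the active Bucket
	newBkt   func(creds []byte) (Bucket, error) // hypothetical client factory
}

// refresh is called at the start of every Sync (single caller, so lastCred
// needs no locking). If the mounted credentials changed, it builds a new
// client and swaps it in atomically: in-flight requests keep the client they
// already grabbed (still-valid set A), new requests get set B.
func (r *reloadingBucket) refresh() error {
	creds, err := os.ReadFile(r.credPath)
	if err != nil {
		return err
	}
	if bytes.Equal(creds, r.lastCred) {
		return nil
	}
	bkt, err := r.newBkt(creds)
	if err != nil {
		return err // keep the old, still-valid client on a bad swap
	}
	r.lastCred = creds
	r.cur.Store(bkt)
	return nil
}

// bucket returns the currently active client.
func (r *reloadingBucket) bucket() Bucket { return r.cur.Load().(Bucket) }
```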

@Capitrium
Author

@bwplotka After digging into the code a bit more, it appears that (at least for S3/Minio clients) this is only an issue when either passing the credentials in as environment variables or (as we're currently doing) setting the access_key/secret_key values in the objstore config file. The client itself tries to get fresh credentials for every request, but in both of those cases the credentials provided are static, and the credentials provider will always read the same set of credentials until Thanos is restarted.

The solution for those who want dynamically-provided credentials from an external service is to format them as a standard config file and mount it where the client expects it (i.e. ~/.aws/credentials). The client will then read those credentials for every request, picking up any changes that might be propagated through the file mount.

Closing since there's no code updates required, but expanding the documentation around providing credentials to explain which ones are static and which ones can be refreshed dynamically might be a good idea. 😉

Thanks again for taking the time to look into this.

@bwplotka
Member

bwplotka commented Apr 3, 2019

@Capitrium asked:

hey @bwplotka, before I reopen that feature request, can you help me understand why the implementation is non-trivial? I’m having a lot of trouble understanding why we can’t just check if the credentials have changed at the start of every Sync call and re-instantiate the bucket client if they have. I need hot-reloading pretty urgently so I can work on it myself if I need to, but I just don’t understand why it’s non-trivial to update the credentials.

  1. We create the bucket client at the very start and it is shared by everyone.
  2. For the compactor it's easy, I agree -> the start of every Sync, as there is just a single loop (it might change soon). BUT what about other components? Sidecar, store, ruler? They will have to rotate as well, and there is no single loop there. (One way around this is sketched below.)
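
As a rough sketch rather than a concrete proposal: since every component already talks to object storage through the shared bucket interface, a wrapper could check the mounted credentials on each operation instead of on each Sync. All names here (refreshingBucket, the refresh callback) are illustrative, not existing Thanos code:

```go
package main

import (
	"context"
	"io"
	"log"
)

// Bucket stands in for the slice of the shared object-storage interface that
// matters here.
type Bucket interface {
	Upload(ctx context.Context, name string, r io.Reader) error
}

// refreshingBucket wraps the shared client. Every operation first asks a
// hypothetical refresh() for the current client, so components without a
// single sync loop (sidecar, store, ruler) still pick up rotated credentials.
type refreshingBucket struct {
	refresh func() (Bucket, error) // checks the mounted file, swaps if changed
}

func (b *refreshingBucket) Upload(ctx context.Context, name string, r io.Reader) error {
	bkt, err := b.refresh()
	if err != nil {
		// The old credentials stay valid until revoked, so falling back
		// to the previous client would also be a reasonable choice here.
		log.Printf("credential refresh failed: %v", err)
		return err
	}
	return bkt.Upload(ctx, name, r)
}
```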

@bwplotka
Member

bwplotka commented Apr 3, 2019

The solution for those who want dynamically-provided credentials from an external service is to format them as a standard config file and mount it where the client expects it (i.e. ~/.aws/credentials). The client will then read those credentials for every request, picking up any changes that might be propagated through the file mount.

So... it's doable? Have you tested it?

@Capitrium
Author

So ... it's doable? have you tested it?

I dug into this a little more on Friday and over the weekend; the short answer is no.

It would be doable for S3 if Thanos were using the AWS SDK, but the minio-go client doesn't expose the aws-sdk-go function to re-read credentials from disk, so it will keep using the same set of cached/expired credentials. 😞

For the compactor it's easy, I agree -> the start of every Sync, as there is just a single loop (it might change soon). BUT what about other components? Sidecar, store, ruler? They will have to rotate as well, and there is no single loop there.

Doesn't the sidecar run a single sync loop here? This is probably the most critical one since without it, metrics stop getting uploaded to the objstore after old credentials are revoked.

The store may be trickier, but there aren't any HA concerns for the store or the compactor that I'm aware of, so those could probably just be restarted by a sidecar.

I don't actually run the ruler component so I have no idea as far as that one goes 😆

@Capitrium Capitrium reopened this Apr 3, 2019
@stale

stale bot commented Jan 11, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 11, 2020
@george-angel

unstale please )

@stale stale bot removed the stale label Jan 13, 2020
@stale

stale bot commented Feb 12, 2020

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

@stale stale bot added the stale label Feb 12, 2020
@george-angel

unstale please )

@stale stale bot removed the stale label Feb 12, 2020
@bwplotka
Member

Thanks for being patient @george-angel.

It adds some complexity, but I think this could be added to Thanos. We are already working on reloading TLS certs for various HTTP and gRPC connections, so we can do this for buckets as well.

Help wanted to design and implement 👍

@george-angel

We are a little busy right now, but as soon as we have some time, we'll make sure to drop a line here to say we are working on this.

@prg3

prg3 commented Feb 12, 2020

I'm running into this issue as well; we use HashiCorp Vault for secrets management, and Vault supplies short-lived AWS credentials, so we rotate the S3 keys regularly. It looks like the only option at present is to restart Thanos when the keys change?

@bwplotka
Member

bwplotka commented Feb 12, 2020 via email

@kush-patel-hs

kush-patel-hs commented Feb 13, 2020

Hello all. I just dealt with this for a different Go project rwynn/monstache#333. The end solution seems to work really well. I have a design/rough draft for Thanos which does something similar #2135. If this looks like something we want to do then we can polish up the code and get it production ready. Thoughts?

I hope this helps move this forward!

@bwplotka
Member

bwplotka commented Feb 13, 2020 via email

@kush-patel-hs

kush-patel-hs commented Feb 13, 2020

It's a non-breaking change and relatively small, so as long as there are no objections we probably don't need to create a design doc for it.

In order for it to sit on top of the object interface/client, we would need an interface for Credentials across the different clients and implement that interface for each. Then we could call a general Expire function on those Credentials. Right now we don't store credentials at that layer, so I kept it close to where we create the AWS credentials. If we want to do a larger refactor later to generalize credentials, this change would not impede that, because of its non-breaking nature. Thoughts? I think it would be hard to put it at the interface/client level, unless you have an easy idea and I missed something.
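
For reference, a minimal sketch of the Expire-based approach against minio-go (shown with the v7 import paths, though Thanos was on an earlier major version at the time; the file path and wiring are illustrative): keep a handle to the *credentials.Credentials used to build the client, and have a file watcher mark it expired so the next request re-reads the mounted file:

```go
package main

import (
	"log"

	minio "github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
	// File-backed provider: credentials are read from the mounted file and
	// cached until the Credentials handle is marked expired.
	creds := credentials.NewFileAWSCredentials("/var/run/secrets/aws/credentials", "")

	client, err := minio.New("s3.amazonaws.com", &minio.Options{
		Creds:  creds,
		Secure: true,
	})
	if err != nil {
		log.Fatal(err)
	}
	_ = client // used by the uploading components as usual

	// A file watcher (inotify, polling, ...) calls this when the mounted
	// file changes; the next request then re-reads the file and signs
	// with the fresh keys.
	onCredentialFileChange := func() { creds.Expire() }
	_ = onCredentialFileChange
}
```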

@stale

stale bot commented Jul 16, 2020

Hello 👋 Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Jul 16, 2020
@george-angel

remove stale

@stale stale bot removed the stale label Jul 16, 2020
@stale

stale bot commented Aug 15, 2020

Hello 👋 Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Aug 15, 2020
@george-angel

remove stale

@stale stale bot removed the stale label Aug 17, 2020
@stale

stale bot commented Oct 16, 2020

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Oct 16, 2020
@george-angel

remove stale

@stale stale bot removed the stale label Oct 16, 2020
@stale

stale bot commented Dec 15, 2020

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Dec 15, 2020
@george-angel

remove stale

@stale stale bot removed the stale label Dec 15, 2020
@stale

stale bot commented Feb 14, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Feb 14, 2021
@george-angel

remove stale

@stale stale bot removed the stale label Feb 15, 2021
@LittleChimera

LittleChimera commented Mar 1, 2021

How about flipping the liveness check if uploads are failing, and then just letting the liveness probe restart the container?
#3857
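
A rough sketch of what that could look like on the Thanos side, assuming the upload loop flips a flag that a handler on the /-/healthy endpoint reports; the variable and wiring are illustrative:

```go
package main

import (
	"log"
	"net/http"
	"sync/atomic"
)

// uploadHealthy would be flipped by the upload loop: false after an upload
// fails with a credential error, true again after a successful upload.
var uploadHealthy atomic.Bool

func livenessHandler(w http.ResponseWriter, r *http.Request) {
	if !uploadHealthy.Load() {
		// A failing livenessProbe makes the kubelet restart the
		// container, which re-reads the freshly mounted credentials.
		http.Error(w, "object storage uploads failing", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	uploadHealthy.Store(true)
	http.HandleFunc("/-/healthy", livenessHandler)
	log.Fatal(http.ListenAndServe(":10902", nil))
}
```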

@kakkoyun
Member

cc @Biswajitghosh98

@bwplotka
Member

One idea: instead of reloading, we could just have an API for applying configuration at some point.

cc @smintz

@brancz
Member

brancz commented Apr 8, 2021

I've been thinking about this for other applications, and when deployed on Kubernetes, for example, I see no reason for applications not to automatically reload configuration that changes on disk, as it is remounted atomically anyway. So I think Thanos, Prometheus, etc. should all have an opt-in flag, --automatic-reload (naming up for discussion, of course). Then we wouldn't have to have a sidecar for every single thing we deploy just to reload configuration.

@roidelapluie

I've been thinking about this for other applications, and when deployed on Kubernetes, for example, I see no reason for applications not to automatically reload configuration that changes on disk, as it is remounted atomically anyway. So I think Thanos, Prometheus, etc. should all have an opt-in flag, --automatic-reload (naming up for discussion, of course). Then we wouldn't have to have a sidecar for every single thing we deploy just to reload configuration.

I do not really need this, but I can see the need. I think we should then have a more straightforward API than metrics to check whether the config is valid (and not /-/healthy, as it would kill the pod, which could not restart). I also think that we should use inotify, but also throttle the reloads (e.g. once per minute), because a reload is still heavy in Prometheus.
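
A minimal sketch of that inotify-plus-throttle idea using the fsnotify library; the reload callback is a placeholder, and a production version would also watch the parent directory (Kubernetes updates mounted Secrets via symlink swaps) and schedule a trailing reload rather than dropping events inside the throttle window:

```go
package main

import (
	"log"
	"time"

	"github.com/fsnotify/fsnotify"
)

// watchAndReload triggers reload when path changes, but at most once per
// minInterval, since a config reload is expensive.
func watchAndReload(path string, minInterval time.Duration, reload func() error) error {
	w, err := fsnotify.NewWatcher()
	if err != nil {
		return err
	}
	defer w.Close()
	if err := w.Add(path); err != nil {
		return err
	}

	var lastReload time.Time
	for {
		select {
		case ev := <-w.Events:
			// Throttle: collapse event bursts (editors and kubelet
			// symlink swaps fire several events per update).
			if time.Since(lastReload) < minInterval {
				continue
			}
			lastReload = time.Now()
			if err := reload(); err != nil {
				log.Printf("reload after %s: %v", ev.Op, err)
			}
		case err := <-w.Errors:
			log.Printf("watch error: %v", err)
		}
	}
}
```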

Feel free to open an issue on prometheus/prometheus and talk about it on the mailing list and/or dev summit since we probably want that in sync with all the projects, not just prom/prometheus.

@stale

stale bot commented Jun 12, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Jun 12, 2021
@stale

stale bot commented Jun 28, 2021

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Jun 28, 2021