Skip to content
This repository has been archived by the owner on Jul 26, 2022. It is now read-only.

fix(vault-backend): token ttl conditional renew #457

Merged
merged 5 commits into from
Jul 31, 2020
Merged

fix(vault-backend): token ttl conditional renew #457

merged 5 commits into from
Jul 31, 2020

Conversation

EricHorst
Copy link
Contributor

The Hashicorp Vault backend, as it exists, is a simplistic implementation. On the first Vault secret synchronization, when it has no auth token a Vault Kubernetes auth login is performed. On every single subsequent interaction with the Vault server to get a secret, KES first renews the token. As the volume of secrets being synchronized increases and/or with a low poller frequency, the token renew load on the Vault server gets very high. A token renewal is not a lightweight operation, requiring writing to a replicated backend database. When the renew rate is high, it negatively impacts the performance of Vault. It is far better to reuse a token for (most of) the duration of its TTL and renew only when necessary.

We witnessed this problem firsthand and it has been impacting us heavily. Our user base has been adopting ExternalSecrets very rapidly after we deployed KES onto 10 clusters. After a few weeks of KES usage increases, Vault performance began to suffer. This showed up as errors from the Vault GCS backend and slowness in the Vault UI. (Incidentally KES did not handle hard errors from Vault gracefully at all, choosing to die repetitively. This is another problem for another day.)

This patch addresses the issue by replacing the simple always-renew-token behaviour with a token lookup to check the remaining TTL and a conditional renewal. A token lookup is a lightweight read-only operation and provides the authoritative state of the token. The token is then renewed only if the remaining TTL is below a threshold. By default the threshold value is three times the polling interval. Since renewals are done during a poll, ie there is no asynchronous token renewal thread, to be robust we need to renew early enough so that at least one more polling interval remains to retry in the case of a transient renew failure. This default renewal threshold should be reasonable enough for most uses but to ensure flexibility, an environment variable was added to allow override of the default.

This is ultimately a simplistic fix to a simplistic implementation. One could imagine an implementation with a smarter KES client where token information is stored along with the token so even the token lookup could be optimized away. But that would be a more involved rewrite.

Copy link
Member

@Flydiverny Flydiverny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution! Looks good to me :)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants