This repository has been archived by the owner on Jul 26, 2022. It is now read-only.
fix(vault-backend): token ttl conditional renew #457
Merged
The HashiCorp Vault backend, as it exists, is a simplistic implementation. On the first Vault secret synchronization, when it has no auth token, it performs a Vault Kubernetes auth login. On every subsequent request to the Vault server for a secret, KES first renews the token. As the number of secrets being synchronized grows and/or the polling interval gets shorter, the token renewal load on the Vault server becomes very high. A token renewal is not a lightweight operation: it requires a write to a replicated backend database. A high renewal rate therefore degrades Vault's performance. It is far better to reuse a token for most of its TTL and renew only when necessary.
We witnessed this problem firsthand, and it has been impacting us heavily. Our user base adopted ExternalSecrets very rapidly after we deployed KES onto 10 clusters. After a few weeks of growing KES usage, Vault performance began to suffer. This showed up as errors from the Vault GCS backend and slowness in the Vault UI. (Incidentally, KES does not handle hard errors from Vault gracefully at all and dies repeatedly. That is another problem for another day.)
This patch addresses the issue by replacing the simple always-renew behaviour with a token lookup that checks the remaining TTL, followed by a conditional renewal. A token lookup is a lightweight read-only operation and returns the authoritative state of the token. The token is then renewed only if the remaining TTL is below a threshold, which defaults to three times the polling interval. Renewals happen during a poll, i.e. there is no asynchronous token renewal thread, so to be robust we need to renew early enough that at least one more polling interval remains in which to retry after a transient renewal failure. The default threshold should be reasonable for most uses, but for flexibility an environment variable was added to override it.
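To make the lookup-then-conditionally-renew flow concrete, here is a minimal sketch in Python. It is not the code from this patch. `FakeVault`, `TokenManager`, `simulate` and the `override_seconds` parameter are hypothetical names invented for illustration, and the fake client only counts calls instead of talking to a real Vault server.

```python
from dataclasses import dataclass
from typing import Optional


def renew_threshold_seconds(poll_interval_seconds: float,
                            override_seconds: Optional[float] = None) -> float:
    """Default to three polling intervals unless an override is supplied."""
    if override_seconds is not None:
        return override_seconds
    return 3 * poll_interval_seconds


def needs_renewal(remaining_ttl_seconds: float, threshold_seconds: float) -> bool:
    """Renew only when the remaining TTL has dropped below the threshold."""
    return remaining_ttl_seconds < threshold_seconds


@dataclass
class FakeVault:
    """Hypothetical in-memory stand-in for a Vault client; it only counts calls."""
    lease_seconds: float = 3600.0
    ttl: float = 0.0
    logins: int = 0
    lookups: int = 0
    renewals: int = 0

    def login(self) -> str:
        self.logins += 1
        self.ttl = self.lease_seconds
        return "fake-token"

    def lookup_ttl(self, token: str) -> float:
        self.lookups += 1  # cheap read-only call
        return self.ttl

    def renew(self, token: str) -> None:
        self.renewals += 1  # expensive write on a real server
        self.ttl = self.lease_seconds

    def advance(self, seconds: float) -> None:
        """Simulate the passage of time between polls."""
        self.ttl = max(0.0, self.ttl - seconds)


class TokenManager:
    """Logs in once, then looks up the TTL and renews only below the threshold."""

    def __init__(self, vault: FakeVault, poll_interval_seconds: float,
                 override_seconds: Optional[float] = None) -> None:
        self.vault = vault
        self.threshold = renew_threshold_seconds(poll_interval_seconds, override_seconds)
        self.token: Optional[str] = None

    def ensure_token(self) -> str:
        if self.token is None:
            self.token = self.vault.login()
            return self.token
        if needs_renewal(self.vault.lookup_ttl(self.token), self.threshold):
            self.vault.renew(self.token)
        return self.token


def simulate(polls: int, poll_interval_seconds: float,
             lease_seconds: float = 3600.0) -> FakeVault:
    """Run one secret fetch per poll and return the fake server with its counters."""
    vault = FakeVault(lease_seconds=lease_seconds)
    manager = TokenManager(vault, poll_interval_seconds)
    for _ in range(polls):
        manager.ensure_token()
        vault.advance(poll_interval_seconds)
    return vault
```

Under these assumptions (a 3600 s lease, one fetch every 10 s), `simulate(400, 10.0)` ends with one login, 399 lookups and a single renewal, where the always-renew behaviour would have issued 399 renewals over the same run.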
This is ultimately a simplistic fix to a simplistic implementation. One could imagine an implementation with a smarter KES client where token information is stored along with the token so even the token lookup could be optimized away. But that would be a more involved rewrite.
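As a rough illustration of that smarter-client idea, the sketch below records an estimated expiry time next to the token so that the remaining TTL can be computed locally. `CachedToken` and its methods are hypothetical names, not part of KES or of any Vault client library. A local estimate is also not authoritative: it cannot see a revoked token or a renewal that the server capped, which is part of why this would be a more involved rewrite.

```python
import time
from typing import Callable


class CachedToken:
    """Hypothetical holder that tracks a token's estimated expiry locally."""

    def __init__(self, token: str, lease_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.token = token
        self._clock = clock
        # Expiry estimated from the lease duration reported at login.
        self._expires_at = clock() + lease_seconds

    def remaining_seconds(self) -> float:
        """Estimated remaining TTL, computed without a server round trip."""
        return max(0.0, self._expires_at - self._clock())

    def record_renewal(self, lease_seconds: float) -> None:
        """Reset the estimate after a successful renewal."""
        self._expires_at = self._clock() + lease_seconds

    def needs_renewal(self, threshold_seconds: float) -> bool:
        return self.remaining_seconds() < threshold_seconds
```

The clock is injectable so the behaviour can be exercised without waiting on real time.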