fix(vault-backend): token ttl conditional renew #457

EricHorst · 2020-07-28T01:00:44Z

The HashiCorp Vault backend, as it exists, is a simplistic implementation. On the first Vault secret synchronization, when it has no auth token, it performs a Vault Kubernetes auth login. On every subsequent interaction with the Vault server to get a secret, KES first renews the token. As the volume of secrets being synchronized increases and/or the polling interval gets shorter, the token renewal load on the Vault server becomes very high. A token renewal is not a lightweight operation: it requires writing to a replicated backend database. When the renewal rate is high, Vault's performance suffers. It is far better to reuse a token for (most of) the duration of its TTL and renew only when necessary.

We witnessed this problem firsthand and it has been impacting us heavily. Our user base adopted ExternalSecrets very rapidly after we deployed KES onto 10 clusters. After a few weeks of growing KES usage, Vault performance began to suffer. This showed up as errors from the Vault GCS backend and slowness in the Vault UI. (Incidentally, KES did not handle hard errors from Vault gracefully at all, crashing repeatedly instead. That is another problem for another day.)

This patch addresses the issue by replacing the simple always-renew-token behaviour with a token lookup that checks the remaining TTL, followed by a conditional renewal. A token lookup is a lightweight read-only operation and provides the authoritative state of the token. The token is then renewed only if the remaining TTL is below a threshold. By default the threshold is three times the polling interval. Renewals happen during a poll (i.e., there is no asynchronous token renewal thread), so to be robust we need to renew early enough that at least one more polling interval remains to retry after a transient renewal failure. This default threshold should be reasonable for most uses, but for flexibility an environment variable was added to override it.
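The lookup-then-renew flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in the patch: `renewThresholdSeconds`, `ensureFreshToken`, and the client methods `lookupSelf()` and `renewSelf()` are hypothetical names, and the override argument stands in for whatever environment variable the patch actually reads. The only Vault behaviour assumed is that a token lookup reports the remaining TTL in seconds under `data.ttl`.

```javascript
// Hypothetical helper: compute the renewal threshold in seconds.
// `override` stands in for the value of the (unnamed here) environment variable.
function renewThresholdSeconds (pollIntervalMs, override) {
  const parsed = Number(override)
  const usable = override !== undefined && override !== '' &&
    Number.isFinite(parsed) && parsed >= 0
  if (usable) {
    return parsed
  }
  // Default: three polling intervals, converted from milliseconds to seconds.
  return (3 * pollIntervalMs) / 1000
}

// Hypothetical client interface: lookupSelf() and renewSelf() are placeholders
// for whatever calls the Vault client library exposes.
async function ensureFreshToken (client, thresholdSeconds) {
  // Read-only lookup: authoritative remaining TTL, in seconds.
  const lookup = await client.lookupSelf()
  const remaining = lookup.data.ttl

  if (remaining < thresholdSeconds) {
    // Renew early so later polls can still retry if this call fails transiently.
    await client.renewSelf()
    return 'renewed'
  }
  // Enough TTL left: reuse the token and skip the write on the Vault side.
  return 'reused'
}
```

With a 10 second polling interval and no override, the threshold in this sketch is 30 seconds, so a token with an hour of TTL left is reused on every poll until it gets close to expiry.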

This is ultimately a simplistic fix to a simplistic implementation. One could imagine an implementation with a smarter KES client where token information is stored along with the token so even the token lookup could be optimized away. But that would be a more involved rewrite.
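For illustration only, the "smarter client" idea might look something like the sketch below, which remembers when the token should expire so that no lookup is needed on each poll. It is not part of this patch. The `CachedToken` class and its methods are hypothetical, and it assumes the login or renew response carries `client_token` and `lease_duration` (in seconds) in its `auth` section.

```javascript
// Hypothetical token cache: tracks the expected expiry locally instead of
// asking Vault for the remaining TTL on every poll.
class CachedToken {
  // `now` is injectable so the clock can be faked in tests.
  constructor (now = () => Date.now()) {
    this.now = now
    this.token = null
    this.expiresAtMs = 0
  }

  // Record the token and its expiry from the `auth` part of a login/renew response.
  store (auth) {
    this.token = auth.client_token
    this.expiresAtMs = this.now() + auth.lease_duration * 1000
  }

  // Locally estimated remaining TTL in seconds (0 when no token is held).
  remainingSeconds () {
    if (this.token === null) {
      return 0
    }
    return Math.max(0, (this.expiresAtMs - this.now()) / 1000)
  }

  // True when a login or renewal is due according to the local estimate.
  needsRenewal (thresholdSeconds) {
    return this.remainingSeconds() < thresholdSeconds
  }
}
```

The trade-off is that the local estimate is not authoritative: a token revoked on the server would still look valid here, which is one reason the lookup-based approach is the simpler and safer fix.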

Flydiverny

Thanks for the contribution! Looks good to me :)

EricHorst added 5 commits July 22, 2020 19:00

Merge pull request #1 from godaddy/master

189a52d

refactor(logger): update to support pino@7 (#445)

Merge pull request #2 from godaddy/master

5af1c5a

update to upstream

fix(vault-backend): token ttl conditional renew

51bac07

fix(vault-backend): grr, trailing whitespace

06e9afc

fix(vault-backend): add and correct tests

445b10f

Flydiverny approved these changes Jul 31, 2020

Flydiverny added the vault label Jul 31, 2020

Flydiverny merged commit a52987b into external-secrets:master Jul 31, 2020

rvaidya mentioned this pull request Aug 16, 2020

GCS Vault Backend Rate Limiting #468

Closed

megakid mentioned this pull request Sep 18, 2020

Specifying unique vaultRole per external secrets result in permission denied for all but one #487

Closed
