vault: expired tokens count toward batch limit #8553
Conversation
Force-pushed from 21f8f71 to 49db531.
LGTM. Testing nitpicks aside, which you can ignore.
nomad/vault_test.go (outdated)

```go
if err != nil {
	t.Fatalf("failed to build vault client: %v", err)
}
```
We can use `require` for consistency with other assertions:
```diff
-if err != nil {
-	t.Fatalf("failed to build vault client: %v", err)
-}
+require.NoError(t, err)
```
```go
v.Config.Token = defaultTestVaultWhitelistRoleAndToken(v, t, 5)

// Disable client until we can change settings for testing
conf := v.Config.Copy()
conf.Enabled = helper.BoolToPtr(false)
```
I suspect that copying isn't necessary here since we are still initializing the config. If it is necessary, it's unclear why the token can be modified above without copying.
I just noticed that `vaultClient` keeps a reference to the Config, so I copied it out of an abundance of caution.
```go
	resultCh <- nil
}

return nil
```
This is a good assertion as-is, but I would like to add a check that we actually received all messages, not just that we received 3 batches no bigger than the expected size. Maybe add a counter for the number of purge calls and how many accessors have been received so far. Also, you can send `nil` on `resultCh` (or close it) only once all accessors have been received over 3 batches; then you can avoid the loop below.
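A rough sketch of that shape, where `expectedAccessors` (the total number of accessors the test created) and the exact purge-hook signature are assumed for illustration rather than taken from this PR:

```go
// Sketch only: count purge calls and accessors received, closing
// resultCh once everything has arrived so no polling loop is needed.
var purgeCalls, received int
purgeFn := func(accessors []*structs.VaultAccessor) error {
	purgeCalls++
	if len(accessors) > maxVaultRevokeBatchSize {
		resultCh <- fmt.Errorf("batch too large: %d accessors", len(accessors))
		return nil
	}
	received += len(accessors)
	if received == expectedAccessors {
		// All accessors arrived; the test can also assert purgeCalls == 3.
		close(resultCh)
	}
	return nil
}
```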
Much better idea, thanks.
```diff
@@ -1305,7 +1311,7 @@ func (v *vaultClient) revokeDaemon() {
 		revoking = append(revoking, va)
 	}
 
-	if len(revoking) >= toRevoke {
+	if len(revoking)+len(ttlExpired) >= toRevoke {
```
Maybe add a comment indicating that `maxVaultRevokeBatchSize` is meant to constrain the batch size both for submitting requests to Vault and for restricting the size of Raft messages, hence the need to account for `ttlExpired` too.
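One possible wording for such a comment (placement and phrasing assumed, not from this PR):

```go
// maxVaultRevokeBatchSize caps each revocation pass: it bounds both the
// number of revocation requests sent to Vault and the size of the Raft
// message that purges the accessors, so already-expired (ttlExpired)
// tokens must count toward the cap as well.
```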
Force-pushed from 49db531 to 7c9bec5.
As of 0.11.3, Vault token revocation and purging was done in batches. However, the batch size was only limited by the number of *non-expired* tokens being revoked.

Due to bugs prior to 0.11.3, *expired* tokens were not properly purged. Long-lived clusters could have thousands to *millions* of very old expired tokens that never got purged from the state store. Since these expired tokens did not count against the batch limit, very large batches could be created and overwhelm servers.

This commit ensures expired tokens count toward the batch limit with this one-line change:

```diff
-	if len(revoking) >= toRevoke {
+	if len(revoking)+len(ttlExpired) >= toRevoke {
```

However, this code was difficult to test due to being in a periodically executing loop. Most of the changes are to make this one-line change testable and test it.
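For readers skimming the diff, here is a minimal sketch (assumed names and shapes, not Nomad's exact code) of the batch selection the description refers to; expired tokens skip the Vault revocation call but are still purged via Raft, so they must count toward the cap:

```go
// buildBatch partitions pending accessors into those needing a live Vault
// revocation and those already expired, capping the combined batch size.
func buildBatch(pending map[string]time.Time, max int, now time.Time) (revoking, ttlExpired []string) {
	for accessor, expiry := range pending {
		if expiry.Before(now) {
			ttlExpired = append(ttlExpired, accessor) // purge-only via Raft
		} else {
			revoking = append(revoking, accessor) // revoke in Vault, then purge
		}
		// The fix: count both slices toward the cap, not just revoking.
		if len(revoking)+len(ttlExpired) >= max {
			break
		}
	}
	return revoking, ttlExpired
}
```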
Force-pushed from 7c9bec5 to 4a14604.
Is this PR an appropriate place to request some metrics around this? I think it'd be useful to have the number of tokens that are active/to-be-revoked/being-revoked as a way to see things like high job churn, or a rare case I've seen where tokens just seem to be lost (e.g. I've seen allocations with `{{ with secret "pki" "ttl=7d" }}` that never refresh).
@chuckyz That's a great idea. At a glance I couldn't decide exactly what metric to add, so I'm going to go ahead and merge this PR. Please open an issue if you can think of an exact metric to track here; otherwise I'll probably aim for tracking the pending revocations.
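If pending revocations end up being the metric, a minimal sketch using the armon/go-metrics package Nomad's telemetry is built on; the metric name and call site are illustrative assumptions, not actual Nomad telemetry:

```go
// Gauge the number of accessors still awaiting revocation so operators
// can watch for buildup (e.g. high job churn or lost tokens).
metrics.SetGauge([]string{"nomad", "vault", "pending_revocations"},
	float32(len(v.revoking)))
```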
*Cherry-pick of #8553 to branch off of v0.11.3 tag.*
Fix some capitalization too.
docs: add #8553 to changelog
This log line should be rare since:

1. Most tokens should be logged synchronously, not via this async batched method. Async revocation only takes place when Vault connectivity is lost and after leader election, so no revocations are missed.
2. There should rarely be >1 batch (1,000) of tokens to revoke since the above conditions should be brief and infrequent.
3. The interval is 5 minutes, so this log line will be emitted at *most* once every 5 minutes.

What makes this log line rare is also what makes it interesting: due to a bug prior to Nomad 0.11.2, some tokens may never get revoked. Therefore Nomad tries to re-revoke them on every leader election. This caused a massive buildup of old tokens that would never be properly revoked and purged. Nomad 0.11.3 mostly fixed this but still had a bug in purging revoked tokens via Raft (fixed in #8553).

The nomad.vault.distributed_tokens_revoked metric is only ticked upon successful revocation and purging, making any bugs or slowness in the process difficult to detect. Logging before a potentially slow revocation+purge operation is performed will give users much better indication of what activity is going on should the process fail to make it to the metric.
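For illustration, such a pre-revocation log line might look like the following sketch (message and field names assumed, hclog-style logger):

```go
// Emitted before the potentially slow revoke+purge so operators see
// activity even if the operation never completes and ticks the metric.
v.logger.Info("sending batch to Vault for revocation",
	"accessors", len(revoking), "ttl_expired", len(ttlExpired))
```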
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.