-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up leadership establishment #8036
Conversation
Establishing leadership should be very fast and never make external API calls. This fixes a situation where there is a long backlog of Vault tokens to be revoked on when leadership is gained. In such case, revoking the tokens will significantly slow down leadership establishment and slow down processing. Worse, the revocation call does not honor leadership `stopCh` signals, so it will not stop when the leader loses leadership.
Start serving RPC immediately after leader components are enabled, and move clean up to the bottom as they don't block leadership responsibilities.
If a token is scheduled for revocation expires before we revoke it, ensure that it is marked as purged in raft and is only removed from local vault state if the purge operation succeeds. Prior to this change, we may remove the accessor from local state but not purge it from Raft. This causes unnecessary and churn in the next leadership elections (and until 0.11.2 result in indefinite retries).
purgeFunc cannot be nil, so ensure it's set to a no-op function in tests.
Thank you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Comments are just wording tweaks and not blockers.
nomad/leader.go
Outdated
// Activate RPCs after all leadership components are enabled | ||
// and the server can handle write RPCs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment seems a bit off now that not all "leadership components" are enabled at this point (although I guess all of them that set the SetEnabled pattern are).
Is there a specific operation that makes the leader ready to serve consistent reads? If so let's specifically call that one out so that as we add/remove/move calls in this func we have a way to determine whether they need to come before or after this point.
Co-authored-by: Michael Schurter <[email protected]>
e9d14a3
to
5bf4868
Compare
5bf4868
to
127a960
Compare
…restore Speed up leadership establishment
cherry-picked into 0.11.3 |
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
Establishing leadership should be very fast and never make potentially slow external API calls.
This PR makes the following changes:
Context and Motivation
Upon upgrading to 0.11.2, a cluster suffered many errors related to Nomad being slow to establish leadership and failing to respond to RPCs with "Not ready to serve consistent reads". The cause was that
revokeVaultAccessorsOnRestore
was running for along time and blocking the leader from performing its duties.There are few moving parts in 0.11.2 that contributed to the issue:
The cluster had a very large number of vault tokens to be revoked. In one instance in the cluster history, over a million Vault tokens were revoked all at once! Presumably when the server gained leadership, many of the tokens were already revoked, either because Token TTL expired or because the previous leader revoke them prior before it lost leadership.
In processing this, two problems happen:
First, Nomad attempts to revoke the tokens in leadership establishment goroutine and block further leadership establishment. Changes in #5911 made the situation more visible. Prior to that PR, long vault revocation causes the leader to be half-wedged state, it may respond to RPCs but will not establish node heartbeat timers or schedule periodic jobs. That PR made the situation more obvious.
However, a second bug lessened the severity of the first bug: Vault revocation call is flaky and may terminate quickly. Prior to #7959 , if some of the tokens were already revoked or expired already,
RevokeToken
call fails fast and but it retries indefinitely in the background as none of the retries will succeed; this probably revokeDaemon may never actually revoke any tokens as the retries will keep failing with these tokens. PR #7959 made vault revocation thread keep going even with already revoked tokens. However, this also means that now that the revocation thread will process that long list of tokens in leader establishment thread and slow everything to a halt!This PR addresses both of these problems by ensuring that the leadership establishment is fast, that token revocation always happen in background. Also, given that cleanup processes aren't critical or time sensitive to leader functionality, we move their order so we prioritize other more critical functionality (periodic jobs, node heartbeat tracking, etc).
Furthermore, when processing a super long list of revocation, we only check for the token TTL at the start to avoid revoking already expired tokens. But if the token TTLs are short (~60 seconds) relative to revocation of the long list, we may overwhelm Vault by making a lot of unnecessary API calls. Also, if any of their revocation fails, nomad reattempts to revoke all of them again!
This PR breaks the background revocation to be in 1000-token batches. This reduces the burden on Vault a bit and makes the retry handle for failed tokens a tad better (retry at most 1000 instead of 1 million tokens).