Potential fixes for flapping leadership #6951
Conversation
This fixes a condition where we may attempt to reconcile and establish leadership even after `stopCh` or `shutdownCh` is closed. Consider the case where a step-down occurs during the leaderLoop Barrier call and/or it times out. The Barrier call times out after 2m by default, but the reconcile interval defaults to 1m. Thus, both the `<-stopCh` and `<-interval` clauses are ready in the `WAIT` select statement. Go may arbitrarily choose one, resulting in a potentially unnecessary Barrier call. Here, we prioritize honoring `stopCh` and ensure we return early if Barrier or reconciliation fails.
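For illustration, here is a minimal sketch of that prioritization (the shape is assumed, not the actual `leaderLoop` body; `waitLoop` and its parameters are hypothetical names):

```go
import "time"

// waitLoop sketches re-checking stopCh after the interval fires, since
// Go's select chooses pseudo-randomly among ready cases.
func waitLoop(stopCh <-chan struct{}, interval <-chan time.Time) {
	for {
		select {
		case <-stopCh:
			return
		case <-interval:
			// Both cases may have been ready at once; re-check stopCh
			// before doing expensive work such as the Barrier call.
			select {
			case <-stopCh:
				return
			default:
			}
			// ... reconcile / Barrier work would go here ...
		}
	}
}
```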
If a non-leader is promoted but then immediately loses leadership, don't bother establishing leadership and attempting Barrier raft transactions.
```go
// if gained and lost leadership immediately, move on without emitting error
if suppressed && !isLeader && weAreLeaderCh == nil {
	s.logger.Info("cluster leadership acquired but lost immediately")
```
Does flapping indicate something an operator should be aware of, such as excessive CPU load or network issues? If so, we might want to document that in at least a code comment, as I'm worried this log message may flummox the first users who hit it. If it does hint at some degenerate state, users who see it might already be stressed and confused, so if there are any more hints or context we can offer, we should.
To add to this, I've often found as an operator that the `info` log level is indecisive. If it's something I need to pay attention to and take action on, it should probably be at `warn`. Otherwise it's telling me "hey, this is something you want to know about" without telling me what to do with that information.
(My rule of thumb has often been that `info` is for information at startup like "I'm listening on port 1234 and running with plugins X and Y", and then not used after that, but I don't think we have firm guidelines anywhere on Nomad.)
Hm, does this eliminate the `leaderCh` overflow? I don't think so. For example, it doesn't remove the situation where leadership is lost; you just increase the `leaderCh` buffer size, which IMHO is not enough in all cases.
```go
// Returns:
//   leader: last buffered leadership state
//   suppressed: true if method dequeued elements from channel
func suppressLeadershipFlaps(isLeader bool, ch <-chan bool) (leader, suppressed bool) {
```
Effectively what we're doing here is that if we get a notification that we're now the leader, we drain the notification channel to make sure that there hasn't been more than one notification since we handled the last one (in the outer loop of `monitorLeadership`).
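A minimal sketch of that drain, assuming the signature shown in the diff above (the body here is illustrative, not necessarily the PR's exact code):

```go
// suppressLeadershipFlaps drains any queued notifications without
// blocking and keeps only the most recent leadership state.
func suppressLeadershipFlaps(isLeader bool, ch <-chan bool) (leader, suppressed bool) {
	leader = isLeader
	for {
		select {
		case v := <-ch:
			leader = v
			suppressed = true
		default:
			// Channel empty: leader now holds the last buffered state.
			return leader, suppressed
		}
	}
}
```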
While this solves for a very tight loop of leadership flapping, I think we can still get into a flapping state if the leadership transitions are happening at a rate which is slower than the outer loop of `monitorLeadership` -- in that case there will only ever be one notification in the queue at a time, but the leadership loop will still be continuously flapping.
(Moving the `WAIT:` label in `leaderLoop`, on the other hand, I suspect will be really valuable for more realistic flapping cases.)
```diff
-// Setup the leader channel
-leaderCh := make(chan bool, 1)
+// Set up a channel for reliable leader notifications.
+leaderCh := make(chan bool, 10)
```
:-) Hm, in many cases this will be enough, so why did you extend this value? You already added the `suppressLeadershipFlaps` function, which doesn't allow `leaderCh` to overflow. Just in case?
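To make the overflow concern concrete, here is a hypothetical standalone demo (it assumes the producer publishes transitions with a non-blocking send; the real notification path may behave differently):

```go
package main

import "fmt"

func main() {
	leaderCh := make(chan bool, 10)

	// Non-blocking publisher: once the buffer is full, further
	// transitions are silently dropped, so no fixed buffer size can
	// capture an arbitrarily fast flapping sequence.
	notify := func(isLeader bool) {
		select {
		case leaderCh <- isLeader:
		default:
			fmt.Println("dropped transition:", isLeader)
		}
	}

	// Simulate 12 rapid gain/lose flaps against a buffer of 10;
	// the last two transitions are lost.
	for i := 0; i < 12; i++ {
		notify(i%2 == 0)
	}
}
```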
Closing this one as it's superseded by #6977
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
This PR attempts to pick up some issues found in #4749. Issue 4749 has some other issues, but this addresses one of the problems associated with leadership flapping.
In particular, this PR addresses two issues:
First, if a non-leader is promoted but then immediately loses leadership, don't bother establishing leadership and attempting Barrier raft transactions. By using a buffered channel, before establishing leadership we peek into the rest of the channel to see if we lost it before attempting to establish leadership. This should only occur when leadership is flapping while `leaderLoop` is running/shutting down.
Second, this fixes a condition where we may reattempt to reconcile and establish leadership even after we lost raft leadership. Consider the case where a step-down occurs during the leaderLoop Barrier call and/or it times out. The Barrier call times out after 2m by default, but the reconcile interval defaults to 1m. Thus, both the `<-stopCh` and `<-interval` clauses are ready in the `WAIT` select statement. Go may arbitrarily choose one, resulting in a potentially unnecessary Barrier call. Here, we prioritize honoring `stopCh` and ensure we return early if Barrier or reconciliation fails.