Possible deadlock situation when leader is flapping #6852
Comments
For posterity, I tried to re-follow this chain and there are a couple of links missing in the description I had to figure out again.
This is still pretty hard to hold in your head! But a possible sequence of events here between the 3+ goroutines involved in the deadlock cycle could be:
Bear in mind that this happened on a heavily loaded server, so operations that should take microseconds are likely struggling to get CPU time, and raft is flapping because it is not being serviced in a timely way. The sequence may seem implausible, and it would be hard to replicate in a controlled way, but once a server is overloaded the chances of hitting something like this are much higher.
Overview of the Issue
When running some load testing, we saw two of the servers with more goroutines than we expected. Even after the load test had finished, the number kept going up. The logs showed that the leader was flapping, and, importantly, that the same server lost and acquired leadership within the same second.
Dumping the goroutines helped us understand what was going on:
monitorLeadership starts leaderLoop and adds to a waitgroup:
consul/agent/consul/leader.go
Lines 74 to 78 in fd3c56f
monitorLeadership waits on that waitgroup before it can finally shut down. It is done that way to make sure Consul only ever runs a single leaderLoop:
consul/agent/consul/leader.go
Lines 88 to 90 in fd3c56f
consul/agent/consul/server.go
Line 725 in 66d138f
true has already been written to the channel to indicate that leadership was acquired, and now raft wants to write false, but it can't because the write is blocked. The raft leader loop is now blocked:
consul/vendor/github.com/hashicorp/raft/raft.go
Lines 426 to 434 in fd3c56f
Because PromoteNonVoters was started while this server was still leader and is still running, it now blocks on getting the raft configuration: raft no longer runs its loop, since it is waiting to write to the notify channel:
consul/agent/consul/autopilot.go
Lines 67 to 68 in fd3c56f
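The wait cycle described above can be hard to picture, so here is a minimal, self-contained Go sketch of it. It is a toy model, not Consul or raft code: notifyCh, configReq and deadlocks are hypothetical stand-ins, and the sequence is collapsed to its smallest form (one goroutine blocked writing to a notification channel, one waiting on a waitgroup, one waiting for the first to serve a request). The buffered variant is included only to show that, in this model, the cycle depends on the blocking channel write; it is not a statement about the fix that was adopted.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// deadlocks runs a toy model of the wait cycle and reports whether it failed
// to finish within timeout. Every name here is a hypothetical stand-in.
func deadlocks(notifyBuf int, timeout time.Duration) bool {
	notifyCh := make(chan bool, notifyBuf) // stand-in for the leadership notify channel
	configReq := make(chan chan struct{})  // stand-in for "get configuration" requests
	done := make(chan struct{})

	// Stand-in for raft: report leadership gained, then lost, and only
	// afterwards go back to serving requests from its loop.
	go func() {
		notifyCh <- true
		notifyCh <- false // blocks when the channel is unbuffered and nobody receives
		for reply := range configReq {
			close(reply)
		}
	}()

	// Stand-in for monitorLeadership.
	go func() {
		var wg sync.WaitGroup
		<-notifyCh // leadership acquired
		wg.Add(1)
		go func() { // stand-in for leaderLoop
			defer wg.Done()
			reply := make(chan struct{})
			configReq <- reply // needs the raft stand-in to be back in its loop
			<-reply
		}()
		wg.Wait() // wait for the single leader loop before reading the next notification
		<-notifyCh
		close(done)
	}()

	select {
	case <-done:
		close(configReq) // let the raft stand-in exit
		return false
	case <-time.After(timeout):
		return true // the stuck goroutines stay blocked and are leaked
	}
}

func main() {
	fmt.Println("unbuffered notify channel deadlocks:", deadlocks(0, 200*time.Millisecond))
	fmt.Println("buffered (size 1) notify channel deadlocks:", deadlocks(1, 2*time.Second))
}
```

With an unbuffered channel the three goroutines wait on each other and the call should time out; with a buffer of one the second notification can be queued, the request is served, and the model should run to completion.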
Fixes
consul/vendor/github.com/hashicorp/raft/config.go
Lines 186 to 189 in fd3c56f
Related
Raw goroutine dump of deadlocked server
https://gist.github.com/banks/e6d14f2f94a49bbba48d52023b643008
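For readers who want to capture a similar dump from their own Go process, the following is a small sketch using the standard library's runtime and runtime/pprof packages. It is not necessarily how the dump in the gist above was produced, and goroutineDump is a hypothetical helper name.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
)

// goroutineDump returns the current goroutine count together with a stack
// dump of every goroutine in the process.
func goroutineDump() (int, string, error) {
	var buf bytes.Buffer
	// debug=2 writes the stacks in the same form as an unrecovered panic.
	if err := pprof.Lookup("goroutine").WriteTo(&buf, 2); err != nil {
		return 0, "", err
	}
	return runtime.NumGoroutine(), buf.String(), nil
}

func main() {
	n, dump, err := goroutineDump()
	if err != nil {
		fmt.Fprintln(os.Stderr, "dump failed:", err)
		os.Exit(1)
	}
	fmt.Printf("%d goroutines\n%s", n, dump)
}
```

Sampling the count periodically is a cheap way to notice the kind of steady growth described in the overview, and the full dump shows where each goroutine is blocked.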