Prevent checkwatcher from deadlocking from too many check restarts #5957
Conversation
Each time the consul check watcher runs its main loop, it has three choices: 1. update watched checks, 2. check & restart failed allocs, or 3. exit. Updating watched checks drains `checkUpdateCh` and restarts synchronously fill `checkUpdateCh`. If too many checks require a restart during the main loop, the `checkUpdateCh` channel will become full and block. Once this happens, the channel cannot be drained, and the check watcher deadlocks.

To fix this, we pass a bool to the prekill hook indicating that the restart is due to a check restart, so that during task deregistration we do not attempt to insert messages into the `checkUpdateCh` channel. These checks are removed by the check watcher once the task runner's Restart function returns anyway.

Additionally, the deregister function no longer removes consul tasks twice. It used to do this when canary tasks had a different generated consul ID, but that is no longer the case, and the second removal is unnecessary; it only generates errors in the nomad and consul logs. While this does not itself cause the deadlock, it made the deadlock appear faster by inserting twice as many messages into the channel.

Tests have been adjusted to account for this where I could find them.
Fixes #5395. Alternative to #5957.

Make task restarting asynchronous when handling check-based restarts. This matches the pre-0.9 behavior where TaskRunner.Restart was an asynchronous signal. The check-based restarting code was not designed to handle blocking in TaskRunner.Restart; 0.9 made it reentrant and could easily overwhelm the buffered update chan and deadlock.

Many thanks to @byronwolfman for his excellent debugging, PR, and reproducer! I created this alternative because changing the functionality of TaskRunner.Restart has a much larger impact. This approach reverts to old known-good behavior and minimizes the number of places changes are made.
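For illustration only, here is a standalone Go sketch of that asynchronous-restart idea. It is not the actual #5975 diff and the names are invented; it only shows the shape of the change: restarts are fired from their own goroutines, so the loop that drains the update channel is never the goroutine blocked on sending into it.

```go
// Standalone sketch of the asynchronous-restart idea -- invented names, not
// the #5975 diff. Restarts run on their own goroutines, so the watcher's
// drain loop is never the goroutine blocked on a send into updateCh.
package main

import (
	"fmt"
	"sync"
)

func main() {
	const buffer = 2
	updateCh := make(chan string, buffer)
	var wg sync.WaitGroup

	// restartAsync models treating Restart as a fire-and-forget signal:
	// any follow-up sends happen off the watcher's goroutine.
	restartAsync := func(id int) {
		wg.Add(1)
		go func() {
			defer wg.Done()
			updateCh <- fmt.Sprintf("unwatch-%d", id) // may block, but not the watcher
		}()
	}

	for loop := 0; loop < 3; loop++ {
		for i := 0; i <= buffer; i++ {
			restartAsync(i)
		}
		// The watcher stays free to drain, so every pending send completes.
		for drained := 0; drained <= buffer; drained++ {
			fmt.Println("drained:", <-updateCh)
		}
	}
	wg.Wait()
}
```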
Amazing investigative work and a great PR! I created #5975 as an alternative since changing the behavior of TaskRunner.Restart has a large impact across many subsystems. It's what caused this bug to begin with! The idea behind #5975 is to restore the 0.8 behavior of Restart calls being asynchronous. This allows checkRestarter.Run to continue to apply updates without risking blocking. Thanks for your hard work investigating this. It made it extremely easy for me to reproduce and diagnose on my end! I'll leave this PR open until the team decides which approach is preferred.
Thanks for the alternative PR @schmichael! Looks much cleaner than mine, so I'm willing to bet it will have fewer unintended consequences. :)
Closing this in favor of #5975
Fixes #5395
Each time the consul check watcher runs its main loop, it has three choices: 1. update watched checks, 2. check & restart failed allocs, or 3. exit. Updating watched checks drains `checkUpdateCh`; restarts synchronously fill `checkUpdateCh`. If too many checks require a restart during the main loop, the `checkUpdateCh` channel will become full and block. Once this happens, the channel cannot be drained, and the check watcher deadlocks.
A check restart travels through several functions, so here's a simplified
view of the run loop:
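(The exact run-loop snippet from the original report isn't reproduced here. As a stand-in, the following is a self-contained toy model in plain Go -- invented names, not the actual Nomad `checkWatcher` code -- of the pattern described above: a single loop that both drains a buffered update channel and, while handling restarts, synchronously pushes new messages into that same channel.)

```go
// Toy model of the deadlock only -- not Nomad source. A single goroutine
// owns the loop: it drains updateCh, then handles "restarts" that each
// synchronously send a message back into updateCh. Once the buffer fills
// mid-restart, the send blocks, and the only goroutine that could drain
// the channel is the one that is blocked: a deadlock.
package main

import "fmt"

func main() {
	const buffer = 2 // stands in for checkUpdateCh's fixed buffer
	updateCh := make(chan string, buffer)

	for loop := 0; ; loop++ {
		// Branch 1: apply pending watch/unwatch updates (drain the channel).
		for len(updateCh) > 0 {
			<-updateCh
		}

		// Branch 2: restart failed checks. In Nomad 0.9 each restart is
		// synchronous, and its prekill/deregister path sends more messages
		// into the same channel before returning.
		for i := 0; i <= buffer; i++ {
			fmt.Printf("loop %d: restart %d queues an update\n", loop, i)
			updateCh <- "unwatch" // blocks forever once the buffer is full
		}
	}
}
```

Running this prints a few restart lines and then the Go runtime reports `fatal error: all goroutines are asleep - deadlock!`, which is the same shape of failure described above, just without Nomad around it.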
To fix this, we pass a bool to the prekill hook indicating that the restart is due to a check restart, so that during task deregistration we do not attempt to insert messages into the `checkUpdateCh` channel. These checks are removed by the check watcher once the task runner's Restart function returns anyway.
Here's a job file to help test for the deadlock:
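(The actual job file isn't included here; the HCL below is a hypothetical reconstruction based on the description that follows, with a made-up image, check path, and timings. The only properties that matter are the `count` of 7 and a health check that can never pass.)

```hcl
# Hypothetical reconstruction, not the job file from the report.
job "checkwatcher-repro" {
  datacenters = ["dc1"]

  group "web" {
    count = 7

    task "web" {
      driver = "docker"

      config {
        image = "nginx:1.17-alpine"

        port_map {
          http = 80
        }
      }

      resources {
        network {
          port "http" {}
        }
      }

      service {
        name = "checkwatcher-repro"
        port = "http"

        check {
          type     = "http"
          path     = "/this-will-404"  # intentionally failing health check
          interval = "5s"
          timeout  = "2s"

          check_restart {
            limit = 3
            grace = "10s"
          }
        }
      }
    }
  }
}
```

The `check_restart` stanza is what turns the perpetually failing check into repeated check-driven restarts.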
A `count` of 7 is overkill but it helps speed things up. The job has an intentionally faulty healthcheck. Run the job against a single node using the 0.9.3 agent and then walk away for 5 minutes. The check watcher should be deadlocked, and attempting to stop the job or drain the node will fail/block. The same job file with this PR's changes should be able to restart to its heart's content.
Additionally, the deregister function no longer removes consul tasks twice. It used to do this when canary tasks had a different generated consul ID, but that is no longer the case, and the second removal is unnecessary; it only generates errors in the nomad and consul logs. While this duplicate removal does not itself cause the deadlock, it made the deadlock appear faster by inserting twice as many messages into the channel.
A final addition: alloc restarts due to check restarts will no longer
wait through the shutdown delay. Unhealthy allocations should probably
just be restarted right away, and the use of the shutdown delay in the
prekill hook is probably an unintentional change due to the giant 0.9
refactor and alloc lifecycle changes.
Tests have been adjusted to account for this where I could find them.
I don't really write golang so I apologize if there are better or more
idiomatic approaches to this sort of change. Same goes for tests
because I am definitely lost when it comes to the testing apparatus on
a giant golang project like this.