Fix server shutdown not waiting for worker run completion #19560
Conversation
Force-pushed from 673f879 to c82d527
Hi @jrasell, sorry to bother you, but may I check how I can trigger the CI to re-run the tests?
nomad/server.go
Outdated
@@ -315,9 +320,11 @@ type Server struct {
// NewServer is used to construct a new Nomad server from the
// configuration, potentially returning an error
func NewServer(config *Config, consulCatalog consul.CatalogAPI, consulConfigFunc consul.ConfigAPIFunc, consulACLs consul.ACLsAPI) (*Server, error) {
	shutdownCtx, shutdownCancel := context.WithCancel(context.Background())
NIT: initiateShutdown would probably be a little bit clearer w.r.t. the intent.
I noticed a couple of other places where this function is referred to as shutdownCancel (or some variation of it), so I thought it might be the convention for this codebase.
No strong opinion on this - I'll probably leave it as it is for now to stay closer to the precedent, but I'm happy to change it if maintainers prefer otherwise.
nomad/server.go
Outdated
s.workerLock.Lock()
defer s.workerLock.Unlock()
s.stopOldWorkers(s.workers)
s.workerShutdownGroup.Wait()
There are instances in which the worker shutdown can take a considerable amount of time. My reading of one such instance is a blocking RPC call that takes a long time to process. Should we add a timeout around the Wait, with an explicit error message on failure, so that we do not end up blocking the shutdown of the server?
This is a relatively significant departure from the existing behavior in the common case, hence the question. It might be nice to allow the user to specify how long to block, but that contract would become a slippery slope, since it is difficult to enforce a total shutdown duration, especially given that some of the calls may block for undefined amounts of time.
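For what it's worth, a common way to bound this kind of wait is to run the WaitGroup's Wait in a goroutine and select against a timer. The sketch below is a generic illustration of that pattern, not the actual Nomad code; waitWithTimeout and the timeout value are made up for the example.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// waitWithTimeout waits for wg to finish, but gives up after the supplied
// timeout so a stuck worker cannot block shutdown forever. It returns true
// if all workers finished in time.
func waitWithTimeout(wg *sync.WaitGroup, timeout time.Duration) bool {
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()

	select {
	case <-done:
		return true
	case <-time.After(timeout):
		return false
	}
}

func main() {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		time.Sleep(50 * time.Millisecond) // simulated worker cleanup
	}()

	if !waitWithTimeout(&wg, time.Second) {
		fmt.Println("timed out waiting for workers; continuing shutdown")
		return
	}
	fmt.Println("all workers stopped cleanly")
}
```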
return &GenericNotifier{
	publishCh:     make(chan interface{}, 1),
	subscribeCh:   make(chan chan interface{}, 1),
	unsubscribeCh: make(chan chan interface{}, 1),
	ctx:           ctx,
This may be a question for the general Nomad maintainers, but my prior is that the Notifier is designed to be context-free and focused on channels. The shutdown channel being passed into Run was in line with that design decision, and I think we could continue with it here. Is there a deeper reason for passing the context in that I'm missing?
The problem I ran into with that was that I needed a way to coordinate between WaitForChange and Run.
In particular, the race condition that I encountered was:
- WaitForChange tries to write to the unsubscribeCh and blocks because it is full
- In Run, shutdownCh was closed and the function returns
- Since Run has terminated, nothing reads from unsubscribeCh, which causes WaitForChange to block indefinitely
So, I think we need a way for WaitForChange to unblock when it detects that the notifier has shut down. I suppose we could also represent this with another shutdownCh for the notifier itself, but I'm not sure which one is more idiomatic. Hopefully maintainers can chime in on which approach they prefer (or if there is a better way to do this!)
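To make the hang concrete, here is a minimal standalone sketch of the failure mode. It is not the real GenericNotifier: the run loop is reduced to a single select and the channel types are simplified to struct{} channels, but it shows how a late unsubscribe blocks forever once the consumer loop has exited.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Buffered with capacity 1, mirroring the notifier's channel setup.
	unsubscribeCh := make(chan chan struct{}, 1)
	shutdownCh := make(chan struct{})

	// This goroutine stands in for the notifier's run loop: it drains
	// unsubscribeCh until it is told to shut down.
	go func() {
		for {
			select {
			case <-unsubscribeCh:
				// normally the subscription would be removed here
			case <-shutdownCh:
				return // once this fires, nothing drains unsubscribeCh anymore
			}
		}
	}()

	// Shut the consumer down and give it a moment to exit.
	close(shutdownCh)
	time.Sleep(10 * time.Millisecond)

	// The first late unsubscribe still "succeeds" because it lands in the buffer.
	unsubscribeCh <- make(chan struct{})

	// The second one blocks: the buffer is full and the consumer is gone.
	// The timeout exists only to make the hang visible in this demo.
	select {
	case unsubscribeCh <- make(chan struct{}):
		fmt.Println("unexpected: unsubscribe accepted")
	case <-time.After(500 * time.Millisecond):
		fmt.Println("unsubscribe blocked: run loop already stopped")
	}
}
```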
I'm not sure that I understand why using ctx here resolves the race which you describe. Doesn't it still occur based on subscribeCh in the same way, except on a different channel:
- Run shuts down because shutdownCtx was closed
- WaitForChange tries to write to subscribeCh and blocks because it is full (line 90)
- We block on line 90
I think the scenario in which the context does help, as you imply, is one in which we may miss the notification on the shutdownCh and thus hang awaiting that event. That is the scenario of:
1. WaitForChange times out
2. shutdownCh is signalled
3. Run quits
4. WaitForChange is entered again
5. We block as you imply
In this scenario, step 4 misses the notification in step 2 because it wasn't actually awaiting a signal (my limited understanding of channels is that they do not buffer events but rather notify the subscribers at the time of emission). This is where the context allows you to carry a signal in a more persistent way, which ensures that this other race cannot happen.
Ah yeah, I missed the other write to subscribeCh on L90. I'll push a fix.

we may miss the notification on the shutdownCh and thus hang awaiting that event

shutdownCh is closed when shutdown starts (rather than signaled), so I don't think that's possible.
The race I was thinking of is more along the lines of the producer not realizing that the consumer has stopped, and thus blocking indefinitely waiting for the consumer to do work.
I think in the scenario you described, the sequence of events I'm worried about is:
- WaitForChange blocks waiting for the timeout
- shutdownCh is closed
- Run quits
- WaitForChange hits the timeout and unblocks, then as part of the deferred function it tries to write to unsubscribeCh and gets blocked because it is full
- Because Run has quit, nothing reads from unsubscribeCh and so WaitForChange blocks forever
So, the shared context acts as a way for the producer to detect that the consumer has (or is soon going to) shut down and to not block waiting for it to do work.
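For concreteness, here is a trimmed-down sketch of that idea. It is not the actual GenericNotifier implementation; waitForChange, the simplified struct{} channel types, and the error message are illustrative, but it shows how a shared context lets the subscribe and unsubscribe writes bail out once the run loop has stopped.

```go
package notifier

import (
	"context"
	"errors"
	"time"
)

// waitForChange subscribes for a single notification and waits for it, a
// timeout, or notifier shutdown. The shared ctx is cancelled when the run
// loop stops, so no channel operation can block on a consumer that is gone.
func waitForChange(ctx context.Context, subscribeCh, unsubscribeCh chan chan struct{}, timeout time.Duration) error {
	waitCh := make(chan struct{}, 1)

	// Register interest, unless the notifier has already shut down.
	select {
	case subscribeCh <- waitCh:
	case <-ctx.Done():
		return ctx.Err()
	}

	// Always deregister on the way out, but never block on a stopped run loop.
	defer func() {
		select {
		case unsubscribeCh <- waitCh:
		case <-ctx.Done():
		}
	}()

	select {
	case <-waitCh:
		return nil // a change was published
	case <-time.After(timeout):
		return errors.New("timed out waiting for change")
	case <-ctx.Done():
		return ctx.Err()
	}
}
```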
Ah yeah, I missed the other write to subscribeCh on L90. I'll push a fix.
👍
shutdownCh is closed when shutdown starts (rather than signaled). So, I don't think that's possible.
Right, my bad. Makes sense :)
(my limited understanding of channels is such that they do not buffer events but rather notify the subscribers at the time of emission). This is where the context allows you to carry a signal in a more persistent way which ensures that this other race cannot happen
Yup +1. I think this is what I had in mind, but I failed to describe it (you've captured it better). Agreed on the problem, up to the Hashi team to decide which way they would rather go :-)
This may be a question for the general Nomad maintainers, but my prior is that the Notifier is designed to be context-free and focus on channels
FWIW the channel-heavy pattern is an unfortunate result of Nomad predating the context package by about a year. It would be nice to go back and clean things up but that was never a priority, and now it's channels all the way down.
Force-pushed from c82d527 to 8df3b05
The shutdown channel is used to signal that worker has stopped.
There was a race condition in the GenericNotifier between the Run and WaitForChange functions, where WaitForChange blocks trying to write to a full unsubscribeCh, but the Run function never reads from the unsubscribeCh as it has already stopped. This commit fixes it by unblocking if the notifier has been stopped.
Force-pushed from 8df3b05 to 4dca4ce
@@ -364,7 +350,8 @@ func TestWorker_runBackoff(t *testing.T) {
	workerCtx, workerCancel := context.WithCancel(srv.shutdownCtx)
	defer workerCancel()

	w := NewTestWorker(workerCtx, srv)
The history of this function is a little confusing to me.
It was first introduced in #11593, but it wasn't actually used; newWorker was the constructor used in all the tests. Later, this test was added in #15523, and it uses NewTestWorker instead of newWorker, unlike the other tests.
My suspicion is that the usage of NewTestWorker over newWorker was accidental rather than intentional (but I could very well be mistaken). So, rather than update NewTestWorker to work with my changes, I opted to switch the test to use newWorker like the other tests, and removed NewTestWorker entirely.
Hopefully a maintainer with more context can confirm whether my suspicion is correct (or tell me if I'm just completely wrong!)
Leaving for maintainers to review. Please assume this is OK by me
LGTM; excellent work @marvinchin!
The only thing we need is a bugfix changelog entry; you can run make cl to start a little wizard that creates one.
Thanks for the quick review! Changelog added.
Fix server shutdown not waiting for worker run completion (#19560)

* Move group into a separate helper module for reuse
* Add shutdownCh to worker
  The shutdown channel is used to signal that worker has stopped.
* Make server shutdown block on workers' shutdownCh
* Fix waiting for eval broker state change blocking indefinitely
  There was a race condition in the GenericNotifier between the Run and WaitForChange functions, where WaitForChange blocks trying to write to a full unsubscribeCh, but the Run function never reads from the unsubscribeCh as it has already stopped. This commit fixes it by unblocking if the notifier has been stopped.
* Bound the amount of time server shutdown waits on worker completion
* Fix lostcancel linter error
* Fix worker test using unexpected worker constructor
* Add changelog

Co-authored-by: Marvin Chin <[email protected]>
Fixes #19556.
I'm not too sure how/where to add tests for this - please feel free to let me know and I'm happy to add the tests! For now, I've manually tested it following the repro listed in the issue and it seems to work as intended.