
Fix server shutdown not waiting for worker run completion #19560

Conversation

marvinchin (Contributor):

Fixes #19556.

I'm not too sure how/where to add tests for this - please feel free to let me know and I'm happy to add the tests! For now, I've manually tested it following the repro listed in the issue and it seems to work as intended.

hashicorp-cla commented Dec 27, 2023:

CLA assistant check
All committers have signed the CLA.

marvinchin (Contributor, Author):

Hi @jrasell, sorry to bother but may I check how I can trigger the CI to re-run the tests?

nomad/server.go Outdated
@@ -315,9 +320,11 @@ type Server struct {
// NewServer is used to construct a new Nomad server from the
// configuration, potentially returning an error
func NewServer(config *Config, consulCatalog consul.CatalogAPI, consulConfigFunc consul.ConfigAPIFunc, consulACLs consul.ACLsAPI) (*Server, error) {
shutdownCtx, shutdownCancel := context.WithCancel(context.Background())
Contributor:
NIT: initiateShutdown would probably be a little bit clearer w.r.t. the intent

Contributor Author:

I noticed a couple of other places where this function is referred to as shutdownCancel (or some variation of it), so I thought it might be a convention in this codebase.

No strong opinion on this; I'll leave it as is for now to stay closer to the precedent, but I'm happy to change it if maintainers prefer otherwise.

nomad/server.go Outdated
s.workerLock.Lock()
defer s.workerLock.Unlock()
s.stopOldWorkers(s.workers)
s.workerShutdownGroup.Wait()
Contributor:

There are instances in which worker shutdown can take a considerable amount of time. My reading of one such instance is a blocking RPC call which takes a long time to process. Should we add a timeout around the Wait, with an explicit error message on failure, so that we do not end up blocking the shutdown of the server?

This is a relatively significant departure from the existing behavior in the common case, hence the question. It might be nice to allow the user to specify how long to block, but that contract could become a slippery slope, since it is difficult to enforce a total shutdown duration, especially given that some of the calls may block for undefined amounts of time.

return &GenericNotifier{
publishCh: make(chan interface{}, 1),
subscribeCh: make(chan chan interface{}, 1),
unsubscribeCh: make(chan chan interface{}, 1),
ctx: ctx,
Contributor:

This may be a question for the general Nomad maintainers, but my prior is that the Notifier is designed to be context-free and focus on channels. The shutdown channel being passed into Run was inline with that design decision and I think we could continue with it here. Is there a deeper reason for passing the context in that I'm missing?

Contributor Author:

The problem I ran into with that was that I needed a way to coordinate between WaitForChange and Run.

In particular, the race condition that I encountered was:

  • WaitForChange tries to write to the unsubscribeCh and blocks because it is full
  • Run returns because shutdownCh was closed
  • Since Run has terminated, nothing reads from unsubscribeCh, causing WaitForChange to block indefinitely

So, I think we need a way for WaitForChange to unblock when it detects that the notifier has shut down. I suppose we could also represent this with another shutdownCh for the notifier itself, but I'm not sure which is more idiomatic. Hopefully maintainers can chime in on which approach they prefer (or suggest a better way to do this!)

stswidwinski (Contributor) commented Dec 28, 2023:

I'm not sure that I understand why using ctx here resolves the race which you describe. Doesn't it still occur based on subscribeCh in the same way, except on a different channel:

  1. Run shuts down because shutdownCtx was closed
  2. WaitForChange tries to write to subscribeCh and blocks because it is full (line 90)
  3. We block on line 90

I think the scenario in which the context does help, as you imply, is one in which we may miss the notification on the shutdownCh and thus hang awaiting that event. That is the scenario of:

  1. WaitForChange times out
  2. shutdownCh is signalled
  3. Run quits
  4. WaitForChange is entered again
  5. We block as you imply

In this scenario, step 4 misses the notification from step 2 because it wasn't actually awaiting a signal (my limited understanding of channels is that they do not buffer events but rather notify subscribers at the time of emission). This is where the context allows you to carry the signal in a more persistent way, which ensures that this other race cannot happen.

Contributor Author:

Ah yeah, I missed the other write to subscribeCh on L90. I'll push a fix.

we may miss the notification on the shutdownCh and thus hang awaiting that event

shutdownCh is closed when shutdown starts (rather than signaled). So, I don't think that's possible.

The race I was thinking of is more along the lines of the producer not realizing that the consumer has stopped, and thus blocking indefinitely waiting for the consumer to do work.

I think in the scenario you described, the sequence of events I'm worried about is:

  1. WaitForChange blocks waiting for timeout
  2. shutdownCh is closed
  3. Run quits
  4. WaitForChange hits the timeout and unblocks, then as part of the deferred function it tries to write to unsubscribeCh and blocks because it is full
  5. Because Run has quit, nothing reads from unsubscribeCh, so WaitForChange blocks forever

So, the shared context acts as a way for the producer to detect that the consumer has (or is soon going to) shut down and to not block waiting for it to do work.

Contributor:

Ah yeah, I missed the other write to subscribeCh on L90. I'll push a fix.

👍

shutdownCh is closed when shutdown starts (rather than signaled). So, I don't think that's possible.

Right, my bad. Makes sense :)

(my limited understanding of channels is such that they do not buffer events but rather notify the subscribers at the time of emission). This is where the context allows you to carry a signal in a more persistent way which ensures that this other race cannot happen

Yup +1. I think this is what I had in mind, but I failed to describe it (you've captured it better). Agreed on the problem, up to the Hashi team to decide which way they would rather go :-)

Member:

This may be a question for the general Nomad maintainers, but my prior is that the Notifier is designed to be context-free and focus on channels

FWIW the channel-heavy pattern is an unfortunate result of Nomad predating the context package by about a year. It would be nice to go back and clean things up but that was never a priority, and now it's channels all the way down.

The shutdown channel is used to signal that worker has stopped.
There was a race condition in the GenericNotifier between the
Run and WaitForChange functions, where WaitForChange blocks
trying to write to a full unsubscribeCh, but the Run function never
reads from the unsubscribeCh as it has already stopped.

This commit fixes it by unblocking if the notifier has been stopped.
@marvinchin marvinchin force-pushed the fix-server-shutdown-not-waiting-for-worker-run-completion branch from 8df3b05 to 4dca4ce Compare January 3, 2024 07:39
@marvinchin marvinchin marked this pull request as ready for review January 3, 2024 08:20
@@ -364,7 +350,8 @@ func TestWorker_runBackoff(t *testing.T) {
workerCtx, workerCancel := context.WithCancel(srv.shutdownCtx)
defer workerCancel()

w := NewTestWorker(workerCtx, srv)
Contributor Author:

The history of this function is a little confusing to me.

It was first introduced in #11593, but it wasn't actually used. Instead, newWorker was the constructor used in all the tests. Later, this test was added in #15523, which uses NewTestWorker instead of newWorker, unlike the other tests.

My suspicion is that the use of NewTestWorker over newWorker was accidental rather than intentional (though I could very well be mistaken). So, rather than update NewTestWorker to work with my changes, I opted to switch the test to use newWorker like the other tests, and removed NewTestWorker entirely.

Hopefully a maintainer with more context can validate if my suspicions are valid (or if I'm just completely wrong!)

stswidwinski (Contributor) left a comment:

Leaving for maintainers to review. Please assume this is OK by me

@shoenig shoenig self-requested a review January 4, 2024 14:59
shoenig (Member) left a comment:

LGTM; excellent work @marvinchin!

The only thing we need is a bugfix changelog entry; you can run make cl to start a little wizard that creates one.

marvinchin (Contributor, Author):

Thanks for the quick review! Changelog added.

@shoenig shoenig merged commit be8575a into hashicorp:main Jan 5, 2024
17 of 18 checks passed
@shoenig shoenig added backport/1.5.x backport to 1.5.x release line backport/1.6.x backport to 1.6.x release line backport/1.7.x backport to 1.7.x release line labels Jan 5, 2024
shoenig pushed a commit that referenced this pull request Jan 5, 2024
Fix server shutdown not waiting for worker run completion (#19560)

* Move group into a separate helper module for reuse

* Add shutdownCh to worker

The shutdown channel is used to signal that worker has stopped.

* Make server shutdown block on workers' shutdownCh

* Fix waiting for eval broker state change blocking indefinitely

There was a race condition in the GenericNotifier between the
Run and WaitForChange functions, where WaitForChange blocks
trying to write to a full unsubscribeCh, but the Run function never
reads from the unsubscribeCh as it has already stopped.

This commit fixes it by unblocking if the notifier has been stopped.

* Bound the amount of time server shutdown waits on worker completion

* Fix lostcancel linter error

* Fix worker test using unexpected worker constructor

* Add changelog

---------

Co-authored-by: Marvin Chin <[email protected]>
Successfully merging this pull request may close these issues.

Server does not wait for workers to submit nacks for dequeued evaluations before exiting