
Fix server shutdown not waiting for worker run completion #19560

Conversation

marvinchin (Contributor):

Fixes #19556.

I'm not too sure how/where to add tests for this - please feel free to let me know and I'm happy to add the tests! For now, I've manually tested it following the repro listed in the issue and it seems to work as intended.

hashicorp-cla commented Dec 27, 2023:

CLA assistant check
All committers have signed the CLA.

marvinchin (Contributor, Author):

Hi @jrasell, sorry to bother but may I check how I can trigger the CI to re-run the tests?

nomad/server.go Outdated
@@ -315,9 +320,11 @@ type Server struct {
// NewServer is used to construct a new Nomad server from the
// configuration, potentially returning an error
func NewServer(config *Config, consulCatalog consul.CatalogAPI, consulConfigFunc consul.ConfigAPIFunc, consulACLs consul.ACLsAPI) (*Server, error) {
shutdownCtx, shutdownCancel := context.WithCancel(context.Background())
Contributor:
NIT: initiateShutdown would probably be a little bit clearer w.r.t. the intent

Contributor Author:

I noticed a couple of other places where this function is referred to as shutdownCancel (or some variation of it), so I thought it might be a convention in this codebase.

No strong opinion on this; I'll leave it as is for now to stay closer to the precedent, but I'm happy to change it if maintainers prefer otherwise.

nomad/server.go Outdated
s.workerLock.Lock()
defer s.workerLock.Unlock()
s.stopOldWorkers(s.workers)
s.workerShutdownGroup.Wait()
Contributor:

There are instances in which worker shutdown can take a considerable amount of time. My reading of one such instance is a blocking RPC call which takes a long time to process. Should we add a timeout around the Wait, with an explicit error message on failure, so that we do not end up blocking the shutdown of the server?

This is a relatively significant departure from the existing behavior in the common case, hence the question. It might be nice to allow the user to specify how long to block, but that contract could become a slippery slope, since it is difficult to enforce a total shutdown duration, especially given that some of the calls may block for undefined amounts of time.

return &GenericNotifier{
publishCh: make(chan interface{}, 1),
subscribeCh: make(chan chan interface{}, 1),
unsubscribeCh: make(chan chan interface{}, 1),
ctx: ctx,
Contributor:

This may be a question for the general Nomad maintainers, but my prior is that the Notifier is designed to be context-free and focus on channels. The shutdown channel being passed into Run was inline with that design decision and I think we could continue with it here. Is there a deeper reason for passing the context in that I'm missing?

Contributor Author:

The problem I ran into with that was that I needed a way to coordinate between WaitForChange and Run.

In particular, the race condition that I encountered was:

  • WaitForChange tries to write to the unsubscribeCh and blocks because it is full
  • Run returns because shutdownCh was closed
  • Since Run has terminated, nothing reads from unsubscribeCh, causing WaitForChange to block indefinitely

So, I think we need a way for WaitForChange to unblock when it detects that the notifier has shut down. I suppose we could also represent this with another shutdownCh for the notifier itself, but I'm not sure which is more idiomatic. Hopefully maintainers can chime in on which approach they prefer (or suggest a better way to do this!)

stswidwinski (Contributor) commented Dec 28, 2023:

I'm not sure that I understand why using ctx here resolves the race which you describe. Doesn't it still occur based on subscribeCh in the same way, except on a different channel:

  1. Run shuts down because shutdownCtx was closed
  2. WaitForChange tries to write to subscribeCh and blocks because it is full (line 90)
  3. We block on line 90

I think the scenario in which the context does help, as you imply, is one in which we may miss the notification on the shutdownCh and thus hang awaiting that event. That is the scenario of:

  1. WaitForChange times out
  2. shutdownCh is signalled
  3. Run quits
  4. WaitForChange is entered again
  5. We block as you imply

In this scenario, step 4 misses the notification from step 2 because it wasn't actually awaiting a signal (my limited understanding of channels is that they do not buffer events but rather notify subscribers at the time of emission). This is where the context allows you to carry the signal in a more persistent way, which ensures that this other race cannot happen.

Contributor Author:

Ah yeah, I missed the other write to subscribeCh on L90. I'll push a fix.

we may miss the notification on the shutdownCh and thus hang awaiting that event

shutdownCh is closed when shutdown starts (rather than signaled). So, I don't think that's possible.

The race I was thinking of is more along the lines of the producer not realizing that the consumer has stopped, and thus blocking indefinitely waiting for the consumer to do work.

I think in the scenario you described, the sequence of events I'm worried about is:

  1. WaitForChange blocks waiting for timeout
  2. shutdownCh is closed
  3. Run quits
  4. WaitForChange hits the timeout and unblocks, then as part of the deferred function it tries to write to unsubscribeCh and blocks because it is full
  5. Because Run has quit, nothing reads from unsubscribeCh, so WaitForChange blocks forever

So, the shared context acts as a way for the producer to detect that the consumer has (or is soon going to) shut down and to not block waiting for it to do work.

Contributor:

Ah yeah, I missed the other write to subscribeCh on L90. I'll push a fix.

👍

shutdownCh is closed when shutdown starts (rather than signaled). So, I don't think that's possible.

Right, my bad. Makes sense :)

(my limited understanding of channels is such that they do not buffer events but rather notify the subscribers at the time of emission). This is where the context allows you to carry a signal in a more persistent way which ensures that this other race cannot happen

Yup +1. I think this is what I had in mind, but I failed to describe it (you've captured it better). Agreed on the problem, up to the Hashi team to decide which way they would rather go :-)

Member:

This may be a question for the general Nomad maintainers, but my prior is that the Notifier is designed to be context-free and focus on channels

FWIW the channel-heavy pattern is an unfortunate result of Nomad predating the context package by about a year. It would be nice to go back and clean things up but that was never a priority, and now it's channels all the way down.

The shutdown channel is used to signal that worker has stopped.
There was a race condition in the GenericNotifier between the
Run and WaitForChange functions, where WaitForChange blocks
trying to write to a full unsubscribeCh, but the Run function never
reads from the unsubscribeCh as it has already stopped.

This commit fixes it by unblocking if the notifier has been stopped.
@marvinchin marvinchin force-pushed the fix-server-shutdown-not-waiting-for-worker-run-completion branch from 8df3b05 to 4dca4ce Compare January 3, 2024 07:39
@marvinchin marvinchin marked this pull request as ready for review January 3, 2024 08:20
@@ -364,7 +350,8 @@ func TestWorker_runBackoff(t *testing.T) {
workerCtx, workerCancel := context.WithCancel(srv.shutdownCtx)
defer workerCancel()

w := NewTestWorker(workerCtx, srv)
Contributor Author:

The history of this function is a little confusing to me.

It was first introduced in #11593, but it wasn't actually used. Instead, newWorker was the constructor used in all the tests. Later, this test was added in #15523, which uses NewTestWorker instead of newWorker, unlike the other tests.

My suspicion is that the use of NewTestWorker over newWorker was accidental rather than intentional (though I could very well be mistaken). So, rather than update NewTestWorker to work with my changes, I opted to switch the test to use newWorker like the other tests, and removed NewTestWorker entirely.

Hopefully a maintainer with more context can validate if my suspicions are valid (or if I'm just completely wrong!)

stswidwinski (Contributor) left a comment:

Leaving for maintainers to review. Please assume this is OK by me

@shoenig shoenig self-requested a review January 4, 2024 14:59
shoenig (Member) left a comment:

LGTM; excellent work @marvinchin!

The only thing we need is a bugfix changelog entry; you can run make cl to start a little wizard that creates one.

marvinchin (Contributor, Author):

Thanks for the quick review! Changelog added.

@shoenig shoenig merged commit be8575a into hashicorp:main Jan 5, 2024
17 of 18 checks passed
@shoenig shoenig added backport/1.5.x backport to 1.5.x release line backport/1.6.x backport to 1.6.x release line backport/1.7.x backport to 1.7.x release line labels Jan 5, 2024
shoenig pushed a commit that referenced this pull request Jan 5, 2024
Fix server shutdown not waiting for worker run completion (#19560)

* Move group into a separate helper module for reuse

* Add shutdownCh to worker

The shutdown channel is used to signal that worker has stopped.

* Make server shutdown block on workers' shutdownCh

* Fix waiting for eval broker state change blocking indefinitely

There was a race condition in the GenericNotifier between the
Run and WaitForChange functions, where WaitForChange blocks
trying to write to a full unsubscribeCh, but the Run function never
reads from the unsubscribeCh as it has already stopped.

This commit fixes it by unblocking if the notifier has been stopped.

* Bound the amount of time server shutdown waits on worker completion

* Fix lostcancel linter error

* Fix worker test using unexpected worker constructor

* Add changelog

---------

Co-authored-by: Marvin Chin <[email protected]>
Successfully merging this pull request may close these issues.

Server does not wait for workers to submit nacks for dequeued evaluations before exiting