
Manager does not wait for Runnables to stop #350

Closed
grantr opened this issue Mar 5, 2019 · 15 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@grantr
Contributor

grantr commented Mar 5, 2019

Currently, Manager.Start returns immediately when the stop channel is closed, without waiting for Runnables to stop:

select {
case <-stop:
    // We are done
    return nil

This makes writing a Runnable with blocking stop logic difficult or impossible. For example, the Manager's internal serveMetrics is basically a Runnable (a blocking method that takes a stop channel). When the stop channel is closed, serveMetrics tries to shut down its http server:

select {
case <-stop:
    if err := server.Shutdown(context.Background()); err != nil {
        cm.errChan <- err
    }

But in normal usage the process exits immediately when Manager.Start returns, so the Shutdown call is unlikely to complete.

Adding sync.WaitGroup accounting inside the Runnable wrapper goroutines would allow adding a Wait() method to the Manager interface that blocks until all Runnables have returned (or a WaitWithTimeout(time.Duration) or Shutdown(context.Context) variant for a termination timeout), without changing either the Runnable contract or the Manager contract:

if err := mgr.Start(signals.SetupSignalHandler()); err != nil {
    log.Error(err, "unable to run the manager")
    os.Exit(1)
}
mgr.Wait()
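A rough sketch of what that WaitGroup accounting could look like, using a toy manager type rather than the real controllerManager (toyManager, add, and the errChan field here are illustrative assumptions, not controller-runtime code):

package manager // illustrative package name

import "sync"

// Runnable mirrors the controller-runtime contract: a blocking Start that
// returns once the stop channel is closed.
type Runnable interface {
    Start(stop <-chan struct{}) error
}

type toyManager struct {
    stop    chan struct{}
    errChan chan error // should be buffered or drained; see the errChan comment further down
    wg      sync.WaitGroup
}

// add starts a Runnable in its own goroutine and counts it in the WaitGroup.
func (m *toyManager) add(r Runnable) {
    m.wg.Add(1) // Add before spawning so Wait cannot miss this goroutine.
    go func() {
        defer m.wg.Done()
        if err := r.Start(m.stop); err != nil {
            m.errChan <- err
        }
    }()
}

// Wait blocks until every Runnable goroutine has returned.
func (m *toyManager) Wait() {
    m.wg.Wait()
}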
@DirectXMan12
Contributor

/kind bug
/priority important-soon

good catch

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Mar 6, 2019
@rajathagasthya
Contributor

I'll take this!

@DirectXMan12
Contributor

awesome!

@grantr
Contributor Author

grantr commented Mar 6, 2019

While investigating this I discovered another issue: the Manager's errChan is unbuffered, so only one Runnable goroutine (the metrics server included) can send its error and terminate correctly; the rest will block on the send indefinitely. An easy fix is to make errChan buffered, but that's not a great long-term solution.

Anyone working on solutions in this area might want to keep #330 in mind too.
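For illustration, the quick fix mentioned above could look something like this (a sketch only; runnables, stop, and the chosen capacity are assumptions, not the actual manager code):

// Buffer errChan so every Runnable can report one error without blocking,
// even after Start has returned and nothing is draining the channel.
errChan := make(chan error, len(runnables))

for _, r := range runnables {
    r := r // capture the loop variable for the goroutine below
    go func() {
        if err := r.Start(stop); err != nil {
            errChan <- err // cannot block thanks to the buffer
        }
    }()
}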

grantr added a commit to grantr/eventing that referenced this issue Mar 6, 2019
Manager doesn't actually wait for Runnables to stop, so we need to add a
WaitGroup to wait for the HTTP Server Shutdown to complete. This is
hopefully temporary until
kubernetes-sigs/controller-runtime#350 is
fixed.
grantr added a commit to grantr/eventing that referenced this issue Mar 20, 2019
Manager doesn't actually wait for Runnables to stop, so we need to add a
WaitGroup to wait for the HTTP Server Shutdown to complete. This is
hopefully temporary until
kubernetes-sigs/controller-runtime#350 is
fixed.
knative-prow-robot pushed a commit to knative/eventing that referenced this issue Mar 25, 2019
* Add a prometheus server to broker ingress

* correct method comment

* Add a metrics port to the broker's ingress Service

Ultimately access to this port might need different permissions than the
default (ingress) port.

* Always shut down runnableServer

If ShutdownTimeout is not positive, call Shutdown anyway without a
timeout.

* Add port values to the ingress container spec

These are informational, and more useful now that two ports are
exposed.

* Instrument message count and dispatch time

Also register the exporter so it can start serving metrics.

* Simplify runnableServer shutdown

No need to select since we're just waiting for the channel to be closed.

* Use a WaitGroup to stop runnableServers

Manager doesn't actually wait for Runnables to stop, so we need to add a
WaitGroup to wait for the HTTP Server Shutdown to complete. This is
hopefully temporary until
kubernetes-sigs/controller-runtime#350 is
fixed.

* Hook up internal controller-runtime logger

* Include GCP auth library

Running outside GCP seems to not work without this.

* Translate commented code into a directive

* Tag measurements with the broker name

The BROKER environment variable may contain the broker name. If
non-empty, measurements will be tagged with the given string.

* Make BROKER env var required

There's no use case for running this without an existing broker context,
so just require the BROKER name to be present.

* Add a separate shutdownTimeout var

Currently the same as writeTimeout, but allows for having separate write
and shutdown timeouts later.

* Add a shutdown timer for the waitgroup

wg.Wait() can block indefinitely. Adding a timer here ensures the
process shutdown time is bounded.

* Don't redeclare brokerName

We want to use the package var here.

* Move wg.Add outside the goroutine

This eliminates a case in which wg.Done could be called before wg.Add
depending on how the goroutine is scheduled.

* Use Fatalf instead of Fatal

* Update Gopkg.lock

* Get broker name from env var BROKER

It's now a struct field instead of a package var.

* Use ok instead of success

For consistency with HTTP error codes.

* Test RunnableServer

Tests ensure the server starts, responds to requests, and stops or
shuts down as requested.

* Set brokerName in handler struct

This was accidentally removed in a merge.

* Format and expand comments on shutdown behavior

* Remove unnecessary comments

* Update copyright year

* Add test to verify correct usage of context

Verifies that the context is correctly timing out shutdown.

* Remove logging from RunnableServer

Return the error instead and let the caller decide whether to log it.

The ShutdownContext test now occasionally flakes; still tracking that
down.

* Attempt to document metrics

Lists metrics exposed by Broker ingress and the port on which they're exposed.
As we add metrics to other components, we can list them in this file.

* Improve stability of shutdown test

Request goroutine needs a bit more time to start sometimes.

Removed Logf calls from goroutines since they sometimes happen
after the test completes, causing a panic.

Removed the error check from the http get for the same reason.

* Rename logf to crlog

Documents source of package more clearly.

* Remove obsolete logger field from RunnableServer
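The workaround those commits describe ("Use a WaitGroup to stop runnableServers" and "Add a shutdown timer for the waitgroup") boils down to the pattern below. This is a sketch under those assumptions; srv, stop, log, and shutdownTimeout are placeholders, not the actual knative/eventing code:

var wg sync.WaitGroup

// Count each HTTP-serving Runnable before its goroutine starts.
wg.Add(1)
go func() {
    defer wg.Done()
    // srv.Start blocks until the stop channel closes and Shutdown completes.
    if err := srv.Start(stop); err != nil {
        log.Error(err, "server exited with error")
    }
}()

// After mgr.Start returns, wait for the servers to drain, but bound the wait
// so process shutdown time stays bounded even if wg.Wait would block forever.
done := make(chan struct{})
go func() {
    wg.Wait()
    close(done)
}()
select {
case <-done:
case <-time.After(shutdownTimeout):
    log.Info("timed out waiting for servers to shut down")
}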
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 4, 2019
@rajathagasthya
Contributor

/remove-lifecycle stale

Sorry, haven't had a chance to look at this. Will do shortly. Might need your guidance a bit @DirectXMan12

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 5, 2019
@DirectXMan12
Contributor

ack, lmk if/when you need my guidance

@rajathagasthya
Contributor

@DirectXMan12 I'm starting to take a look at this. Could you guide me on what a solution could look like?

@DirectXMan12
Contributor

off the top of my head, we'd need some way to check that the runnables each terminate. A wait group, as suggested, seems reasonable. Just rolling it into Start instead of a separate wait method could be reasonable, but I'd imagine that'd have some issues too (like delayed error reporting). Need to weigh the pros and cons there.

@DirectXMan12
Contributor

probably at least log immediately
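Putting those two comments together, rolling the wait into Start could look roughly like this (a sketch that assumes the WaitGroup accounting from the earlier comment; it is not the design that was eventually adopted):

// Inside Manager.Start, after all Runnables have been started:
select {
case err := <-cm.errChan:
    // A Runnable failed while running; return the error immediately.
    return err
case <-stop:
    // Stop requested: wait for every Runnable goroutine to return before
    // Start itself returns. Errors reported during shutdown are logged
    // right away instead of being returned (the delayed-error concern above).
    done := make(chan struct{})
    go func() {
        cm.waitGroup.Wait()
        close(done)
    }()
    for {
        select {
        case err := <-cm.errChan:
            log.Error(err, "error while shutting down a runnable")
        case <-done:
            return nil
        }
    }
}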

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 20, 2019
@seh

seh commented Nov 21, 2019

I’m bitten by this problem with my use of an instrumentation package that I try to shut down cleanly before the process exits. My predicates and reconcilers are making calls into this package, and there’s no way I can see that I can wait for them all to finish before trying to close the connection they’re all using.

It’s one thing to know that the manager observed the “stop” channel becoming ready, when it returns from Start, but that says nothing about the current state of the Runnables—even those that ask for the “stop” channel to be injected.

Using sync.WaitGroup sounds like the way to go. Whether delaying the return from Start should be something you can opt into or out of warrants consideration.

@alexeldeib
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 2, 2019
@vincepri
Member

Closing in favor of #764

/close

@k8s-ci-robot
Contributor

@vincepri: Closing this issue.

In response to this:

Closing in favor of #764

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
