Manager does not wait for Runnables to stop #350
Comments
/kind bug. Good catch.
I'll take this!
Awesome!
While investigating this I discovered another issue with the Manager (#330). Anyone working on solutions in this area might want to keep #330 in mind too.
Manager doesn't actually wait for Runnables to stop, so we need to add a WaitGroup to wait for the HTTP Server Shutdown to complete. This is hopefully temporary until kubernetes-sigs/controller-runtime#350 is fixed.
* Add a prometheus server to broker ingress
* correct method comment
* Add a metrics port to the broker's ingress Service. Ultimately access to this port might need different permissions than the default (ingress) port.
* Always shut down runnableServer. If ShutdownTimeout is not positive, call Shutdown anyway without a timeout.
* Add port values to the ingress container spec. These are informational, and more useful now that two ports are exposed.
* Instrument message count and dispatch time. Also register the exporter so it can start serving metrics.
* Simplify runnableServer shutdown. No need to select since we're just waiting for the channel to be closed.
* Use a WaitGroup to stop runnableServers. Manager doesn't actually wait for Runnables to stop, so we need to add a WaitGroup to wait for the HTTP Server Shutdown to complete. This is hopefully temporary until kubernetes-sigs/controller-runtime#350 is fixed. (See the sketch after this list.)
* Hook up internal controller-runtime logger
* Include GCP auth library. Running outside GCP seems to not work without this.
* Translate commented code into a directive
* Tag measurements with the broker name. The BROKER environment variable may contain the broker name. If non-empty, measurements will be tagged with the given string.
* Make BROKER env var required. There's no use case for running this without an existing broker context, so just require the BROKER name to be present.
* Add a separate shutdownTimeout var. Currently the same as writeTimeout, but allows for having separate write and shutdown timeouts later.
* Add a shutdown timer for the waitgroup. wg.Wait() can block indefinitely. Adding a timer here ensures the process shutdown time is bounded.
* Don't redeclare brokerName. We want to use the package var here.
* Move wg.Add outside the goroutine. This eliminates a case in which wg.Done could be called before wg.Add depending on how the goroutine is scheduled.
* Use Fatalf instead of Fatal
* Update Gopkg.lock
* Get broker name from env var BROKER. It's now a struct field instead of a package var.
* Use ok instead of success. For consistency with HTTP error codes.
* Test RunnableServer. Tests ensure the server starts, responds to requests, and stops or shuts down as requested.
* Set brokerName in handler struct. This was accidentally removed in a merge.
* Format and expand comments on shutdown behavior
* Remove unnecessary comments
* Update copyright year
* Add test to verify correct usage of context. Verifies that the context is correctly timing out shutdown.
* Remove logging from RunnableServer. Return the error instead and let the caller decide whether to log it. The ShutdownContext test now occasionally flakes; still tracking that down.
* Attempt to document metrics. Lists metrics exposed by Broker ingress and the port on which they're exposed. As we add metrics to other components, we can list them in this file.
* Improve stability of shutdown test. Request goroutine needs a bit more time to start sometimes. Removed Logf calls from goroutines since they sometimes happen after the test completes, causing a panic. Removed the error check from the http get for the same reason.
* Rename logf to crlog. Documents source of package more clearly.
* Remove obsolete logger field from RunnableServer
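The shutdown mechanics referenced in several of these commits fit together roughly as follows. This is a minimal sketch, assuming a Runnable that wraps an `http.Server`; the names (`runnableServer`, `shutdownTimeout`, `waitOrTimeout`) are illustrative, not the actual eventing or controller-runtime code.

```go
package ingress

import (
	"context"
	"net/http"
	"sync"
	"time"
)

// runnableServer wraps an http.Server so it can be added to a Manager.
// Start blocks until the stop channel closes, then shuts the server down.
type runnableServer struct {
	srv             *http.Server
	shutdownTimeout time.Duration
	wg              *sync.WaitGroup
}

func (r *runnableServer) Start(stop <-chan struct{}) error {
	// wg.Add happens outside the goroutine so it cannot run after a
	// concurrent wg.Wait ("Move wg.Add outside the goroutine").
	r.wg.Add(1)
	go func() {
		defer r.wg.Done()
		<-stop
		// Always shut down; only apply a timeout if one is configured
		// ("Always shut down runnableServer").
		ctx := context.Background()
		if r.shutdownTimeout > 0 {
			var cancel context.CancelFunc
			ctx, cancel = context.WithTimeout(ctx, r.shutdownTimeout)
			defer cancel()
		}
		_ = r.srv.Shutdown(ctx)
	}()
	if err := r.srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
		return err
	}
	return nil
}

// waitOrTimeout bounds wg.Wait so overall process shutdown time is bounded
// ("Add a shutdown timer for the waitgroup"); the Manager itself does not
// wait for Runnables to stop, which is the subject of this issue.
func waitOrTimeout(wg *sync.WaitGroup, d time.Duration) {
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()
	select {
	case <-done:
	case <-time.After(d):
	}
}
```

Because the Manager returns from `Start` without waiting, the caller would invoke something like `waitOrTimeout(wg, shutdownTimeout)` after `mgr.Start(stop)` returns and before exiting.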
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale Sorry, haven't had a chance to look at this. Will do shortly. Might need your guidance a bit @DirectXMan12
ack, lmk if/when you need my guidance
@DirectXMan12 I'm starting to take a look at this. Could you guide me on what a solution could look like?
Off the top of my head, we'd need some way to check that the runnables each terminate. A wait group, as suggested, seems reasonable. Just rolling it into Start instead of a separate wait method could be reasonable too, but I'd imagine that'd have some issues (like delayed error reporting). Need to weigh the pros and cons there.
Probably at least log immediately.
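To make that trade-off concrete, here is a rough sketch of the two API shapes being weighed. It is purely illustrative; neither interface is actual controller-runtime API.

```go
package manager

// Option A: keep Start's current behavior (return once the stop channel
// closes) and add a separate method that blocks until every Runnable
// started by the Manager has returned.
type ManagerWithWait interface {
	Start(stop <-chan struct{}) error
	Wait()
}

// Option B: roll the waiting into Start itself, so Start does not return
// until all Runnables have stopped. The concern raised above is delayed
// error reporting: an error from one Runnable would not surface as Start's
// return value until the slowest Runnable finishes, so it should at least
// be logged immediately when it occurs.
type ManagerBlockingStart interface {
	Start(stop <-chan struct{}) error
}
```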
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
I'm bitten by this problem with my use of an instrumentation package that I try to shut down cleanly before the process exits. My predicates and reconcilers are making calls into this package, and there's no way I can see that I can wait for them all to finish before trying to close the connection they're all using. It's one thing to know that the manager observed the "stop" channel becoming ready when it returns from Start; it's another to know that the Runnables themselves have actually finished. Using a WaitGroup as suggested here would address that.
/remove-lifecycle stale
Closing in favor of #764. /close
@vincepri: Closing this issue.
Currently `Manager.Start` immediately returns when the stop channel is closed, without waiting for Runnables to stop:

controller-runtime/pkg/manager/internal.go, lines 216 to 219 in 6649bdb

This makes writing a Runnable with blocking stop logic difficult or impossible. For example, the Manager's internal `serveMetrics` is basically a Runnable (a blocking method that takes a stop channel). When the stop channel is closed, `serveMetrics` tries to shut down its HTTP server:

controller-runtime/pkg/manager/internal.go, lines 188 to 192 in 6649bdb

But in normal usage the process exits immediately when `Manager.Start` returns, so the `Shutdown` call is unlikely to complete.

Adding `sync.WaitGroup` accounting inside the Runnable wrapper goroutines would allow adding a `Wait()` method to the `Manager` interface that blocks until all Runnables have returned (or `WaitWithTimeout(time.Duration)` / `Shutdown(context.Context)` for a termination timeout), without changing either the Runnable contract or the Manager contract.
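A minimal sketch of that proposal, assuming the WaitGroup accounting lives in the goroutine wrapper the Manager already uses per Runnable. The struct and method names below are hypothetical, not the real controller-runtime internals.

```go
package manager

import "sync"

// Runnable matches the existing contract: a blocking Start that returns
// when the stop channel is closed.
type Runnable interface {
	Start(stop <-chan struct{}) error
}

type controllerManager struct {
	stop    <-chan struct{}
	errChan chan error
	wg      sync.WaitGroup
}

// start wraps a Runnable in a goroutine as the Manager already does, but
// adds WaitGroup accounting so Wait can observe when the Runnable returns.
func (cm *controllerManager) start(r Runnable) {
	cm.wg.Add(1)
	go func() {
		defer cm.wg.Done()
		if err := r.Start(cm.stop); err != nil {
			cm.errChan <- err
		}
	}()
}

// Wait blocks until every Runnable started via start has returned. A
// WaitWithTimeout(time.Duration) or Shutdown(context.Context) variant could
// bound this for callers that need a termination deadline.
func (cm *controllerManager) Wait() {
	cm.wg.Wait()
}
```

A caller could then run `mgr.Start(stop)` and call `Wait()` before tearing down anything the Runnables still depend on, such as a shared metrics or instrumentation connection.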