
Manager does not wait for Runnables to stop #350

Closed
grantr opened this issue Mar 5, 2019 · 15 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@grantr
Contributor

grantr commented Mar 5, 2019

Currently, Manager.Start returns immediately when the stop channel is closed, without waiting for Runnables to stop:

select {
case <-stop:
    // We are done
    return nil

This makes writing a Runnable with blocking stop logic difficult or impossible. For example, the Manager's internal serveMetrics is basically a Runnable (a blocking method that takes a stop channel). When the stop channel is closed, serveMetrics tries to shut down its http server:

select {
case <-stop:
    if err := server.Shutdown(context.Background()); err != nil {
        cm.errChan <- err
    }

But in normal usage the process exits immediately when Manager.Start returns, so the Shutdown call is unlikely to complete.

Adding sync.WaitGroup accounting inside the Runnable wrapper goroutines would allow adding a Wait() method to the Manager interface that blocks until all Runnables have returned (or a WaitWithTimeout(time.Duration) or Shutdown(context.Context) variant for a termination timeout), without changing either the Runnable contract or the Manager contract:

if err := mgr.Start(signals.SetupSignalHandler()); err != nil {
    log.Error(err, "unable to run the manager")
    os.Exit(1)
}
mgr.Wait()
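A rough sketch of what that WaitGroup accounting could look like, using a toy manager type rather than the real controllerManager (toyManager, add, and the errChan field here are illustrative assumptions, not controller-runtime code):

package manager // illustrative package name

import "sync"

// Runnable mirrors the controller-runtime contract: a blocking Start that
// returns once the stop channel is closed.
type Runnable interface {
    Start(stop <-chan struct{}) error
}

type toyManager struct {
    stop    chan struct{}
    errChan chan error // should be buffered or drained; see the errChan comment further down
    wg      sync.WaitGroup
}

// add starts a Runnable in its own goroutine and counts it in the WaitGroup.
func (m *toyManager) add(r Runnable) {
    m.wg.Add(1) // Add before spawning so Wait cannot miss this goroutine.
    go func() {
        defer m.wg.Done()
        if err := r.Start(m.stop); err != nil {
            m.errChan <- err
        }
    }()
}

// Wait blocks until every Runnable goroutine has returned.
func (m *toyManager) Wait() {
    m.wg.Wait()
}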
@DirectXMan12
Contributor

/kind bug
/priority important-soon

good catch

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Mar 6, 2019
@rajathagasthya
Contributor

I'll take this!

@DirectXMan12
Contributor

awesome!

@grantr
Contributor Author

grantr commented Mar 6, 2019

While investigating this I discovered another issue: the Manager's errChan is unbuffered, so only one Runnable goroutine (the metrics server included) can send its error and terminate correctly; the rest will block on the send indefinitely. An easy fix is to make errChan buffered, but that's not a great long-term solution.

Anyone working on solutions in this area might want to keep #330 in mind too.
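For illustration, the quick fix mentioned above could look something like this (a sketch only; runnables, stop, and the chosen capacity are assumptions, not the actual manager code):

// Buffer errChan so every Runnable can report one error without blocking,
// even after Start has returned and nothing is draining the channel.
errChan := make(chan error, len(runnables))

for _, r := range runnables {
    r := r // capture the loop variable for the goroutine below
    go func() {
        if err := r.Start(stop); err != nil {
            errChan <- err // cannot block thanks to the buffer
        }
    }()
}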

grantr added a commit to grantr/eventing that referenced this issue Mar 6, 2019
Manager doesn't actually wait for Runnables to stop, so we need to add a
WaitGroup to wait for the HTTP Server Shutdown to complete. This is
hopefully temporary until
kubernetes-sigs/controller-runtime#350 is
fixed.
grantr added a commit to grantr/eventing that referenced this issue Mar 20, 2019
Manager doesn't actually wait for Runnables to stop, so we need to add a
WaitGroup to wait for the HTTP Server Shutdown to complete. This is
hopefully temporary until
kubernetes-sigs/controller-runtime#350 is
fixed.
knative-prow-robot pushed a commit to knative/eventing that referenced this issue Mar 25, 2019
* Add a prometheus server to broker ingress

* correct method comment

* Add a metrics port to the broker's ingress Service

Ultimately access to this port might need different permissions than the
default (ingress) port.

* Always shut down runnableServer

If ShutdownTimeout is not positive, call Shutdown anyway without a
timeout.

* Add port values to the ingress container spec

These are informational, and more useful now that two ports are
exposed.

* Instrument message count and dispatch time

Also register the exporter so it can start serving metrics.

* Simplify runnableServer shutdown

No need to select since we're just waiting for the channel to be closed.

* Use a WaitGroup to stop runnableServers

Manager doesn't actually wait for Runnables to stop, so we need to add a
WaitGroup to wait for the HTTP Server Shutdown to complete. This is
hopefully temporary until
kubernetes-sigs/controller-runtime#350 is
fixed.

* Hook up internal controller-runtime logger

* Include GCP auth library

Running outside GCP seems to not work without this.

* Translate commented code into a directive

* Tag measurements with the broker name

The BROKER environment variable may contain the broker name. If
non-empty, measurements will be tagged with the given string.

* Make BROKER env var required

There's no use case for running this without an existing broker context,
so just require the BROKER name to be present.

* Add a separate shutdownTimeout var

Currently the same as writeTimeout, but allows for having separate write
and shutdown timeouts later.

* Add a shutdown timer for the waitgroup

wg.Wait() can block indefinitely. Adding a timer here ensures the
process shutdown time is bounded.

* Don't redeclare brokerName

We want to use the package var here.

* Move wg.Add outside the goroutine

This eliminates a case in which wg.Done could be called before wg.Add
depending on how the goroutine is scheduled.

* Use Fatalf instead of Fatal

* Update Gopkg.lock

* Get broker name from env var BROKER

It's now a struct field instead of a package var.

* Use ok instead of success

For consistency with HTTP error codes.

* Test RunnableServer

Tests ensure the server starts, responds to requests, and stops or
shuts down as requested.

* Set brokerName in handler struct

This was accidentally removed in a merge.

* Format and expand comments on shutdown behavior

* Remove unnecessary comments

* Update copyright year

* Add test to verify correct usage of context

Verifies that the context is correctly timing out shutdown.

* Remove logging from RunnableServer

Return the error instead and let the caller decide whether to log it.

The ShutdownContext test now occasionally flakes; still tracking that
down.

* Attempt to document metrics

Lists metrics exposed by Broker ingress and the port on which they're exposed.
As we add metrics to other components, we can list them in this file.

* Improve stability of shutdown test

Request goroutine needs a bit more time to start sometimes.

Removed Logf calls from goroutines since they sometimes happen
after the test completes, causing a panic.

Removed the error check from the http get for the same reason.

* Rename logf to crlog

Documents source of package more clearly.

* Remove obsolete logger field from RunnableServer
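The workaround those commits describe ("Use a WaitGroup to stop runnableServers" and "Add a shutdown timer for the waitgroup") boils down to the pattern below. This is a sketch under those assumptions; srv, stop, log, and shutdownTimeout are placeholders, not the actual knative/eventing code:

var wg sync.WaitGroup

// Count each HTTP-serving Runnable before its goroutine starts.
wg.Add(1)
go func() {
    defer wg.Done()
    // srv.Start blocks until the stop channel closes and Shutdown completes.
    if err := srv.Start(stop); err != nil {
        log.Error(err, "server exited with error")
    }
}()

// After mgr.Start returns, wait for the servers to drain, but bound the wait
// so process shutdown time stays bounded even if wg.Wait would block forever.
done := make(chan struct{})
go func() {
    wg.Wait()
    close(done)
}()
select {
case <-done:
case <-time.After(shutdownTimeout):
    log.Info("timed out waiting for servers to shut down")
}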
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 4, 2019
@rajathagasthya
Contributor

/remove-lifecycle stale

Sorry, haven't had a chance to look at this. Will do shortly. Might need your guidance a bit @DirectXMan12

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 5, 2019
@DirectXMan12
Contributor

ack, lmk if/when you need my guidance

@rajathagasthya
Contributor

@DirectXMan12 I'm starting to take a look at this. Could you guide me on what a solution could look like?

@DirectXMan12
Contributor

off the top of my head, we'd need some way to check that the runnables each terminate. A wait group, as suggested, seems reasonable. Just rolling it into Start instead of a separate wait method could be reasonable, but I'd imagine that'd have some issues too (like delayed error reporting). Need to weigh the pros and cons there.

@DirectXMan12
Contributor

probably at least log immediately
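Putting those two comments together, rolling the wait into Start could look roughly like this (a sketch that assumes the WaitGroup accounting from the earlier comment; it is not the design that was eventually adopted):

// Inside Manager.Start, after all Runnables have been started:
select {
case err := <-cm.errChan:
    // A Runnable failed while running; return the error immediately.
    return err
case <-stop:
    // Stop requested: wait for every Runnable goroutine to return before
    // Start itself returns. Errors reported during shutdown are logged
    // right away instead of being returned (the delayed-error concern above).
    done := make(chan struct{})
    go func() {
        cm.waitGroup.Wait()
        close(done)
    }()
    for {
        select {
        case err := <-cm.errChan:
            log.Error(err, "error while shutting down a runnable")
        case <-done:
            return nil
        }
    }
}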

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 20, 2019
@seh

seh commented Nov 21, 2019

I’m bitten by this problem with my use of an instrumentation package that I try to shut down cleanly before the process exits. My predicates and reconcilers are making calls into this package, and there’s no way I can see that I can wait for them all to finish before trying to close the connection they’re all using.

It’s one thing to know that the manager observed the “stop” channel becoming ready, when it returns from Start, but that says nothing about the current state of the Runnables—even those that ask for the “stop” channel to be injected.

Using sync.WaitGroup sounds like the way to go. Whether delaying the return from Start should be something you can opt into or out of warrants consideration.

@alexeldeib
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 2, 2019
@vincepri
Member

Closing in favor of #764

/close

@k8s-ci-robot
Contributor

@vincepri: Closing this issue.

In response to this:

Closing in favor of #764

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
