🐛 Wait for runnables to stop fix for #350 and #429 #664
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: dbenque. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Welcome @dbenque!
Hi @dbenque. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
The branch was force-pushed from 97d3e5f to c013ca0.
/assign @pwittrock
@dbenque can you fix the conflicts?
The branch was force-pushed from c013ca0 to 20e791d.
@alvaroaleman the PR has been rebased and the conflicts resolved. Note that I had to rework part of fe4ada0. This PR #664 gives the same result but does not black-hole runnable errors: fe4ada0#diff-77faf6b20512574869434402d5c5b6a2R179. This is important because we want to wait for runnables to stop, so we must catch and handle errors while the runnables are stopping.
/ok-to-test
This reverts many changes in #651
@DirectXMan12 regarding #651, this PR achieves the same result and also catches all errors from runnables during the teardown period. No error should be silently dropped with this PR, and at the same time we wait for all runnables to stop (or time out).
I like this approach. It seems to me like the error draining works nicely here. The same blocking mentioned in #651 (comment) can occur if more than one controller errors out around L392, but since the stop routine drains the channel while it handles proper shutdown, it won't actually block the controllers from exiting. I like this more than the error signaler, personally. One thing I noticed: the runnables are all wired up, but the manager brings up the metrics endpoint and health probes separately. Do we care about gracefully terminating those as well?
@alexeldeib, you are right; to avoid that, on my side I explicitly disable the metrics endpoint.
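A minimal sketch (not this PR's exact code) of the drain-while-waiting idea discussed above: the stop procedure keeps reading the error channel so runnables that fail while exiting never block on the send, and it returns once the WaitGroup tracking runnables is done or a timeout elapses. The trimmed struct and the timeout handling here are assumptions for illustration; only the field names waitForRunnable and errChan come from the diff below.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Trimmed-down manager used only to illustrate the shutdown flow.
type controllerManager struct {
	waitForRunnable sync.WaitGroup // incremented for every started runnable
	errChan         chan error     // every runnable reports its error here
}

func (cm *controllerManager) waitForRunnableToEnd(timeout time.Duration) error {
	allStopped := make(chan struct{})
	go func() {
		cm.waitForRunnable.Wait() // returns once every runnable has exited
		close(allStopped)
	}()

	var firstErr error
	for {
		select {
		case err := <-cm.errChan:
			// Drain teardown errors instead of black-holing them; keep the first.
			if err != nil && firstErr == nil {
				firstErr = err
			}
		case <-allStopped:
			return firstErr
		case <-time.After(timeout):
			return fmt.Errorf("not all runnables stopped within %s", timeout)
		}
	}
}
```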
minor comments inline; agree with @alexeldeib that we should be treating the servers as runnables too -- we don't want goroutine leaks on shutdown again.
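A sketch of what "treat the servers as runnables" could look like, written against the stop-channel Runnable contract controller-runtime used at the time (Start(stop <-chan struct{}) error). The serveEndpoint helper is an illustration, not code from this PR.

```go
package main

import (
	"net"
	"net/http"
)

// serveEndpoint serves srv on ln until stop is closed, then closes the server
// so its goroutines are torn down along with the other runnables.
func serveEndpoint(srv *http.Server, ln net.Listener) func(stop <-chan struct{}) error {
	return func(stop <-chan struct{}) error {
		serveErr := make(chan error, 1)
		go func() { serveErr <- srv.Serve(ln) }()
		select {
		case err := <-serveErr:
			return err // the server died on its own
		case <-stop:
			return srv.Close() // manager is shutting down; stop serving
		}
	}
}
```

The returned closure could then be registered with the manager like any other runnable, so the metrics and health endpoints would shut down, and report errors, the same way controllers do.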
		return err
	}
}
func (cm *controllerManager) engageStopProcedure(stopComplete chan struct{}) error {
	return cm.waitForRunnableToEnd()
}

func (cm *controllerManager) waitForRunnableToEnd() error {
this whole section needs an overview comment of the stop procedure stuff
	allStopped := make(chan struct{})

	go func() {
		cm.waitForRunnable.Wait()
this'll leak through a timeout
not sure there's a good way around it though
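To make the leak concrete, separately from the PR's code (names here are illustrative): if the timeout path fires first, the goroutine blocked in Wait() stays alive until the remaining runnables eventually return, which may be never.

```go
package main

import (
	"errors"
	"sync"
	"time"
)

func waitWithTimeout(wg *sync.WaitGroup, timeout time.Duration) error {
	done := make(chan struct{})
	go func() {
		wg.Wait() // leaks past the timeout if some runnable never stops
		close(done)
	}()
	select {
	case <-done:
		return nil
	case <-time.After(timeout):
		return errors.New("timed out; the waiter goroutine above is leaked")
	}
}
```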
@@ -497,7 +503,7 @@ func (cm *controllerManager) waitForCache() {
	}
	go func() {
		if err := cm.startCache(cm.internalStop); err != nil {
-			cm.errSignal.SignalError(err)
+			cm.errChan <- err
I kinda feel like we should never have anything writing to the error channel directly like this, and instead just wrap everything in a runnable to avoid accidentally forgetting to increment the runnable counter.
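A sketch of the reviewer's suggestion, with illustrative names and a trimmed-down manager: a single startRunnable helper owns both the WaitGroup increment and the error send, so nothing else, cache startup included, writes to errChan directly.

```go
package main

import "sync"

type Runnable interface {
	Start(stop <-chan struct{}) error
}

type runnableFunc func(stop <-chan struct{}) error

func (f runnableFunc) Start(stop <-chan struct{}) error { return f(stop) }

type controllerManager struct {
	waitForRunnable sync.WaitGroup
	errChan         chan error
	internalStop    chan struct{}
}

func (cm *controllerManager) startRunnable(r Runnable) {
	cm.waitForRunnable.Add(1) // always paired with the Done below
	go func() {
		defer cm.waitForRunnable.Done()
		if err := r.Start(cm.internalStop); err != nil {
			cm.errChan <- err // the only place that writes to errChan
		}
	}()
}

// Stand-in for the real cache startup.
func (cm *controllerManager) startCache(stop <-chan struct{}) error { return nil }

// waitForCache would then go through startRunnable instead of touching errChan:
func (cm *controllerManager) waitForCache() {
	cm.startRunnable(runnableFunc(cm.startCache))
}
```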
(as a follow-up PR for someone -- a test that ensures we don't add any additional leaked goroutines in new code would be nice -- just start the manager then stop it, and use
(I'll file an issue)
(#724)
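One possible shape for that leak-check test (the comment above is cut off, so the tool it had in mind is unknown); this version simply compares runtime.NumGoroutine before and after a start/stop cycle. It assumes the stop-channel Start signature of controller-runtime at the time, and that cfg is a *rest.Config populated elsewhere by the test suite (for example via envtest).

```go
package manager_test

import (
	"runtime"
	"testing"
	"time"

	"k8s.io/client-go/rest"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

var cfg *rest.Config // assumed to be set up by the suite

func TestStopLeavesNoExtraGoroutines(t *testing.T) {
	before := runtime.NumGoroutine()

	mgr, err := manager.New(cfg, manager.Options{MetricsBindAddress: "0"}) // "0" disables the metrics server
	if err != nil {
		t.Fatal(err)
	}

	stop := make(chan struct{})
	done := make(chan error, 1)
	go func() { done <- mgr.Start(stop) }()

	close(stop)
	if err := <-done; err != nil {
		t.Fatal(err)
	}

	time.Sleep(100 * time.Millisecond) // give exiting goroutines a moment to unwind
	if after := runtime.NumGoroutine(); after > before {
		t.Fatalf("possible goroutine leak: %d before, %d after", before, after)
	}
}
```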
@dbenque: The following test failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
hey, are you still interested in working on this?
@dbenque are you still interested in working on this change?
@dbenque: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Closing for inactivity, feel free to reopen if necessary. /close
@vincepri: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This PR fixes 🐛 #350 and 🐛 #429.
The manager.Start function now returns only when all Runnables have properly returned or a timeout has elapsed.
The timeout value can be configured via manager.Options.
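A usage sketch only: StopTimeout below is a placeholder spelling, since the description does not show the exact field this PR adds to manager.Options, and Manager.Start still takes a stop channel in the controller-runtime API of this era.

```go
package main

import (
	"time"

	"sigs.k8s.io/controller-runtime/pkg/client/config"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

func main() {
	mgr, err := manager.New(config.GetConfigOrDie(), manager.Options{
		StopTimeout: 30 * time.Second, // hypothetical option name: how long Start waits for runnables after stop closes
	})
	if err != nil {
		panic(err)
	}

	stop := make(chan struct{})
	// ... register controllers, then close(stop) from a signal handler ...

	// With this change, Start returns only after every runnable has returned,
	// or once the configured timeout has elapsed.
	if err := mgr.Start(stop); err != nil {
		panic(err)
	}
}
```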