🐛 Wait for runnables to stop fix for #350 and #429 #664

dbenque
Contributor

@dbenque dbenque commented Oct 29, 2019

This PR fixes 🐛 #350 and 🐛 #429

The manager.Start function now returns only once all Runnables have returned, or once a timeout has elapsed.

The timeout value can be set through manager.Options.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Oct 29, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: dbenque
To complete the pull request process, please assign pwittrock
You can assign the PR to them by writing /assign @pwittrock in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 29, 2019
@k8s-ci-robot
Contributor

Welcome @dbenque!

It looks like this is your first PR to kubernetes-sigs/controller-runtime 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/controller-runtime has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @dbenque. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 29, 2019
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 29, 2019
@dbenque dbenque changed the title 🐛 Wait for runnable to stop fix for #350 and #429 Wait for runnables to stop fix for #350 and #429 Oct 29, 2019
@dbenque dbenque force-pushed the david.benque/wait-for-runnable-to-stop branch 2 times, most recently from 97d3e5f to c013ca0 Compare October 29, 2019 12:48
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Oct 29, 2019
@dbenque
Contributor Author

dbenque commented Oct 29, 2019

/assign @pwittrock

@dbenque dbenque changed the title Wait for runnables to stop fix for #350 and #429 🐛 Wait for runnables to stop fix for #350 and #429 Oct 29, 2019
@alvaroaleman
Member

@dbenque can you fix the conflicts?

@dbenque dbenque force-pushed the david.benque/wait-for-runnable-to-stop branch from c013ca0 to 20e791d Compare November 13, 2019 17:41
@dbenque
Contributor Author

dbenque commented Nov 13, 2019

@alvaroaleman the PR has been rebased and the conflicts resolved.

Note that I had to rework part of fe4ada0.

This PR (#664) gives the same result, but it does not black-hole runnable errors: fe4ada0#diff-77faf6b20512574869434402d5c5b6a2R179

This is important because we want to wait for the runnables to stop, so we must catch and handle errors while the runnables are stopping.

@droot droot assigned mengqiy and unassigned pwittrock Nov 14, 2019
@mengqiy
Member

mengqiy commented Nov 15, 2019

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 15, 2019
@mengqiy
Member

mengqiy commented Nov 15, 2019

This reverts many changes in #651
/assign @DirectXMan12

@dbenque
Contributor Author

dbenque commented Nov 15, 2019

@DirectXMan12 regarding #651: this PR achieves the same result and also catches all runnable errors during the teardown period. With this PR no error should be silently dropped, and at the same time we wait for all runnables to stop (or time out).
I hope that helps with the review.

@alexeldeib
Contributor

I like this approach. It seems to me like the error draining works nicely here. The same blocking mentioned in #651 (comment) can occur if more than one controller errors out around L392, but since the stop routine drains the channel while it handles proper shutdown, it won't actually block the controllers from exiting. I like this more than the error signaler, personally.

One thing I noticed, the runnables are all wired up but manager brings up the metrics endpoint and healthprobes separately. Do we care about gracefully terminating those as well?

@dbenque
Contributor Author

dbenque commented Dec 6, 2019

> One thing I noticed, the runnables are all wired up but manager brings up the metrics endpoint and healthprobes separately. Do we care about gracefully terminating those as well?

@alexeldeib, you are right. To avoid that on my side, I explicitly disable the metrics server embedded in the manager (ManagerOptions.MetricsBindAddress = "0") and have created a dedicated runnable for metrics and another one for the health probes. I then start them like any other runnable, so they benefit from the stop sequence implemented in this PR.

Contributor

@DirectXMan12 DirectXMan12 left a comment

minor comments inline, agree with @alexeldeib that we should be treating the servers as runnables too -- we don't want goroutine leaks on shutdown again.

return err
}
}
func (cm *controllerManager) engageStopProcedure(stopComplete chan struct{}) error {

return cm.waitForRunnableToEnd()
}

func (cm *controllerManager) waitForRunnableToEnd() error {
Contributor

this whole section needs an overview comment of the stop procedure stuff

allStopped := make(chan struct{})

go func() {
cm.waitForRunnable.Wait()
Contributor

this'll leak through a timeout

Contributor

not sure there's a good way around it though

@@ -497,7 +503,7 @@ func (cm *controllerManager) waitForCache() {
 	}
 	go func() {
 		if err := cm.startCache(cm.internalStop); err != nil {
-			cm.errSignal.SignalError(err)
+			cm.errChan <- err
Contributor

I kinda feel like we should never have anything writing to the error channel directly like this, and instead just wrap everything in a runnable to avoid accidentally forgetting to increment the runnable counter.

@DirectXMan12
Contributor

(as a follow up PR for someone -- a test that ensured we don't add any additional leaked goroutines in new code would be nice -- just start the manager then stop it, and use runtime to check the goroutine count)

@DirectXMan12
Contributor

(I'll file an issue)

@DirectXMan12
Contributor

(#724)

@k8s-ci-robot
Contributor

@dbenque: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
pull-controller-runtime-test-master 20e791d link /test pull-controller-runtime-test-master

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@DirectXMan12
Contributor

hey, are you still interested in working on this?

@vincepri
Member

@dbenque are you still interested in working on this change?

@k8s-ci-robot
Contributor

@dbenque: PR needs rebase.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 20, 2020
@vincepri vincepri added this to the Next milestone Feb 21, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 21, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 20, 2020
@vincepri
Member

Closing for inactivity, feel free to reopen if necessary.

/close

@k8s-ci-robot
Contributor

@vincepri: Closed this PR.

In response to this:

Closing for inactivity, feel free to reopen if necessary.

/close

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
9 participants