🐛 Speed up ignoring terminating Pods when draining unreachable Nodes #10706
Hi @typeid. Thanks for your PR.
I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label.
I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/lgtm
LGTM label has been added. Git tree hash: 3be733612fe5af9c5e0a93e5ef326ea019f46b3a
/ok-to-test
It would be good, though, if we could add an e2e test covering this behavior and its expectations.
/retest
@vincepri I can definitely take a look at doing that. It's going to be my first shot at adding an E2E test here, so no promises. I'll check if there's something I can re-use, but an unreachable node doesn't seem to be something we already have a base for.
Just to clarify, since this topic gets fuzzy every time we come back to it after a while: I would replace "unhealthy" (ambiguous) with "Unreachable" (ready = Unknown), which is the only condition under which we set this timeout, because we assume those pods are gone. This PR is irrelevant for other "Unhealthy" criteria.
@enxebre / @vincepri The E2E doesn't seem trivial. I would skip adding the E2E for now if that's okay. Shall I create a follow-up issue to add functionality in our test framework to render nodes unreachable? This would allow us to improve our coverage in other areas as well. Adding a unit test here would also not add any coverage of value IMO and would require more code changes.
I'm good with that. I'll defer to Vince to cancel the hold as he sees fit.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: enxebre
The full list of commands accepted by this bot can be found here.
The pull request process is described here.
Co-authored-by: Michael Shen <[email protected]>
/lgtm
LGTM label has been added. Git tree hash: 4f450481d31b49cacb7dda4d01cf0c96c6da8086
/lgtm
Thx everyone for the explanations!
/hold
/hold cancel
/lgtm
/hold
If that's okay? EDIT: I think tide is going to merge regardless again :/ (I think once a PR is in the merge pool there is no way to get it out.)
Sorry, I was under the impression, per Alberto's comment, that we were waiting for my review 😄
Thx everyone!
/hold cancel
/cherry-pick release-1.7
@sbueringer: new pull request created: #10766, in response to the /cherry-pick release-1.7 request.
What this PR does / why we need it:
This PR reduces the time we wait for pod deletion when a node is unreachable. Currently, we wait for a pod's deletionTimestamp to be 5 minutes old before ignoring it. The goal is to allow MHC to replace unreachable nodes faster.
This PR aligns the behavior of CAPI with OpenShift MAPI: https://github.com/openshift/machine-api-operator/blob/dcf1387cb69f8257345b2062cff79a6aefb1f5d9/pkg/controller/machine/drain_controller.go#L164-L171
This PR is the result of a discussion here.
Which issue(s) this PR fixes
Fixes #10705
Area example:
/area machine