🐛 Speed up ignoring terminating Pods when draining unreachable Nodes #10706

typeid · 2024-05-30T15:43:54Z

What this PR does / why we need it:

This PR reduces the time we wait for pod deletion when a node is unreachable: we currently wait until a pod's deletionTimestamp is 5 minutes old before ignoring it. The goal is to allow MHC to replace unreachable nodes faster.

This PR aligns the behavior of CAPI with OpenShift MAPI: https://github.com/openshift/machine-api-operator/blob/dcf1387cb69f8257345b2062cff79a6aefb1f5d9/pkg/controller/machine/drain_controller.go#L164-L171

This PR is the result of a discussion here.

Which issue(s) this PR fixes
Fixes #10705

Area example:
/area machine

k8s-ci-robot · 2024-05-30T15:44:03Z

Hi @typeid. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

internal/controllers/machine/machine_controller.go

vincepri

/lgtm

k8s-ci-robot · 2024-05-30T21:43:25Z

LGTM label has been added.

Git tree hash: 3be733612fe5af9c5e0a93e5ef326ea019f46b3a

vincepri · 2024-05-30T21:43:29Z

/ok-to-test

vincepri · 2024-05-30T21:43:45Z

It would be good, though, if we could add an e2e test covering this behavior and its expectations

typeid · 2024-05-30T22:10:01Z

/retest

It would be good, though, if we could add an e2e test covering this behavior and its expectations

@vincepri I can definitely take a look at doing that. It's going to be my first shot at adding an E2E test here, so no promises. I'll check if there's something I can re-use, but an unreachable node doesn't seem to be something we already have a base for.

enxebre · 2024-05-31T07:03:52Z

This PR reduces the time we wait for pod deletion in the case of unhealthy nodes:

Just to clarify, as this topic gets fuzzy every time we come back to it after a while: I would replace "unhealthy" (ambiguous) with "Unreachable" (ready = Unknown), which is the only case where we set this timeout, because we assume those pods are gone. This PR is irrelevant for other "Unhealthy" criteria.
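To make the "Unreachable" (ready = Unknown) distinction concrete, here is a minimal Go sketch. It uses small local types rather than the Kubernetes API types, and every name in it is hypothetical; it only shows that a node with Ready = False is unhealthy but not unreachable.

```go
package main

import "fmt"

// conditionStatus mirrors the three values a node condition can take.
type conditionStatus string

const (
	statusTrue    conditionStatus = "True"
	statusFalse   conditionStatus = "False"
	statusUnknown conditionStatus = "Unknown"
)

// nodeCondition is a simplified stand-in for a node status condition.
type nodeCondition struct {
	Type   string
	Status conditionStatus
}

// isUnreachable returns true only when the Ready condition is present and
// its status is Unknown. A missing Ready condition is treated as reachable
// in this sketch, which is an assumption.
func isUnreachable(conds []nodeCondition) bool {
	for _, c := range conds {
		if c.Type == "Ready" {
			return c.Status == statusUnknown
		}
	}
	return false
}

func main() {
	fmt.Println(isUnreachable([]nodeCondition{{Type: "Ready", Status: statusUnknown}}))
	fmt.Println(isUnreachable([]nodeCondition{{Type: "Ready", Status: statusFalse}}))
	fmt.Println(isUnreachable([]nodeCondition{{Type: "Ready", Status: statusTrue}}))
}
```

Only the first node would get the shortened timeout; the other two keep the normal drain behavior.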

typeid · 2024-06-03T09:33:39Z

@enxebre / @vincepri The E2E doesn't seem trivial. I would skip adding the E2E for now if that's okay. Shall I create a follow-up issue to add functionality in our test framework to render nodes unreachable? This would allow us to improve our coverage in other areas as well.

Adding a unit test here would also not add any coverage of value IMO and would require more code changes.

enxebre · 2024-06-03T11:22:35Z

I'm good with that. I'll defer to Vince to cancel the hold as he sees fit
/approve
/hold

k8s-ci-robot · 2024-06-03T11:22:41Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [enxebre]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

linux-foundation-easycla · 2024-06-04T09:43:26Z

The committers listed above are authorized under a signed CLA.

✅ login: typeid (debbc3a)

Co-authored-by: Michael Shen <[email protected]>

enxebre · 2024-06-06T08:22:10Z

/lgtm
/assign @sbueringer

k8s-ci-robot · 2024-06-06T08:22:17Z

LGTM label has been added.

Git tree hash: 4f450481d31b49cacb7dda4d01cf0c96c6da8086

internal/controllers/machine/machine_controller.go

sbueringer · 2024-06-14T09:39:26Z

/lgtm

Thx everyone for the explanations!

/hold
Similar to the other PR, I asked some other folks for their opinion. I'll get back to you ASAP (I expect early next week)

vincepri

/hold cancel
/lgtm

sbueringer · 2024-06-14T16:00:48Z

Similar to the other PR, I asked some other folks for their opinion. I'll get back to you ASAP (I expect early next week)

/hold

If that's okay?

EDIT: I think tide is going to merge regardless again :/ (I think once a PR is in the merge pool there is no way to get it out of it)
EDIT2: Seems like it worked this time

vincepri · 2024-06-14T16:03:47Z

Sorry I was under the impression per Alberto's comment that we were waiting for my review 😄

sbueringer · 2024-06-17T06:20:08Z

Thx everyone!

/hold cancel

sbueringer · 2024-06-17T08:40:44Z

/cherry-pick release-1.7

k8s-infra-cherrypick-robot · 2024-06-17T08:41:21Z

@sbueringer: new pull request created: #10766

In response to this:

/cherry-pick release-1.7

k8s-ci-robot added the area/machine, cncf-cla: yes, and needs-ok-to-test labels May 30, 2024

k8s-ci-robot added the size/S label May 30, 2024

k8s-ci-robot requested review from killianmuldoon and vincepri May 30, 2024 15:44

typeid commented May 30, 2024

internal/controllers/machine/machine_controller.go

typeid force-pushed the 10705 branch from b100a83 to 6fcf774 Compare May 30, 2024 15:50

vincepri reviewed May 30, 2024

k8s-ci-robot assigned vincepri May 30, 2024

k8s-ci-robot added the lgtm label May 30, 2024

k8s-ci-robot added ok-to-test and removed needs-ok-to-test labels May 30, 2024

k8s-ci-robot added the do-not-merge/hold label Jun 3, 2024

k8s-ci-robot added the approved label Jun 3, 2024

typeid force-pushed the 10705 branch from 6fcf774 to b743ca7 Compare June 4, 2024 09:43

k8s-ci-robot removed the lgtm label Jun 4, 2024

k8s-ci-robot requested a review from vincepri June 4, 2024 09:43

k8s-ci-robot added cncf-cla: no and removed cncf-cla: yes labels Jun 4, 2024

typeid force-pushed the 10705 branch from b743ca7 to 3874e8a Compare June 4, 2024 09:44

fix: delayed MHC replacement of unreachable nodes

debbc3a

Co-authored-by: Michael Shen <[email protected]>

typeid force-pushed the 10705 branch from 3874e8a to debbc3a Compare June 4, 2024 09:48

k8s-ci-robot added cncf-cla: yes and removed cncf-cla: no labels Jun 4, 2024

k8s-ci-robot assigned sbueringer and enxebre Jun 6, 2024

k8s-ci-robot added the lgtm label Jun 6, 2024

sbueringer reviewed Jun 13, 2024

internal/controllers/machine/machine_controller.go

sbueringer changed the title ~~🐛 Fix delayed MHC replacement of unreachable nodes~~ 🐛 Speed up ignoring terminating Pods when draining unreachable Nodes Jun 13, 2024

vincepri approved these changes Jun 14, 2024

k8s-ci-robot removed the do-not-merge/hold label Jun 14, 2024

k8s-ci-robot added the do-not-merge/hold label Jun 14, 2024

k8s-ci-robot removed the do-not-merge/hold label Jun 17, 2024

k8s-ci-robot merged commit 5b6043e into kubernetes-sigs:main Jun 17, 2024
26 checks passed

k8s-ci-robot added this to the v1.8 milestone Jun 17, 2024

k8s-infra-cherrypick-robot mentioned this pull request Jun 17, 2024

[release-1.7] 🐛 Speed up ignoring terminating Pods when draining unreachable Nodes #10766

Merged
