🐛 Machine deletion skips waiting for volumes detached for unreachable Nodes #10662
Hi @typeid. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
This is how OpenShift handles this situation as well: https://github.com/openshift/machine-api-operator/blob/dcf1387cb69f8257345b2062cff79a6aefb1f5d9/pkg/controller/machine/drain_controller.go#L164-L171. Furthermore, this allows pods to be deleted successfully, which triggers CSI drivers to detach the relevant volumes, which in turn allows the machine to be replaced.
/lgtm
LGTM label has been added. Git tree hash: 038bbf764ac6107032bcd184e1434415cc660f3d
@willie-yao @Jont828 Could you sanity check the equivalent functionality for MachinePoolMachines? Do we need changes there as well? |
Do MachineHealthChecks work with MachinePool Machines? (I just don't remember that we implemented anything for it, at least in core CAPI) |
/ok-to-test
I know for a fact that MachineHealthCheck is not implemented for ClusterClass MachinePools, as stated in the "Later" section of issue #5991. @Jont828 are you able to verify if they are implemented for MachinePoolMachines? |
@typeid can you clarify where exactly in this https://github.com/kubernetes/kubectl/blob/master/pkg/drain/drain.go#L268-L360 flow this makes the difference? |
@enxebre Skipping a couple steps here, but That said, this behavior could also be handled via https://kubernetes.io/docs/concepts/cluster-administration/node-shutdown/#non-graceful-node-shutdown instead. The description there describes exactly the symptoms of this bug as well.
I guess for us to take advantage of it the instance would need to be in the process of being shut down already, so it makes sense to me to mimic it earlier, at least for now.
So does this effectively invalidate our SkipWaitForDeleteTimeoutSeconds, since all pods will match GracePeriodSeconds and get terminated first? |
I initially had the same understanding as @mjlshen's explanation, but I did more digging and I'm no longer sure just setting the grace period to 1 is actually solving the issue here. Reason:
So I would expect the following to happen (current logic as well as with 1 second grace period - nothing differs in terms of pod deletion):
However, at this point, the eviction API has already been called (in both cases, current vs. PR), and the deletion grace period should be set on the pod, so the API server should take care of it. I re-tested draining with a grace period of 1, and it doesn't delete the pods for my reproducer. I thought I had tried that before; the fact that it doesn't work confirms the logic above, I think. Setting the grace period to 0, however, results in the pods being deleted. Further digging in apiserver...
The logic after that gets a bit complex, but it seems like setting it to I tried reproducing this with MAPI MHC, and it skips the pod deletion but doesn't get stuck replacing the node: I'm not sure what happens with the volume. Maybe it doesn't block the node deletion due to existing volumes? I'll have to dig further on that.
Not necessarily. I think in the case the pods still don't delete even with grace period = 1, we would still skip them after their I'll sync up with @mjlshen to get the same understanding and see if I'm possibly missing something here. :) |
With a pending source code verification I assume the second case won't help in this scenario because it requires kubelet coordination via pod status, which is not possible because the kubelet is unreachable.
Yes, decoupling volume detachment from draining is not implemented in MAPI. So to summarise, the erratic behaviour is that we have a semantic to ack an unreachable Node and an opinion on how to react to it (we assume that giving up on those pods after 5min is safe as the eviction request is satisfied by PDBs), but we are not consistent on that opinion for volume detachment. FWIW see related old discussion when we introduced this semantic #2165 (comment) |
That sounds good to me. I think we should still make the drain consistent with what MAPI does for unreachable nodes: setting GracePeriod and the PodDeletionTimeout both to 1 second. While this still would not fix issue #10661 (to my current understanding), it would at least allow faster replacement of unreachable nodes. I will open a new issue for that and link a new PR, to avoid causing extra confusion by remapping this PR. |
LGTM label has been added. Git tree hash: d99ffbe32ca8dfdfe705b34e26485170293b14c8
ptal @chrischdi @sbueringer |
I'll take a look, but I need some time to get a full picture |
Implementation makes sense to me and matches the issue /lgtm |
Just a quick update, I'm looking into this. Just have to double check something with folks downstream. I'll get back to you ASAP. |
/lgtm Slightly renamed the PR, because this affects more than just MHC. /hold |
/hold cancel |
Thx folks! |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: sbueringer The full list of commands accepted by this bot can be found here. The pull request process is described here
/cherry-pick release-1.7 |
@sbueringer: new pull request created: #10765 In response to this:
What this PR does / why we need it:
When MHC are remediating an unresponsive node, CAPI drains the node and gives up waiting for pods to be deleted after the SkipWaitForDeleteTimeoutSeconds. If one of the pods it gave up on has an attached volume, the machine deletion gets stuck.
This PR aligns the behavior for volume detachment during deletion to what we're doing with the drains: if the node is unreachable, we skip detaching volumes.
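The decision described above can be sketched in Go as follows. This is an illustrative sketch under stated assumptions, not the code from this PR: `nodeInfo`, `isNodeUnreachable`, and `shouldWaitForVolumeDetach` are hypothetical names, and treating a `Ready` condition of `Unknown` as "unreachable" is an assumption made for the example.

```go
package main

// ConditionStatus mirrors the three values a Kubernetes condition can take.
type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)

// nodeInfo is a hypothetical, trimmed-down view of a Node: only the Ready
// condition and the names of the volumes still reported as attached.
type nodeInfo struct {
	Ready           ConditionStatus
	VolumesAttached []string
}

// isNodeUnreachable treats a Ready condition of Unknown as unreachable.
// This mapping is an assumption for the sketch.
func isNodeUnreachable(n nodeInfo) bool {
	return n.Ready == ConditionUnknown
}

// shouldWaitForVolumeDetach reports whether machine deletion should keep
// waiting for volumes to detach. For an unreachable node the wait is
// skipped, mirroring how the drain gives up on pods there.
func shouldWaitForVolumeDetach(n nodeInfo) bool {
	if isNodeUnreachable(n) {
		return false
	}
	return len(n.VolumesAttached) > 0
}
```

With this shape, a node whose `Ready` condition is `Unknown` never blocks deletion on attached volumes, while a reachable node with attached volumes still does.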
Which issue(s) this PR fixes
Fixes #10661
/area machine