[FEATURE]: CSM Resiliency supports evacuation of pods during NoExecute taint on node #87

Closed
eanjtab opened this issue Nov 1, 2021 · 7 comments

eanjtab commented Nov 1, 2021

Describe the feature

The original Resiliency design refused to force delete pods if they are potentially doing I/O to the array. The customer's request is a valid one, although not quite as safe as the current behavior of Resiliency. I would propose that we make this behavior an option, i.e.:

  • A new configuration variable is introduced, e.g. "forceDeleteOnNoExecuteTaint" or something similar, that changes the current behavior: if podmon receives a notification that a pod is Not Ready and its node has a "NoExecute" taint, we force delete the pod (see the sketch after this list). The default would be false, preserving the current behavior.
  • If the pod has a grace period toleration for NoExecute (as most do; the default is 5 minutes), we need to force delete before the grace period expires so that no replacement pod is scheduled onto the node being evacuated because of pod affinity. We could do it immediately upon receiving the NoExecute notification, or perhaps wait half the duration of the toleration (normally 2-3 minutes) in case the node becomes Ready again quickly.
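
A minimal sketch of what such an option could look like, assuming a hypothetical forceDeleteOnNoExecuteTaint flag and using client-go; the names are illustrative and not the actual podmon implementation:

```go
// Hypothetical sketch only; option and function names are illustrative,
// not the shipped podmon code.
package sketch

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// hasNoExecuteTaint reports whether the node carries any NoExecute taint.
func hasNoExecuteTaint(node *corev1.Node) bool {
	for _, t := range node.Spec.Taints {
		if t.Effect == corev1.TaintEffectNoExecute {
			return true
		}
	}
	return false
}

// tolerationWindow returns half of the pod's NoExecute toleration period,
// the point at which the second bullet above suggests acting if the node
// is still not Ready. Zero means the pod has no bounded NoExecute toleration.
func tolerationWindow(pod *corev1.Pod) time.Duration {
	for _, tol := range pod.Spec.Tolerations {
		if tol.Effect == corev1.TaintEffectNoExecute && tol.TolerationSeconds != nil {
			return time.Duration(*tol.TolerationSeconds) * time.Second / 2
		}
	}
	return 0
}

// maybeForceDelete force-deletes the pod (grace period 0) when the
// hypothetical option is enabled and the pod's node is tainted NoExecute.
// The caller is assumed to have already observed the pod as Not Ready.
func maybeForceDelete(ctx context.Context, cs kubernetes.Interface,
	pod *corev1.Pod, node *corev1.Node, forceDeleteOnNoExecuteTaint bool) error {
	if !forceDeleteOnNoExecuteTaint || !hasNoExecuteTaint(node) {
		return nil // keep the current, conservative behavior
	}
	zero := int64(0)
	return cs.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name,
		metav1.DeleteOptions{GracePeriodSeconds: &zero})
}
```

The tolerationWindow helper corresponds to the "wait half the toleration" idea in the second bullet; whether to act immediately or after that window would be a design choice.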

Additional context
This feature has been converted from a bug. Logs have been attached.

session-logs.txt

eanjtab added the needs-triage and type/bug labels on Nov 1, 2021
rbo54 commented Nov 2, 2021

Hello, my analysis shows that podmon is working as designed. Please see the analysis write-up in CSM-87-analysis.txt.

gallacher added the triage/works-as-intended label and removed the needs-triage and type/bug labels on Nov 2, 2021
gallacher (Contributor) commented:

Closing issue since it works as designed.

eanjtab commented Nov 2, 2021

Hello Tom,

I saw the response. So, since the NoExec taint was not placed, podmon did not actually migrate the pods. If NoExec had been placed, would it have migrated them?

rbo54 commented Nov 3, 2021

Hello,
If you look at my analysis text, you will see podmon received two notifications from Kubernetes. The first showed the pod was not ready and the node had a nosched taint but not a noexec taint:

    time="2021-11-02T12:11:31Z" level=info msg="podMonitorHandler: namespace: pmtv1 name: podmontest-85fcc7957c-c94mg nodename: twk8s-10-247-102-219 initialized: true ready: false taints [nosched: true noexec: false podmon: false ]"

The second had both nosched and noexec true:

    time="2021-11-02T12:16:35Z" level=info msg="podMonitorHandler: namespace: pmtv1 name: podmontest-85fcc7957c-c94mg nodename: twk8s-10-247-102-219 initialized: true ready: false taints [nosched: true noexec: true podmon: false ]"
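
(For reference, the nosched/noexec flags in those log lines map directly to the node's taint effects. A small illustrative sketch, not podmon's actual code:)

```go
// Illustrative only: how the nosched/noexec flags in the log lines above can
// be derived from a node's taints. (The "podmon" flag refers to a
// podmon-specific taint; its key is omitted here.)
package sketch

import corev1 "k8s.io/api/core/v1"

// taintFlags mirrors the "taints [nosched: ... noexec: ...]" fields in the log.
func taintFlags(node *corev1.Node) (noSched, noExec bool) {
	for _, t := range node.Spec.Taints {
		switch t.Effect {
		case corev1.TaintEffectNoSchedule:
			noSched = true
		case corev1.TaintEffectNoExecute:
			noExec = true
		}
	}
	return noSched, noExec
}
```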
Podmon responds the same way in either case: if it detects that the node is connected to the array, and especially if I/O is in progress, it aborts the cleanup, because that is the safest course.
An option we could consider would be to force a reboot of the node if a noexec taint was received and we needed to clean up a pod that still had an array connection or active I/O. Effectively, that is what Kubernetes is trying to do anyway by sending noexec, i.e., get workloads off the node.
Let me know your thoughts.
Tom

eanjtab commented Nov 3, 2021

Thanks, Tom, for the explanation. I see that in either case, NoSched or NoExec, podmon takes no action because it detects that array connectivity is established and healthy. But I feel we need to change this now, because there are different situations in which kube-api would place these taints and we would end up with pods stuck in the Terminating phase. For example, if the internal Kubernetes heartbeat network goes down, we could end up in a similar situation.

So a good option here would definitely be to look at the reboot approach when the NoExec taint is placed on the node. Who would trigger the reboot? Kube-api, or can podmon do it?

Please see my messages on Teams; do you have time for a quick call?

gallacher added the type/feature and area/csm-resiliency labels and removed the triage/works-as-intended label on Jan 24, 2022
gallacher added this to the v1.2.0 milestone on Jan 24, 2022
gallacher reopened this on Jan 24, 2022
gallacher changed the title from "[BUG]: pods get stuck in terminating phase when kublet is made offline" to "[FEATURE]: Support evacuation of pods during NoExecute taint on node" on Jan 24, 2022
rbo54 commented Jan 31, 2022

We are currently testing a change to address this and plan to ship it in the next release at the end of Q1 2022. At this time, we are not planning to automatically reboot the node; rather, we will allow the pods to be deleted if we receive a NoExecute taint and the SkipArrayConnectionValidation option is set to true.
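
A rough sketch of that condition (illustrative names only, not the shipped podmon code):

```go
// Sketch of the decision described above; the caller is assumed to have
// already evaluated the node's taints and the array connection state.
package sketch

// allowPodCleanup reports whether podmon may proceed with cleaning up a pod.
func allowPodCleanup(nodeHasNoExecute, skipArrayConnectionValidation, arrayConnectionActive bool) bool {
	// New behavior: a NoExecute taint plus SkipArrayConnectionValidation=true
	// permits deletion even though the node may still be connected to the array.
	if nodeHasNoExecute && skipArrayConnectionValidation {
		return true
	}
	// Otherwise keep the original, safer behavior: abort cleanup while the
	// node still has an array connection (possible in-flight I/O).
	return !arrayConnectionActive
}
```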

alikdell commented Mar 2, 2022

This functionality will be available in the next CSM for Resiliency release.

alikdell closed this as completed on Mar 2, 2022
gallacher changed the title from "[FEATURE]: Support evacuation of pods during NoExecute taint on node" to "[FEATURE]: CSM Resiliency supports evacuation of pods during NoExecute taint on node" on Mar 20, 2022
csmbot pushed a commit that referenced this issue Aug 1, 2023
* adding troubleshooting scenario

* fixing typo