[FEATURE]: CSM Resiliency supports evacuation of pods during NoExecute taint on node #87
Comments
Hello, my analysis shows that podmon is working as designed. Please see the analysis write-up in
Closing the issue since it works as designed.
Hello Tom, I saw the response. Since the NoExecute taint was not placed, podmon did not actually migrate the pods. If the NoExecute taint had been placed, would it have migrated them?
Hello,
Thanks, Tom, for the explanation. I see that in either case, NoSchedule or NoExecute, podmon takes no action because it detects that array connectivity is established and healthy. But I feel we need to change this now, because there could be different situations in which kube-api places these taints and we end up with pods stuck in the Terminating phase. For example, if the internal Kubernetes heartbeat network goes down, we may end up in a similar situation. So a good option here would be to reboot the node when the NoExecute taint is placed on it. Who would trigger the reboot: kube-api, or can podmon do it? Please see my messages on Teams; do you have time for a quick call?
We are currently testing a change to address this and plan to ship it in the next release at the end of Q1 2022. At this time, we are not planning to automatically reboot the node to fix this, but rather to allow the pods to be deleted if we receive a NoExecute taint and the SkipArrayConnectionValidation option is set to true.
This functionality will be available in the next CSM for Resiliency release.
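
For reference, here is a minimal, hypothetical Go sketch of the behavior described above; it is not the actual podmon implementation, and the function and parameter names are assumptions for illustration. It checks whether a node carries a NoExecute taint and, only if the operator has opted in via a SkipArrayConnectionValidation-style setting, force-deletes the pods on that node so they do not remain stuck in Terminating. The client-go calls used are standard.

```go
// Illustrative sketch only; not the actual podmon code.
package evacuation

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// hasNoExecuteTaint reports whether the node carries any NoExecute taint.
func hasNoExecuteTaint(node *v1.Node) bool {
	for _, t := range node.Spec.Taints {
		if t.Effect == v1.TaintEffectNoExecute {
			return true
		}
	}
	return false
}

// evacuateIfTainted force-deletes the pods scheduled on a NoExecute-tainted
// node, but only when the operator has explicitly opted in.
func evacuateIfTainted(ctx context.Context, cs kubernetes.Interface, nodeName string, skipArrayConnectionValidation bool) error {
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	// Preserve the conservative default: do nothing unless the node is
	// NoExecute-tainted and the opt-in flag is set.
	if !skipArrayConnectionValidation || !hasNoExecuteTaint(node) {
		return nil
	}
	pods, err := cs.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return err
	}
	grace := int64(0)
	for _, p := range pods.Items {
		// A zero grace period removes the pod object immediately, which is
		// how pods stuck in Terminating are typically cleared.
		if err := cs.CoreV1().Pods(p.Namespace).Delete(ctx, p.Name, metav1.DeleteOptions{GracePeriodSeconds: &grace}); err != nil {
			fmt.Printf("failed to force-delete %s/%s: %v\n", p.Namespace, p.Name, err)
		}
	}
	return nil
}
```

The early return keeps the original, conservative behavior: without the explicit opt-in, nothing is deleted while the pods may still be doing I/O to the array.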
Describe the feature
The original Resiliency design refused to force-delete pods that are potentially doing I/O to the array. The customer's request is a valid one, although not quite as safe as Resiliency's current behavior. I would propose that we make this behavior an option, i.e.:
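
As an illustrative sketch only (not the actual podmon configuration), an opt-in of this kind, reusing the SkipArrayConnectionValidation name from the maintainer's comment above, might look like:

```go
// Illustrative sketch only; the real podmon options and defaults may differ.
package evacuation

// Options gates the force-delete behavior behind an explicit opt-in so the
// safer default (never force-delete pods that may still be doing array I/O)
// is preserved.
type Options struct {
	// SkipArrayConnectionValidation, when true, allows pods on a node with a
	// NoExecute taint to be deleted without re-checking array connectivity.
	SkipArrayConnectionValidation bool
}

// DefaultOptions keeps the original, conservative Resiliency behavior.
func DefaultOptions() Options {
	return Options{SkipArrayConnectionValidation: false}
}
```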
Additional context
This feature has been converted from a bug. Logs have been attached.
session-logs.txt