[FEATURE]: Monitor CSI Driver node pods failure in CSM for Resiliency so that pods are not scheduled on that node #145
Labels
area/csm-resiliency
Issue pertains to the CSM Resiliency module
type/feature
A feature. This label is applied to feature issues.
Milestone
Comments
eanjtab added the needs-triage (Issue requires triage) and type/bug (Something isn't working; the default label for a bug issue) labels on Dec 23, 2021.
prablr79 added the area/csi-powerflex (Issue pertains to the CSI Driver for Dell EMC PowerFlex) label on Dec 24, 2021.
hoppea2 added the area/csm-resiliency (Issue pertains to the CSM Resiliency module) label and removed the area/csi-powerflex label on Jan 4, 2022.
Updated the label to reflect the correct module. We are reviewing this and will update shortly.
hoppea2 added the type/feature-request (New feature request; the default label for a feature request issue) label and removed the type/bug label on Jan 4, 2022.
hoppea2 changed the title from "[BUG]: CSI node driver failure but pods are scheduled on that node" to "[Feature Request]: CSI node driver failure but pods are scheduled on that node" on Jan 4, 2022.
We have added this to our product roadmap. No milestone has yet been determined.
@hoppea2: If no milestone has been determined, we need to remove the associated milestone. Reflecting the correct status.
hoppea2 added the type/feature (A feature; applied to feature issues) label and removed the type/feature-request label on Feb 15, 2022.
hoppea2 changed the title from "[Feature Request]: CSI node driver failure but pods are scheduled on that node" to "[Feature]: CSI node driver failure but pods are scheduled on that node" on Feb 15, 2022.
hoppea2 changed the title from "[Feature]: CSI node driver failure but pods are scheduled on that node" to "[FEATURE]: CSI node driver failure but pods are scheduled on that node" on May 9, 2022.
This was referenced on May 23, 2022.
alikdell changed the title from "[FEATURE]: CSI node driver failure but pods are scheduled on that node" to "[FEATURE]: Monitor CSI Driver node pods failure in CSM for Resiliency so that pods are not scheduled on that node" on Jun 16, 2022.
Describe the feature
When the CSI node driver is not running on a worker node (WN), the CSI controller should detect this and taint that node so that pods are not scheduled on it.
Feature functionality
When the CSI Driver node pods are not running on a node, CSM for Resiliency should taint that node so that new pods are not scheduled on it. This prevents failures of pods that need the CSI Driver node pods to be running on that node in order to start or run. A hand-applied equivalent of the requested taint is sketched below.
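For illustration only, the requested behavior is equivalent to applying a NoSchedule taint by hand; the taint key csi.node.unhealthy is a hypothetical example, and CSM for Resiliency would choose and apply its own taint automatically:
# Illustration: manually taint a node so the scheduler avoids it
# (the taint key "csi.node.unhealthy" is a hypothetical example).
kubectl taint nodes mm-pool01-blccdmm02-wn001 csi.node.unhealthy=true:NoSchedule
# Remove the taint once the CSI node pod is healthy again (note the trailing "-").
kubectl taint nodes mm-pool01-blccdmm02-wn001 csi.node.unhealthy=true:NoSchedule-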
Customer observation:
I am not sure how to reproduce this situation exactly, but we currently have a case where the CSI node driver pods are in "CrashLoopBackOff" state on WN 01 and WN 04, yet there is no taint on those nodes. As a result, pods are scheduled on those nodes and fail because they are unable to mount the volume. Please refer to the output below:
(blccdmm02)[ccdadmin@BLcCCDADM01 custom-scripts]$ kubectl describe node mm-pool01-blccdmm02-wn001 |grep -i Taint
Taints:
(blccdmm02)[ccdadmin@BLcCCDADM01 custom-scripts]$ kubectl describe node mm-pool02-blccdmm02-wn004 |grep -i Taint
Taints:
(blccdmm02)[ccdadmin@BLcCCDADM01 custom-scripts]$
(blccdmm02)[ccdadmin@BLcCCDADM01 custom-scripts]$
(blccdmm02)[ccdadmin@BLcCCDADM01 custom-scripts]$ kubectl get pods -A |grep -i csi
kube-system csi-powerflex-controller-6b56d8fc9c-cgkvp 6/6 Running 0 3d15h
kube-system csi-powerflex-controller-6b56d8fc9c-fpr6j 6/6 Running 2 3d15h
kube-system csi-powerflex-node-5r5gq 3/3 Running 0 3d15h
kube-system csi-powerflex-node-cm6d5 3/3 Running 0 3d15h
kube-system csi-powerflex-node-dddqb 3/3 Running 0 3d15h
kube-system csi-powerflex-node-jmbh8 2/3 CrashLoopBackOff 482 2d
kube-system csi-powerflex-node-qjc84 3/3 Running 7 3d15h
kube-system csi-powerflex-node-w4hpl 2/3 CrashLoopBackOff 482 2d
kube-system csi-powerflex-node-w9lmf 3/3 Running 0 3d15h
(blccdmm02)[ccdadmin@BLcCCDADM01 custom-scripts]$
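One way to confirm which worker nodes host the failing CSI node pods (and whether those nodes carry any taint) is plain kubectl usage, not CSM-specific tooling:
# Show which node each CSI node pod runs on (-o wide adds the NODE column).
kubectl get pods -n kube-system -o wide | grep -i csi-powerflex-node
# Then check whether the affected node carries any taint.
kubectl describe node mm-pool01-blccdmm02-wn001 | grep -i Taint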
Expected behavior
The CSI controller pod should check the status of the CSI node pods and place a taint on the node if the CSI node pod is not running or has crashed. A rough sketch of that check follows.
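The sketch below expresses the expected behavior as a shell loop purely for clarity; the real logic would live in the CSM for Resiliency controller, and the label selector (app=csi-powerflex-node) and taint key (csi.node.unhealthy) are assumptions for illustration:
# Taint every node whose CSI node pod is not Ready (e.g. CrashLoopBackOff).
# The label selector and taint key are illustrative assumptions.
kubectl get pods -n kube-system -l app=csi-powerflex-node \
  -o jsonpath='{range .items[*]}{.spec.nodeName}{" "}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' |
while read node ready; do
  if [ "$ready" != "True" ]; then
    kubectl taint nodes "$node" csi.node.unhealthy=true:NoSchedule --overwrite
  fi
done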
Screenshots
NA
Logs
Attached
System Information (please complete the following information):
Additional context
I have the logs collected by the collect_log.sh script, but for some reason I am unable to attach them here. I can always share them by email; please contact me by email.