[FEATURE]: Monitor CSI Driver node pods failure in CSM for Resiliency so that pods are not scheduled on that node #145

eanjtab · 2021-12-23T19:48:19Z

Describe the feature
When the CSI node driver is not running on a worker node (WN), the CSI controller should detect this and taint that node so that pods are not scheduled on it.

Feature functionality
When the CSI Driver node pods are not running on a node, CSM for Resiliency should taint that node so that new pods are not scheduled on it. This prevents failures of pods whose startup or continued running requires the CSI Driver node pods to be running on that node.

Customer observation:

I am not sure exactly how to reproduce this situation, but we currently have CSI node driver pods in the "CrashLoopBackOff" state on WN 01 and 04, and there is no taint on those nodes. As a result, pods are scheduled on those nodes and fail because they are unable to mount the volume. Please refer to the output below:

(blccdmm02)[ccdadmin@BLcCCDADM01 custom-scripts]$ kubectl describe node mm-pool01-blccdmm02-wn001 |grep -i Taint
Taints:
(blccdmm02)[ccdadmin@BLcCCDADM01 custom-scripts]$ kubectl describe node mm-pool02-blccdmm02-wn004 |grep -i Taint
Taints:
(blccdmm02)[ccdadmin@BLcCCDADM01 custom-scripts]$
(blccdmm02)[ccdadmin@BLcCCDADM01 custom-scripts]$
(blccdmm02)[ccdadmin@BLcCCDADM01 custom-scripts]$ kubectl get pods -A |grep -i csi
kube-system csi-powerflex-controller-6b56d8fc9c-cgkvp 6/6 Running 0 3d15h
kube-system csi-powerflex-controller-6b56d8fc9c-fpr6j 6/6 Running 2 3d15h
kube-system csi-powerflex-node-5r5gq 3/3 Running 0 3d15h
kube-system csi-powerflex-node-cm6d5 3/3 Running 0 3d15h
kube-system csi-powerflex-node-dddqb 3/3 Running 0 3d15h
kube-system csi-powerflex-node-jmbh8 2/3 CrashLoopBackOff 482 2d
kube-system csi-powerflex-node-qjc84 3/3 Running 7 3d15h
kube-system csi-powerflex-node-w4hpl 2/3 CrashLoopBackOff 482 2d
kube-system csi-powerflex-node-w9lmf 3/3 Running 0 3d15h
(blccdmm02)[ccdadmin@BLcCCDADM01 custom-scripts]$
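For readers who want to spot the failing driver pods programmatically, here is a minimal sketch that parses text in the shape of the `kubectl get pods -A` output above. It assumes the six-column layout shown (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE). The helper names `parse_pod_rows` and `unhealthy_node_pods` and the `csi-powerflex-node-` prefix default are illustrative choices, not part of any Dell tooling.

```python
from typing import List, NamedTuple


class PodRow(NamedTuple):
    namespace: str
    name: str
    ready: int   # containers ready
    total: int   # containers expected
    status: str


# Rows copied from the output quoted in this issue.
SAMPLE = """\
kube-system csi-powerflex-controller-6b56d8fc9c-cgkvp 6/6 Running 0 3d15h
kube-system csi-powerflex-controller-6b56d8fc9c-fpr6j 6/6 Running 2 3d15h
kube-system csi-powerflex-node-5r5gq 3/3 Running 0 3d15h
kube-system csi-powerflex-node-cm6d5 3/3 Running 0 3d15h
kube-system csi-powerflex-node-dddqb 3/3 Running 0 3d15h
kube-system csi-powerflex-node-jmbh8 2/3 CrashLoopBackOff 482 2d
kube-system csi-powerflex-node-qjc84 3/3 Running 7 3d15h
kube-system csi-powerflex-node-w4hpl 2/3 CrashLoopBackOff 482 2d
kube-system csi-powerflex-node-w9lmf 3/3 Running 0 3d15h
"""


def parse_pod_rows(text: str) -> List[PodRow]:
    """Parse whitespace-separated pod rows; skip headers, prompts, and blanks."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Expect at least: NAMESPACE NAME READY STATUS RESTARTS AGE
        if len(fields) < 6 or "/" not in fields[2]:
            continue
        ready, total = fields[2].split("/", 1)
        if not (ready.isdigit() and total.isdigit()):
            continue
        rows.append(PodRow(fields[0], fields[1], int(ready), int(total), fields[3]))
    return rows


def unhealthy_node_pods(rows: List[PodRow],
                        name_prefix: str = "csi-powerflex-node-") -> List[str]:
    """Return names of driver node pods that are not Running or not fully ready."""
    return [
        r.name
        for r in rows
        if r.name.startswith(name_prefix)
        and (r.status != "Running" or r.ready < r.total)
    ]


if __name__ == "__main__":
    print(unhealthy_node_pods(parse_pod_rows(SAMPLE)))
```

This output does not include node names, so mapping a failing pod back to its worker node would need something like `kubectl get pods -o wide`.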

Expected behavior
The CSI controller pod should check the status of the CSI node pods and place a taint on a node if its CSI node pod is not running or has crashed.
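The requested behavior can be sketched as a small decision function. This is only an illustration of the idea, not how CSM for Resiliency implements it: the taint key `example.com/csi-node-driver-down`, the `NodePodStatus` type, and both function names are hypothetical. The only Kubernetes facts relied on are that a taint has a `key` and an `effect`, and that `NoSchedule` keeps new pods off a node.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List

# Hypothetical taint key, for illustration only; not taken from CSM for Resiliency.
EXAMPLE_TAINT_KEY = "example.com/csi-node-driver-down"


@dataclass(frozen=True)
class NodePodStatus:
    node: str    # worker node the driver node pod runs on
    pod: str     # driver node pod name
    ready: bool  # True only if every container in the pod is ready


def nodes_to_taint(statuses: Iterable[NodePodStatus],
                   expected_nodes: Iterable[str]) -> List[str]:
    """Return, sorted, the nodes that have no ready driver node pod.

    A node qualifies if its driver pod is not ready or if it has no
    driver pod at all.
    """
    ready_nodes = {s.node for s in statuses if s.ready}
    return sorted(n for n in expected_nodes if n not in ready_nodes)


def with_no_schedule_taint(existing: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return a copy of the taint list with the example taint added at most once."""
    if any(t.get("key") == EXAMPLE_TAINT_KEY for t in existing):
        return list(existing)
    return list(existing) + [{"key": EXAMPLE_TAINT_KEY, "effect": "NoSchedule"}]


if __name__ == "__main__":
    # Node names here are placeholders, not the nodes from this issue.
    statuses = [
        NodePodStatus("wn-a", "csi-powerflex-node-aaaaa", ready=False),
        NodePodStatus("wn-b", "csi-powerflex-node-bbbbb", ready=True),
    ]
    for node in nodes_to_taint(statuses, ["wn-a", "wn-b", "wn-c"]):
        print(node, with_no_schedule_taint([]))
```

Adding the taint idempotently matters because a monitor would re-evaluate the same node on every check. Removing the taint once the driver pod recovers is the natural counterpart and is left out of this sketch.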

Screenshots
NA

Logs
Attached

System Information (please complete the following information):

OS/Version: SUSE, 15-SP2
Kubernetes Version: v1.21.1

Additional context
I have the logs collected by the collect_log.sh script, but for some reason I am unable to attach them here. I can share them by email; please contact me by email.

hoppea2 · 2022-01-04T16:10:30Z

Updated the label to reflect the correct module. We are reviewing this and will update shortly.

hoppea2 · 2022-01-13T15:49:03Z

We have added this to our product roadmap. No milestone has yet been determined.

shanmydell · 2022-02-14T06:28:14Z

@hoppea2: If no milestone has been determined, we need to remove the associated milestone. Reflecting the correct status.

eanjtab added the needs-triage and type/bug labels Dec 23, 2021

prablr79 added the area/csi-powerflex label Dec 24, 2021

prablr79 assigned nb950 Dec 24, 2021

prablr79 added this to the v1.2.0 milestone Dec 24, 2021

hoppea2 added the area/csm-resiliency label and removed the area/csi-powerflex label Jan 4, 2022

hoppea2 assigned sharmarahul5 and unassigned nb950 Jan 4, 2022

hoppea2 added the type/feature-request label and removed the type/bug label Jan 4, 2022

hoppea2 changed the title ~~[BUG]: CSI node driver failure but pods are scheduled on that node~~ [Feature Request]: CSI node driver failure but pods are scheduled on that node Jan 4, 2022

hoppea2 unassigned sharmarahul5 Jan 13, 2022

shanmydell removed this from the v1.2.0 milestone Feb 14, 2022

hoppea2 added the type/feature label and removed the type/feature-request label Feb 15, 2022

hoppea2 changed the title ~~[Feature Request]: CSI node driver failure but pods are scheduled on that node~~ [Feature]: CSI node driver failure but pods are scheduled on that node Feb 15, 2022

alikdell self-assigned this Apr 28, 2022

alikdell removed the needs-triage label Apr 28, 2022

hoppea2 added the backlog label May 9, 2022

hoppea2 changed the title ~~[Feature]: CSI node driver failure but pods are scheduled on that node~~ [FEATURE]: CSI node driver failure but pods are scheduled on that node May 9, 2022

shaynafinocchiaro mentioned this issue May 17, 2022

CSI Node Driver Failure Test Script dell/karavi-resiliency#121

Merged

9 tasks

shanmydell added this to the v1.3.0 milestone May 19, 2022

Sakshi-dell mentioned this issue May 20, 2022

Add label to driver node pod for Resiliency protection dell/csi-powerscale#84

Merged

9 tasks

randeepdell mentioned this issue May 23, 2022

update-powerscale-support dell/csm-docs#226

Merged

5 tasks

This was referenced May 23, 2022

Update Resiliency documentation for 1.3 release dell/csm-docs#231

Merged

Resiliency new label added for csi driver node pods dell/csi-powerflex#99

Merged

Add integration test for CSI Driver node pod monitor dell/karavi-resiliency#125

Merged

alikdell closed this as completed May 31, 2022

alikdell changed the title ~~[FEATURE]: CSI node driver failure but pods are scheduled on that node~~ [FEATURE]: Monitor CSI Driver node pods failure in CSM for Resiliency so that pods are not scheduled on that node Jun 16, 2022
