[FEATURE]: Monitor CSI Driver node pods failure in CSM for Resiliency so that pods are not scheduled on that node #145

Closed
eanjtab opened this issue Dec 23, 2021 · 3 comments
Assignees
Labels
area/csm-resiliency Issue pertains to the CSM Resiliency module type/feature A feature. This label is applied to a feature issues.
Milestone

Comments

@eanjtab

eanjtab commented Dec 23, 2021

Describe the feature
When the CSI node driver is not running on a worker node (WN), the CSI controller should detect this and taint that node so that pods are not scheduled on it.

Feature functionality
When the CSI Driver node pod is not running on a node, CSM for Resiliency should taint that node so that new pods are not scheduled on it. This prevents failures of pods whose startup or operation depends on the CSI Driver node pod running on that node.
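A minimal sketch of the requested check, not the CSM for Resiliency implementation. It assumes pod objects shaped like the Kubernetes Pod JSON (`spec.nodeName`, `status.phase`, `status.containerStatuses[].ready`) and a caller-supplied list of nodes on which the driver is expected to run. The function names are hypothetical.

```python
from typing import Iterable, Set


def is_pod_healthy(pod: dict) -> bool:
    """Return True when the pod phase is Running and every container is ready."""
    status = pod.get("status", {})
    containers = status.get("containerStatuses", [])
    return (
        status.get("phase") == "Running"
        and bool(containers)
        and all(c.get("ready", False) for c in containers)
    )


def nodes_to_taint(driver_pods: Iterable[dict], expected_nodes: Iterable[str]) -> Set[str]:
    """Return the expected nodes that have no healthy CSI driver node pod.

    expected_nodes is an assumption of this sketch: the caller decides
    which nodes should be running the driver (for example, worker nodes only).
    """
    healthy_nodes = set()
    for pod in driver_pods:
        node = pod.get("spec", {}).get("nodeName")
        if node and is_pod_healthy(pod):
            healthy_nodes.add(node)
    return set(expected_nodes) - healthy_nodes


if __name__ == "__main__":
    pods = [
        # One container not ready, as in a 2/3 pod.
        {"spec": {"nodeName": "wn001"},
         "status": {"phase": "Running",
                    "containerStatuses": [{"ready": True}, {"ready": False}]}},
        {"spec": {"nodeName": "wn002"},
         "status": {"phase": "Running",
                    "containerStatuses": [{"ready": True}]}},
    ]
    print(sorted(nodes_to_taint(pods, ["wn001", "wn002", "wn003"])))
```

In this sketch a node with no driver pod at all is treated the same as a node whose driver pod is unhealthy, which is why the expected node list has to be chosen carefully.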

Customer observation:

I am not sure exactly how to reproduce this situation, but we currently have CSI node driver pods in the "CrashLoopBackOff" state on WN 01 and WN 04, and there is no taint on those nodes. As a result, pods are scheduled on those nodes and fail because they are unable to mount their volumes. Please refer to the output below:

(blccdmm02)[ccdadmin@BLcCCDADM01 custom-scripts]$ kubectl describe node mm-pool01-blccdmm02-wn001 |grep -i Taint
Taints:
(blccdmm02)[ccdadmin@BLcCCDADM01 custom-scripts]$ kubectl describe node mm-pool02-blccdmm02-wn004 |grep -i Taint
Taints:
(blccdmm02)[ccdadmin@BLcCCDADM01 custom-scripts]$
(blccdmm02)[ccdadmin@BLcCCDADM01 custom-scripts]$
(blccdmm02)[ccdadmin@BLcCCDADM01 custom-scripts]$ kubectl get pods -A |grep -i csi
kube-system   csi-powerflex-controller-6b56d8fc9c-cgkvp   6/6   Running            0     3d15h
kube-system   csi-powerflex-controller-6b56d8fc9c-fpr6j   6/6   Running            2     3d15h
kube-system   csi-powerflex-node-5r5gq                    3/3   Running            0     3d15h
kube-system   csi-powerflex-node-cm6d5                    3/3   Running            0     3d15h
kube-system   csi-powerflex-node-dddqb                    3/3   Running            0     3d15h
kube-system   csi-powerflex-node-jmbh8                    2/3   CrashLoopBackOff   482   2d
kube-system   csi-powerflex-node-qjc84                    3/3   Running            7     3d15h
kube-system   csi-powerflex-node-w4hpl                    2/3   CrashLoopBackOff   482   2d
kube-system   csi-powerflex-node-w9lmf                    3/3   Running            0     3d15h
(blccdmm02)[ccdadmin@BLcCCDADM01 custom-scripts]$
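For triage, output like the listing above can be filtered for driver node pods that are not fully ready. This is a rough sketch that assumes the whitespace-separated column order shown here (namespace, name, ready, status); the function name and the default name prefix are illustrative. The listing does not show which node each pod runs on; `kubectl get pods -o wide` adds a NODE column for that.

```python
from typing import List, Tuple


def unhealthy_pods(kubectl_output: str,
                   name_prefix: str = "csi-powerflex-node-") -> List[Tuple[str, str, str]]:
    """Return (namespace, name, status) for matching pods that are not fully ready.

    Lines that do not look like pod rows (shell prompts, headers, blank
    lines) are skipped because their second field lacks the name prefix.
    """
    result = []
    for line in kubectl_output.splitlines():
        fields = line.split()
        if len(fields) < 4 or not fields[1].startswith(name_prefix):
            continue
        namespace, name, ready, status = fields[:4]
        ready_count, _, total = ready.partition("/")
        if status != "Running" or ready_count != total:
            result.append((namespace, name, status))
    return result


if __name__ == "__main__":
    sample = """\
kube-system csi-powerflex-controller-6b56d8fc9c-cgkvp 6/6 Running 0 3d15h
kube-system csi-powerflex-node-5r5gq 3/3 Running 0 3d15h
kube-system csi-powerflex-node-jmbh8 2/3 CrashLoopBackOff 482 2d
kube-system csi-powerflex-node-w4hpl 2/3 CrashLoopBackOff 482 2d
"""
    for namespace, name, status in unhealthy_pods(sample):
        print(namespace, name, status)
```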

Expected behavior
The CSI controller pod should check the status of the CSI node pods and place a taint on a node if its CSI node pod is not running or has crashed.
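One way to picture the expected behavior is as a reconciliation of a node's taint list. The sketch below is an illustration under assumptions, not the actual design: the taint key `example.com/csi-node-driver-unavailable` is hypothetical, while the `key`/`effect` fields and the `NoSchedule` effect follow the Kubernetes taint format. It adds the taint while the driver pod is unhealthy and removes it once the pod recovers, leaving unrelated taints untouched.

```python
from typing import Dict, List

# Hypothetical taint key, used only for illustration in this sketch.
DRIVER_TAINT: Dict[str, str] = {
    "key": "example.com/csi-node-driver-unavailable",
    "effect": "NoSchedule",
}


def reconcile_taints(existing: List[Dict[str, str]],
                     driver_healthy: bool,
                     taint: Dict[str, str] = DRIVER_TAINT) -> List[Dict[str, str]]:
    """Return the taint list the node should have, given driver pod health.

    Taints with a different key or effect are preserved as they are.
    """
    others = [
        t for t in existing
        if (t.get("key"), t.get("effect")) != (taint["key"], taint["effect"])
    ]
    if driver_healthy:
        return others                 # driver recovered: drop our taint
    return others + [dict(taint)]     # driver down: ensure exactly one copy


if __name__ == "__main__":
    current = [{"key": "dedicated", "value": "db", "effect": "NoSchedule"}]
    print(reconcile_taints(current, driver_healthy=False))
    print(reconcile_taints(current, driver_healthy=True))
```

Done by hand, the equivalent operation is `kubectl taint nodes <node> <key>=<value>:NoSchedule`, with a trailing `-` on the taint to remove it.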

Screenshots
NA

Logs
Attached

System Information (please complete the following information):

  • OS/Version: SUSE, 15-SP2
  • Kubernetes Version: v1.21.1

Additional context
I have the logs collected by the collect_log.sh script, but for some reason I am unable to attach them here. I can share them by email; please contact me.

@eanjtab eanjtab added needs-triage Issue requires triage. type/bug Something isn't working. This is the default label associated with a bug issue. labels Dec 23, 2021
@prablr79 prablr79 added the area/csi-powerflex Issue pertains to the CSI Driver for Dell EMC PowerFlex label Dec 24, 2021
@prablr79 prablr79 added this to the v1.2.0 milestone Dec 24, 2021
@hoppea2 hoppea2 added area/csm-resiliency Issue pertains to the CSM Resiliency module and removed area/csi-powerflex Issue pertains to the CSI Driver for Dell EMC PowerFlex labels Jan 4, 2022
@hoppea2
Collaborator

hoppea2 commented Jan 4, 2022

Updated the label to reflect the correct module. We are reviewing this and will update shortly.

@hoppea2 hoppea2 assigned sharmarahul5 and unassigned nb950 Jan 4, 2022
@hoppea2 hoppea2 added type/feature-request New feature request. This is the default label associated with a feature request issue. and removed type/bug Something isn't working. This is the default label associated with a bug issue. labels Jan 4, 2022
@hoppea2 hoppea2 changed the title [BUG]: CSI node driver failure but pods are scheduled on that node [Feature Request]: CSI node driver failure but pods are scheduled on that node Jan 4, 2022
@hoppea2
Collaborator

hoppea2 commented Jan 13, 2022

We have added this to our product roadmap. No milestone has yet been determined.

@shanmydell
Collaborator

@hoppea2: If no milestone has been determined, we need to remove the associated milestone. Updating the issue to reflect the correct status.

@shanmydell shanmydell removed this from the v1.2.0 milestone Feb 14, 2022
@hoppea2 hoppea2 added type/feature A feature. This label is applied to a feature issues. and removed type/feature-request New feature request. This is the default label associated with a feature request issue. labels Feb 15, 2022
@hoppea2 hoppea2 changed the title [Feature Request]: CSI node driver failure but pods are scheduled on that node [Feature]: CSI node driver failure but pods are scheduled on that node Feb 15, 2022
@alikdell alikdell self-assigned this Apr 28, 2022
@alikdell alikdell removed the needs-triage Issue requires triage. label Apr 28, 2022
@hoppea2 hoppea2 added the backlog label May 9, 2022
@hoppea2 hoppea2 changed the title [Feature]: CSI node driver failure but pods are scheduled on that node [FEATURE]: CSI node driver failure but pods are scheduled on that node May 9, 2022
@shanmydell shanmydell added this to the v1.3.0 milestone May 19, 2022
@alikdell alikdell changed the title [FEATURE]: CSI node driver failure but pods are scheduled on that node [FEATURE]: Monitor CSI Driver node pods failure in CSM for Resiliency so that pods are not scheduled on that node Jun 16, 2022
7 participants