
[BUG]: Force delete of pod kicks in late (pod in terminating state for a while) #148

Closed
eanjtab opened this issue Jan 4, 2022 · 5 comments
Labels: area/csm-resiliency (Issue pertains to the CSM Resiliency module), type/bug (Something isn't working. This is the default label associated with a bug issue.)


eanjtab commented Jan 4, 2022

Describe the bug
When testing a node reboot with a real application, we observed the two pods below stuck in Terminating state for quite some time; they were finally deleted only after the CSI node pod came back up on that worker node, mm-pool01-blccdmm01-wn001.

eric-bss-em-fm-datbel01fop01-ftp-6dfcb5cfbc-s7n64 0/2 Terminating 0 7m32s mm-pool01-blccdmm01-wn001
eric-bss-em-fm-datbel01nef01-5f6868cbf4-q5spf 0/4 Terminating 0 12m mm-pool01-blccdmm01-wn001

  1. We wanted to clarify: after the force delete is triggered by CSI (which we saw in the logs), does the CSI node pod need to be running to complete the deletion (release and unmount volumes)?
  2. What about a WN shutdown, in which case the CSI node pod will never be running on that node again? Will force delete work here?
  3. Did you test a total WN shutdown scenario?
  4. Is there a specific threshold after which the force delete is triggered?

time="Fri, 17 Dec 2021 19:55:21 UTC" level=info msg="Deleting pod cembel01/eric-bss-em-fm-datbel01fop01-657c646df6-fmlqb force true"
time="Fri, 17 Dec 2021 19:56:22 UTC" level=info msg="Deleting pod cembel01/eric-bss-em-fm-datbel01nef01-5f6868cbf4-q5spf force true"
time="Fri, 17 Dec 2021 19:56:23 UTC" level=info msg="Deleting pod cembel01/eric-bss-em-fm-datbel01fop01-ftp-6dfcb5cfbc-s7n64 force true"
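For context on the "force true" in these log lines: a Kubernetes force delete is typically a delete issued with a zero grace period (the equivalent of kubectl delete pod --force --grace-period=0), which removes the pod object without waiting for the kubelet on the unreachable node to confirm termination. A minimal client-go sketch of that call (an illustration only, not podmon's actual code):

```go
package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// forceDeletePod issues a pod delete with GracePeriodSeconds set to 0,
// so the API server removes the pod object immediately rather than
// waiting for graceful termination on the (possibly dead) node.
func forceDeletePod(ctx context.Context, cs kubernetes.Interface, namespace, name string) error {
	gracePeriod := int64(0)
	return cs.CoreV1().Pods(namespace).Delete(ctx, name, metav1.DeleteOptions{
		GracePeriodSeconds: &gracePeriod,
	})
}
```

Even after the API object is force deleted, the volumes still have to be unmounted and detached from the old node before a replacement pod can start elsewhere, which is where the CSI driver comes back into the picture.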

To Reproduce
Steps to reproduce the behavior:

  1. Start about 50 pods with heavy IO, a mix of pods sharing volumes with a pod affinity rule (see the sketch after this list).
  2. Restart the WN.
  3. Observe that a few pods may still be stuck in Terminating state, and that the force delete completes only after the CSI node pod is up and running again on the WN.
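As a rough illustration of step 1, here is a minimal client-go sketch of one such IO-heavy pod. The names (io-load-0, io-load-pvc-0) are hypothetical placeholders, the PVC is assumed to be pre-provisioned on the CSI driver, and the pod affinity rules used in the real test are omitted for brevity:

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location (illustration only).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// One IO-heavy pod: writes continuously to a CSI-backed PVC.
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "io-load-0", Namespace: "default"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "writer",
				Image: "busybox",
				Command: []string{"sh", "-c",
					"while true; do dd if=/dev/zero of=/data/blob bs=1M count=256; sync; done"},
				VolumeMounts: []corev1.VolumeMount{{Name: "data", MountPath: "/data"}},
			}},
			Volumes: []corev1.Volume{{
				Name: "data",
				VolumeSource: corev1.VolumeSource{
					PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
						ClaimName: "io-load-pvc-0", // hypothetical, pre-provisioned PVC
					},
				},
			}},
		},
	}
	if _, err := cs.CoreV1().Pods(pod.Namespace).Create(
		context.TODO(), pod, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```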

Expected behavior
All the pods should be cleaned up successfully and should not be stuck in Terminating state for a long time.

Screenshots
NA

Logs
Shared by mail with Tom Watson.

System Information (please complete the following information):
OS/Version: SUSE
Kubernetes: 1.21
Logs attached.


eanjtab added the needs-triage and type/bug labels on Jan 4, 2022
hoppea2 added the area/csm-resiliency label on Jan 6, 2022
alikdell removed the needs-triage label on Jan 6, 2022

alikdell commented Jan 6, 2022

Here is what we found from the logs:

  1. Podmon initiates pod cleanup.
  2. It finds that the node's connected status is false.
  3. Podmon reports two errors:
     - Unable to find the pod: level=error msg="Unable to get pod…. not found"
     - Unable to delete the VolumeAttachment: error msg="Couldn't delete VolumeAttachment…. not found"
  4. Because of the above errors, Podmon didn't issue the delete pod command, hence we see the pod still in Terminating state.

Next steps:
Analyze why the VolumeAttachment delete returns errors. Also, even though the delete returns an error, there appear to be no more volumes left to detach; this also needs to be analyzed.
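A plausible shape for such a fix (a sketch only, assuming the standard client-go/apimachinery APIs; this is not the actual karavi-resiliency code): when cleanup races with Kubernetes' own garbage collection, a NotFound error on the pod or on a VolumeAttachment means the object is already gone, which is the desired end state, so it should count as success rather than abort the force delete.

```go
package example

import (
	"context"

	"k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cleanupVolumeAttachment is a hypothetical helper: a NotFound error on the
// delete means the VolumeAttachment was already removed (e.g. by the
// attach/detach controller), so it is treated as success instead of an
// error that would block the subsequent pod force delete.
func cleanupVolumeAttachment(ctx context.Context, cs kubernetes.Interface, vaName string) error {
	err := cs.StorageV1().VolumeAttachments().Delete(ctx, vaName, metav1.DeleteOptions{})
	if errors.IsNotFound(err) {
		return nil // already gone; nothing left to detach
	}
	return err
}
```

The same errors.IsNotFound check would apply to the "Unable to get pod" error: a pod that is already gone needs no further cleanup before proceeding.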


eanjtab commented Jan 14, 2022

Hello Team,

Is there any progress on the analysis of this issue?


hoppea2 commented Jan 17, 2022

@eanjtab the team is looking at this and will update the defect.

rbo54 commented Jan 18, 2022

I have a fix for the "Not able to delete volume attachment" that will be in the next release. I believe that will fix the issue.

alikdell commented

The fix for this issue, pull request dell/karavi-resiliency#102, has been merged.
