
[BUG]: Force delete of pod kicks in late (pod in terminating state for a while) #148

Closed
eanjtab opened this issue Jan 4, 2022 · 5 comments
Labels: area/csm-resiliency (Issue pertains to the CSM Resiliency module), type/bug (Something isn't working. This is the default label associated with a bug issue.)


eanjtab commented Jan 4, 2022

Describe the bug
When testing a node reboot with a real application, we observed the two pods below stuck in Terminating state for quite some time; they were finally deleted only after the CSI node pod came back up on that worker node, mm-pool01-blccdmm01-wn001.

eric-bss-em-fm-datbel01fop01-ftp-6dfcb5cfbc-s7n64 0/2 Terminating 0 7m32s mm-pool01-blccdmm01-wn001
eric-bss-em-fm-datbel01nef01-5f6868cbf4-q5spf 0/4 Terminating 0 12m mm-pool01-blccdmm01-wn001

  1. We wanted to clarify: after the force delete is triggered by CSI (which we saw in the logs), does the CSI node pod need to be running to complete the deletion (release and unmount volumes)?
  2. What about a WN shutdown, in which case the CSI node pod will never be running on that node again? Will force delete work here?
  3. Did you test a total WN shutdown scenario?
  4. Is there a specific threshold after which the force delete is triggered?

time="Fri, 17 Dec 2021 19:55:21 UTC" level=info msg="Deleting pod cembel01/eric-bss-em-fm-datbel01fop01-657c646df6-fmlqb force true"
time="Fri, 17 Dec 2021 19:56:22 UTC" level=info msg="Deleting pod cembel01/eric-bss-em-fm-datbel01nef01-5f6868cbf4-q5spf force true"
time="Fri, 17 Dec 2021 19:56:23 UTC" level=info msg="Deleting pod cembel01/eric-bss-em-fm-datbel01fop01-ftp-6dfcb5cfbc-s7n64 force true"
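For context on the "force true" in these log lines: a Kubernetes force delete is typically a delete issued with a zero grace period (the equivalent of kubectl delete pod --force --grace-period=0), which removes the pod object without waiting for the kubelet on the unreachable node to confirm termination. A minimal client-go sketch of that call (an illustration only, not podmon's actual code):

```go
package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// forceDeletePod issues a pod delete with GracePeriodSeconds set to 0,
// so the API server removes the pod object immediately rather than
// waiting for graceful termination on the (possibly dead) node.
func forceDeletePod(ctx context.Context, cs kubernetes.Interface, namespace, name string) error {
	gracePeriod := int64(0)
	return cs.CoreV1().Pods(namespace).Delete(ctx, name, metav1.DeleteOptions{
		GracePeriodSeconds: &gracePeriod,
	})
}
```

Even after the API object is force deleted, the volumes still have to be unmounted and detached from the old node before a replacement pod can start elsewhere, which is where the CSI driver comes back into the picture.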

To Reproduce
Steps to reproduce the behavior:

  1. Start about 50 pods with heavy IO, a mix of pods sharing volumes with a pod affinity rule (see the sketch after this list).
  2. Restart the WN.
  3. Observe that a few pods may still be stuck in Terminating state, and that the force delete completes only after the CSI node pod is up and running again on the WN.
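As a rough illustration of step 1, here is a minimal client-go sketch of one such IO-heavy pod. The names (io-load-0, io-load-pvc-0) are hypothetical placeholders, the PVC is assumed to be pre-provisioned on the CSI driver, and the pod affinity rules used in the real test are omitted for brevity:

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location (illustration only).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// One IO-heavy pod: writes continuously to a CSI-backed PVC.
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "io-load-0", Namespace: "default"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "writer",
				Image: "busybox",
				Command: []string{"sh", "-c",
					"while true; do dd if=/dev/zero of=/data/blob bs=1M count=256; sync; done"},
				VolumeMounts: []corev1.VolumeMount{{Name: "data", MountPath: "/data"}},
			}},
			Volumes: []corev1.Volume{{
				Name: "data",
				VolumeSource: corev1.VolumeSource{
					PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
						ClaimName: "io-load-pvc-0", // hypothetical, pre-provisioned PVC
					},
				},
			}},
		},
	}
	if _, err := cs.CoreV1().Pods(pod.Namespace).Create(
		context.TODO(), pod, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```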

Expected behavior
All the pods should be cleaned up successfully and should not be stuck in Terminating state for a long time.

Screenshots
NA

Logs
Shared by mail with Tom Watson.

System Information (please complete the following information):
OS/Version: SUSE
Kubernetes: 1.21
Logs attached.


eanjtab added the needs-triage and type/bug labels on Jan 4, 2022
hoppea2 added the area/csm-resiliency label on Jan 6, 2022
alikdell removed the needs-triage label on Jan 6, 2022

alikdell commented Jan 6, 2022

Here is what we found from the logs:

  1. Podmon initiates pod cleanup.
  2. It finds that the node's connected status is false.
  3. Podmon reports two errors:
     - Unable to find the pod: level=error msg="Unable to get pod…. not found"
     - Unable to delete the VolumeAttachment: error msg="Couldn't delete VolumeAttachment…. not found"
  4. Because of the above errors, Podmon didn't issue the delete pod command, hence we see the pod still in Terminating state.

Next steps:
Analyze why the VolumeAttachment delete returns errors. Also, even though the delete returns an error, there appear to be no more volumes left to detach; this also needs to be analyzed.
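A plausible shape for such a fix (a sketch only, assuming the standard client-go/apimachinery APIs; this is not the actual karavi-resiliency code): when cleanup races with Kubernetes' own garbage collection, a NotFound error on the pod or on a VolumeAttachment means the object is already gone, which is the desired end state, so it should count as success rather than abort the force delete.

```go
package example

import (
	"context"

	"k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cleanupVolumeAttachment is a hypothetical helper: a NotFound error on the
// delete means the VolumeAttachment was already removed (e.g. by the
// attach/detach controller), so it is treated as success instead of an
// error that would block the subsequent pod force delete.
func cleanupVolumeAttachment(ctx context.Context, cs kubernetes.Interface, vaName string) error {
	err := cs.StorageV1().VolumeAttachments().Delete(ctx, vaName, metav1.DeleteOptions{})
	if errors.IsNotFound(err) {
		return nil // already gone; nothing left to detach
	}
	return err
}
```

The same errors.IsNotFound check would apply to the "Unable to get pod" error: a pod that is already gone needs no further cleanup before proceeding.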


eanjtab commented Jan 14, 2022

Hello Team,

Is there any progress on the analysis of this issue?


hoppea2 commented Jan 17, 2022

@eanjtab the team is looking at this and will update the defect.

rbo54 commented Jan 18, 2022

I have a fix for the "Not able to delete volume attachment" that will be in the next release. I believe that will fix the issue.

alikdell commented

The fix for this issue, pull request dell/karavi-resiliency#102, has been merged.
