
[BUG]: Container is terminated but Pod is stuck in terminating #146

Closed
hiteshmathur19 opened this issue Dec 28, 2021 · 7 comments
Labels
area/csi-powerflex Issue pertains to the CSI Driver for Dell EMC PowerFlex area/csm-resiliency Issue pertains to the CSM Resiliency module type/bug Something isn't working. This is the default label associated with a bug issue.

@hiteshmathur19

Describe the bug
On deleting a pod that uses the PowerFlex CSI driver to mount a volume, the pod is stuck in Terminating forever.

To Reproduce
Steps to reproduce the behavior:
kubectl get po -A -o wide | grep Termin
chf-apps csdpbl002-0 0/1 Terminating 0

kubectl get pvc -A | grep csdpbl002-0
chf-apps pv-fds-csdpbl002-0 Bound ccd-af9659f2b3

On creation of the pod:
MountVolume.WaitForAttach succeeded for volume "ccd-af9659f2b3"
device mount path "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/ccd-af9659f2b3/globalmount"

On deleting the pod:
/volume-subpaths/ccd-af9659f2b3/sdp/0" is not a mountpoint, deleting

But the mount point couldn't be unmounted, since it was still in use by the application:
umount: /var/lib/kubelet/pods/69fd83d4-eb2c-4eaf-aa2e-8436319070f6/volumes/kubernetes.io~csi/ccd-af9659f2b3/mount: target is busy.
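A "target is busy" error means some process still holds files open under the mount, so the kubelet cannot unmount it. One way to confirm this on the node is sketched below (an illustrative script, not from this report); the mount path argument would be the kubelet path from the log above, and the script falls back to "/" only so it can be smoke-tested anywhere.

```shell
# Stuck CSI mount path from the kubelet log, e.g.
# /var/lib/kubelet/pods/69fd83d4-eb2c-4eaf-aa2e-8436319070f6/volumes/kubernetes.io~csi/ccd-af9659f2b3/mount
# Falls back to "/" so the script can be exercised on any Linux box.
MNT="${1:-/}"

# Is the path still registered as a mountpoint?
if mountpoint -q "$MNT"; then
  echo "$MNT is a mountpoint"
else
  echo "$MNT is not a mountpoint"
fi

# Show the mount entry (source device, fstype, options).
findmnt --target "$MNT"

# List processes keeping the mount busy -- these are what cause
# "umount: target is busy". (fuser ships in the psmisc package.)
fuser -vm "$MNT" 2>/dev/null || true
```

Stopping or killing the processes fuser reports should let the unmount, and therefore the pod termination, proceed.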

After some time the volume did get detached, but the pod remained stuck in Terminating.
Volume detached for volume "ccd-af9659f2b3"

After restarting the kubelet:

operationExecutor.UnmountVolume started for volume "ccd-af9659f2b3"
UnmountVolume.TearDown succeeded for volume "kubernetes.io/csi/csi-vxflexos.dellemc.com^62ce8cd81be7f10f-eaf5ab8a00000502"
failed to open volume data file [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/ccd-af9659f2b3/vol_data.json]: open /var/lib/kubelet/plugins/kubernetes.io/csi/pv/ccd-af9659f2b3/vol_data.json: no such file or directory

Only those pods that hit the "target is busy" error show this problem of being stuck in Terminating.

Expected behavior
On deletion, the pod should terminate and be removed successfully.
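For reference, a common interim workaround (not a fix for the underlying unmount failure) is to force-remove the stuck pod object once the volume is confirmed detached. The namespace and pod name below are taken from the reproduce output above; the kubectl check makes the snippet a no-op off-cluster.

```shell
# Namespace and pod name from the reproduce output in this issue.
NS=chf-apps
POD=csdpbl002-0

# Guarded so the snippet is harmless where kubectl or the cluster
# is unavailable.
if command -v kubectl >/dev/null 2>&1; then
  kubectl delete pod "$POD" -n "$NS" --grace-period=0 --force \
    || echo "force delete failed (no cluster access?)"
else
  echo "kubectl not found; run this from a machine with cluster access"
fi
```

Note that force deletion only removes the API object; the kubelet may still need manual mount cleanup on the node.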

Screenshots
If applicable, add screenshots to help explain your problem.

Logs
If applicable, submit logs or stack traces from the affected services

System Information (please complete the following information):

  • OS/Version: SUSE
  • Kubernetes 1.21
  • Attaching the logs

Additional context
Add any other context about the problem here.
driver.logs.20211227_0335.zip
kubelet.zip

@hiteshmathur19 hiteshmathur19 added needs-triage Issue requires triage. type/bug Something isn't working. This is the default label associated with a bug issue. labels Dec 28, 2021
@randeepdell randeepdell added the area/csi-powerflex Issue pertains to the CSI Driver for Dell EMC PowerFlex label Dec 29, 2021
@hiteshmathur19
Author

Any update on this, please?

@prablr79
Collaborator

prablr79 commented Jan 4, 2022

@bharathsreekanth customer is looking for updates here.

@bharathsreekanth
Contributor

I will have a look at this today and discuss with the team and get back. Thanks!

@nb950

nb950 commented Jan 4, 2022

As per the node logs, volume unpublish reports "target is busy". Assuming this is accurate, are you shutting down the pod while the application is still running?
Next, the driver node log shows "volume XX is not published to node":
time="2021-12-21T16:16:14Z" level=info msg="SDC returned volume eaf5ab8a00000502 on system 62ce8cd81be7f10f not published to node"
The controller unpublish could have done this, but those logs are missing. Please send us the driver controller logs.
Also note the podmon logs are from 12/27 while the driver logs are from 12/21.

Can you help retest and capture all the logs immediately when the issue occurs?
Is there anything different about the app in the pod? Can you share any info? Could it cause the node network to appear frozen, which in turn forces podmon to behave differently?
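To make that capture easier, a minimal collection script might look like the sketch below. The chf-apps namespace and pod name come from this report; the vxflexos namespace and the app=vxflexos-node / app=vxflexos-controller label selectors are assumptions about the driver deployment and should be adjusted to match yours.

```shell
# Collect pod state, driver logs, and kubelet logs into one directory.
OUT="csm-logs-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$OUT"

if command -v kubectl >/dev/null 2>&1; then
  # State of the stuck pod (names from this report).
  kubectl -n chf-apps get pod csdpbl002-0 -o yaml > "$OUT/pod.yaml" || true
  # Driver logs; namespace and label selectors are assumptions --
  # adjust them to your PowerFlex driver deployment.
  kubectl -n vxflexos logs -l app=vxflexos-node --all-containers > "$OUT/node-driver.log" || true
  kubectl -n vxflexos logs -l app=vxflexos-controller --all-containers > "$OUT/controller-driver.log" || true
fi

# Kubelet logs from the node hosting the pod (requires systemd).
journalctl -u kubelet --since "-1h" > "$OUT/kubelet.log" 2>/dev/null || true

echo "logs collected under $OUT"
```

Running this on the affected node as soon as a pod wedges keeps the podmon, driver, and kubelet logs from the same time window.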

@bharathsreekanth
Contributor

Based on the last update, the podmon issue needs to be triaged first. Assigning to @alikdell to triage.

@alikdell alikdell removed the needs-triage Issue requires triage. label Jan 10, 2022
@alikdell
Contributor

alikdell commented Jan 11, 2022

Sent an email requesting that the customer reproduce the issue and send us fresh logs.
The customer agreed to provide a new set of logs whenever they are able to reproduce the issue at their end.

@hoppea2 hoppea2 added area/csm-resiliency Issue pertains to the CSM Resiliency module area/csi-powerflex Issue pertains to the CSI Driver for Dell EMC PowerFlex and removed area/csi-powerflex Issue pertains to the CSI Driver for Dell EMC PowerFlex labels Feb 16, 2022
@alikdell
Contributor

alikdell commented Mar 2, 2022

Not able to reproduce; closing the issue for now.
If the issue is observed again in the future, this ticket will be reopened.

@alikdell alikdell closed this as completed Mar 2, 2022
@medegw01 medegw01 added this to the v1.2.0 milestone Mar 9, 2022