Volume can't be detached from deleted Node/VM #221

Closed
yvespp opened this issue May 15, 2020 · 21 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

yvespp commented May 15, 2020

Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug

What happened:
After draining and deleting a Node, a pod is stuck in the ContainerCreating phase because its volume can't be mounted. Event: Unable to attach or mount volumes
Logs of the controller show that it can't detach the volume from the deleted VM (kubedev-worker-2bcf0b684616):

kubectl.exe logs -f vsphere-csi-controller-0 --all-containers --tail 30
I0515 11:18:14.322186       1 controller.go:198] Started VA processing "csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8"
I0515 11:18:14.322258       1 csi_handler.go:209] CSIHandler: processing VA "csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8"
I0515 11:18:14.322267       1 csi_handler.go:260] Starting detach operation for "csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8"
I0515 11:18:14.322336       1 csi_handler.go:267] Detaching "csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8"
I0515 11:18:14.322365       1 csi_handler.go:699] Can't get CSINode kubedev-worker-2bcf0b684616.kubernetes.mycorp.test: csinode.storage.k8s.io "kubedev-worker-2bcf0b684616.kubernetes.mycorp.test" not found
I0515 11:18:14.323158       1 csi_handler.go:575] Saving detach error to "csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8"
I0515 11:18:14.329068       1 controller.go:158] Ignoring VolumeAttachment "csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8" change
I0515 11:18:14.329599       1 csi_handler.go:586] Saved detach error to "csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8"
I0515 11:18:14.329626       1 csi_handler.go:219] Error processing "csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8": failed to detach: rpc error: code = Internal desc = Failed to find VirtualMachine for node:"kubedev-worker-2bcf0b684616.kubernetes.mycorp.test". Error: node wasn't found
E0515 11:18:14.322833       1 manager.go:129] Node not found with nodeName kubedev-worker-2bcf0b684616.kubernetes.mycorp.test
E0515 11:18:14.322963       1 controller.go:312] Failed to find VirtualMachine for node:"kubedev-worker-2bcf0b684616.kubernetes.mycorp.test". Error: node wasn't found

There is no Node, CSINode or VM with that name anymore.
In the details of that PVC in the vSphere Container Volumes view, no VM is shown.

What you expected to happen:
The volume gets detached from the deleted node so it can be attached to the new node and the pod can start.

How to reproduce it (as minimally and precisely as possible):
Not sure; maybe the app didn't shut down correctly and the volume couldn't be detached before the node and VM were deleted.

Anything else we need to know?:
Where is this state stored, and can the PV be detached manually somehow?
I tried recreating the PVC to reattach the same PV and restarting the controller pod, but it didn't help.
Is there something I can do during the VM drain/shutdown to avoid this?

Environment:

  • csi-vsphere version: 1.0.2
  • vsphere-cloud-controller-manager version: 1.0.2
  • Kubernetes version: v1.17.5
  • vSphere version: 6.7u3
  • OS (e.g. from /etc/os-release): Ubuntu 18.04.4
  • Kernel (e.g. uname -a): 5.3.0-51-generic
  • Install tools: -
  • Others: -
k8s-ci-robot added the kind/bug label May 15, 2020

yvespp commented May 28, 2020

I have now deleted the PVC and it disappeared from the CNS UI in vSphere. The controller still tried to find the deleted node and detach the volume:

I0528 09:18:06.580392       1 csi_handler.go:209] CSIHandler: processing VA "csi-b3ba252d84e0fbcf3960ccc3a15ea762ac3a9264a83c1f01302812b390b61c03"
I0528 09:18:06.580558       1 csi_handler.go:225] CSIHandler: finished processing "csi-3705767c02cf5776301c78cda7caea782de1de082e3ec6a9ffeeed01195d2eab"
I0528 09:18:06.580550       1 csi_handler.go:699] Can't get CSINode kubedev-worker-2bcf0b684616.kubernetes.mycorp.test: csinode.storage.k8s.io "kubedev-worker-2bcf0b684616.kubernetes.mycorp.test" not found
I0528 09:18:06.580565       1 csi_handler.go:231] "csi-b3ba252d84e0fbcf3960ccc3a15ea762ac3a9264a83c1f01302812b390b61c03" is already attached
I0528 09:18:06.580577       1 csi_handler.go:225] CSIHandler: finished processing "csi-b3ba252d84e0fbcf3960ccc3a15ea762ac3a9264a83c1f01302812b390b61c03"
I0528 09:18:06.580531       1 csi_handler.go:231] "csi-3ba1436638c8e8002ab30b3443a84a1742ff10d452be74166f13342819893dd2" is already attached
I0528 09:18:06.580610       1 csi_handler.go:225] CSIHandler: finished processing "csi-3ba1436638c8e8002ab30b3443a84a1742ff10d452be74166f13342819893dd2"
I0528 09:18:06.580339       1 controller.go:198] Started VA processing "csi-d8d1c36986029288c0a1b9a223967f15017776782f6383953a65e4908d383b3f"
I0528 09:18:06.580633       1 csi_handler.go:209] CSIHandler: processing VA "csi-d8d1c36986029288c0a1b9a223967f15017776782f6383953a65e4908d383b3f"
I0528 09:18:06.580650       1 csi_handler.go:231] "csi-d8d1c36986029288c0a1b9a223967f15017776782f6383953a65e4908d383b3f" is already attached
I0528 09:18:06.580662       1 csi_handler.go:225] CSIHandler: finished processing "csi-d8d1c36986029288c0a1b9a223967f15017776782f6383953a65e4908d383b3f"
I0528 09:18:06.581737       1 csi_handler.go:575] Saving detach error to "csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8"
I0528 09:18:06.587437       1 controller.go:158] Ignoring VolumeAttachment "csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8" change
I0528 09:18:06.587889       1 csi_handler.go:586] Saved detach error to "csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8"
I0528 09:18:06.587933       1 csi_handler.go:219] Error processing "csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8": failed to detach: rpc error: code = Internal desc = Failed to find VirtualMachine for node:"kubedev-worker-2bcf0b684616.kubernetes.mycorp.test". Error: node wasn't found

I then removed the finalizer from the PV, but now the controller can't find the PV anymore:

I0528 09:26:18.361216       1 controller.go:198] Started VA processing "csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8"
I0528 09:26:18.361284       1 csi_handler.go:209] CSIHandler: processing VA "csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8"
I0528 09:26:18.361297       1 csi_handler.go:260] Starting detach operation for "csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8"
I0528 09:26:18.361393       1 csi_handler.go:267] Detaching "csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8"
I0528 09:26:18.361422       1 csi_handler.go:575] Saving detach error to "csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8"
I0528 09:26:18.367969       1 controller.go:158] Ignoring VolumeAttachment "csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8" change
I0528 09:26:18.368734       1 csi_handler.go:586] Saved detach error to "csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8"
I0528 09:26:18.368769       1 csi_handler.go:219] Error processing "csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8": failed to detach: persistentvolume "pvc-bd39a603-ce12-4f42-b41b-77f31ad407b1" not found

Where is this information still stored? Can I clean this up somehow?


yvespp commented May 28, 2020

Every two minutes the kube-controller-manager logs this:

W0528 09:43:31.762117       1 reconciler.go:206] attacherDetacher.DetachVolume started for volume "pvc-bd39a603-ce12-4f42-b41b-77f31ad407b1" (UniqueName: "kubernetes.io/csi/csi.vsphere.vmware.com^fd853c26-0d37-4c7b-b927-ab05319194db") on node "kubedev-worker-2bcf0b684616.kubernetes.mycorp.test" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching
E0528 09:43:31.771641       1 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/csi.vsphere.vmware.com^fd853c26-0d37-4c7b-b927-ab05319194db podName: nodeName:}" failed. No retries permitted until 2020-05-28 09:45:33.771608868 +0000 UTC m=+1362976.079051531 (durationBeforeRetry 2m2s). Error: "DetachVolume.Detach failed for volume \"pvc-bd39a603-ce12-4f42-b41b-77f31ad407b1\" (UniqueName: \"kubernetes.io/csi/csi.vsphere.vmware.com^fd853c26-0d37-4c7b-b927-ab05319194db\") on node \"kubedev-worker-2bcf0b684616.kubernetes.mycorp.test\" : persistentvolume \"pvc-bd39a603-ce12-4f42-b41b-77f31ad407b1\" not found"

However, these log messages stopped after the kube-controller-manager was restarted. The vsphere-csi-controller still logs the same errors.

@RaunakShah

@yvespp Can you use kubectl to delete the volumeattachment objects that have been left over? Like csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8 from the logs you posted above.
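
Something along these lines should do it (the attachment name is just the one from your logs above; check the list for anything else that is left over):

# list the VolumeAttachment objects and delete the stale one
kubectl get volumeattachments
kubectl delete volumeattachment csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8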


yvespp commented May 29, 2020

Didn't know about volumeattachments, thanks!

I deleted the volumeattachment and now the errors are gone. I had to remove the finalizer from the resource first, or else Kubernetes wouldn't delete it:

  finalizers:
  - external-attacher/csi-vsphere-vmware-com
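
For reference, a patch along these lines should be equivalent to editing the finalizer out by hand (the attachment name is the one from the logs above):

# clear the finalizers so Kubernetes can actually delete the object, then delete it
kubectl patch volumeattachment csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8 --type=merge -p '{"metadata":{"finalizers":null}}'
kubectl delete volumeattachment csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8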


yvespp commented May 29, 2020

Summary so far:

  • Deleting the VM without the vSphere CSI controller having time to detach the disk leaves vSphere and/or the vSphere CSI controller confused.
    • The controller tries to detach the volume from a node that no longer exists and fails.
    • Today I had another case where vSphere itself was in a weird state and detaching/attaching was prevented permanently:
      E0529 06:18:21.088909       1 vsphereutil.go:133] Failed to attach disk 915aa7d3-1be1-4cb4-83c6-7dfcde2ec62f with err CnsFault error: CNS: Failed to attach disk when calling AttachDisk:Fault cause: vim.fault.InvalidState
      
    • There seems to be no way, automatically or manually, to recover from that situation.
  • The controller doesn't clean up volumeattachments to PVs that no longer exist; they have to be cleaned up manually.


RaunakShah commented May 29, 2020

@yvespp are your workloads deployed as StatefulSets?

When you delete a VM, Kubernetes will not attempt to delete the Pod. If the Pod was part of a StatefulSet, then you have to run some manual steps to have it rescheduled on another node. This is being tracked in upstream Kubernetes here: kubernetes/kubernetes#65392

From your original description:

Logs of the controller show that it can't detach the volume from the deleted vm (kubedev-worker-2bcf0b684616):

The logs that you have pasted are actually external-attacher logs. Since the detach can't succeed (the node doesn't exist), you need to manually delete the volumeattachments, as you did in #221 (comment).


yvespp commented Jun 2, 2020

@yvespp are your workloads deployed as StatefulSets?

When you delete a VM, Kubernetes will not attempt to delete the Pod. If the Pod was part of a StatefulSet, then you have to run some manual steps to have it rescheduled on another node. This is being tracked in upstream Kubernetes here: kubernetes/kubernetes#65392

Yes, it's a StatefulSet. I did drain the node before it was shut down and deleted, but somehow the pod was not terminated before the node was shut down, and the volume was never released.
IMHO this can always happen, and Kubernetes or vSphere should not end up in an unrecoverable state because of it (though I understand this project can't address all Kubernetes issues).

On the Kubernetes side, deleting the volumeattachment seems to work, but vSphere itself can get stuck too: for Failed to attach disk when calling AttachDisk:Fault cause: vim.fault.InvalidState we opened a ticket with VMware support (20128330005).


yvespp commented Jun 2, 2020

Would it help if I deleted the volumeattachment manually before finally shutting down the node? What happens if the volume is still attached to the node when the volumeattachment is deleted? Probably nothing, as the CSI driver can't detach the volume while it's in use?

Maybe I should just wait until all volumeattachments of a node are removed...
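
A rough sketch of such a wait step (untested; the node name is just the one from the logs above, and the 10-second interval is arbitrary):

# block until no VolumeAttachment references the node that is about to be shut down
NODE=kubedev-worker-2bcf0b684616.kubernetes.mycorp.test
while kubectl get volumeattachments -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | grep -qx "$NODE"; do
  echo "still waiting for volumes to be detached from $NODE"
  sleep 10
done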


RaunakShah commented Jun 2, 2020

Can you try following these steps, once the node is in the NotReady state:

  1. Wait 6 mins for the Pod to enter the Terminating state.
  2. Force delete the Pod using the following command:
     kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force
     This should create another Pod, which will be in the ContainerCreating state.
  3. Wait 6 more mins for k8s to delete the volumeattachment object(s). If your Pod is claiming "n" volumes, there will be "n" volumeattachments. Kubernetes should also create new volumeattachment object(s) for these volumes.
  4. Now, the Pod that was in the ContainerCreating state for the last 6 mins should shortly enter the Running state.
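
To check progress in between those steps, looking at the VolumeAttachment objects and the Pod should be enough (<pod-name> and <namespace> are placeholders, as in the command in step 2):

# the stale attachment for the old node should disappear and a new one should show up
kubectl get volumeattachments
kubectl get pod <pod-name> -n <namespace> -w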


yvespp commented Jun 3, 2020

The node is never in the NotReady state in this case. Our rollout works like this:

  1. A new node/VM is created and joined to Kubernetes
  2. The old node is drained (kubectl drain) with a 3 minute timeout
  3. The old node is deleted from Kubernetes (kubectl delete node)
  4. The old node's VM is shut down and deleted

The problem happens when the drain times out and there are still pods with volumes on the old node when the node and VM get deleted.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Sep 1, 2020

ebuildy commented Sep 7, 2020

I believe this is related to kubernetes-csi/external-attacher#215

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Oct 7, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


marratj commented Dec 8, 2020

@yvespp any update from VMware in the ticket you opened with them?

We have the same issue: the disk/VMDK itself in vSphere is somehow stuck in this unrecoverable state. We cannot even attach the backing VMDK manually to any VM in vSphere after it got stuck in this state.


yvespp commented Dec 8, 2020

@marratj in our case the problem was that Changed Block Tracking (CBT) had been enabled by our backup tool in the background, but CBT was not enabled on the VM, so the disk could not be attached again.
To enable CBT, see: https://kb.vmware.com/s/article/1031873


marratj commented Dec 9, 2020

@yvespp thanks a lot! That was exactly it also in our case.

@mitchellmaler

Still ran into this running Kubernetes v1.21.3. The node was deleted but the VolumeAttachment was still around, referencing the old node name. We had to remove the finalizers on each leftover resource so it could be deleted and the pods could start up.


tgelter commented Aug 9, 2021

Still ran into this running Kubernetes v1.21.3. The node was deleted but the VolumeAttachment was still around, referencing the old node name. We had to remove the finalizers on each leftover resource so it could be deleted and the pods could start up.

Hello @mitchellmaler, since this issue is closed, I'd suggest you take a look and comment here: #359.

@mitchellmaler

@tgelter thanks!
