Volume can't be detached from deleted Node/VM #221

Closed
yvespp opened this issue May 15, 2020 · 21 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

yvespp commented May 15, 2020

Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug

What happened:
After draining and deleting a Node, a pod is stuck in the ContainerCreating phase because its volume can't be mounted. Event: Unable to attach or mount volumes
Logs of the controller show that it can't detach the volume from the deleted VM (kubedev-worker-2bcf0b684616):

kubectl.exe logs -f vsphere-csi-controller-0 --all-containers --tail 30
I0515 11:18:14.322186       1 controller.go:198] Started VA processing "csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8"
I0515 11:18:14.322258       1 csi_handler.go:209] CSIHandler: processing VA "csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8"
I0515 11:18:14.322267       1 csi_handler.go:260] Starting detach operation for "csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8"
I0515 11:18:14.322336       1 csi_handler.go:267] Detaching "csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8"
I0515 11:18:14.322365       1 csi_handler.go:699] Can't get CSINode kubedev-worker-2bcf0b684616.kubernetes.mycorp.test: csinode.storage.k8s.io "kubedev-worker-2bcf0b684616.kubernetes.mycorp.test" not found
I0515 11:18:14.323158       1 csi_handler.go:575] Saving detach error to "csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8"
I0515 11:18:14.329068       1 controller.go:158] Ignoring VolumeAttachment "csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8" change
I0515 11:18:14.329599       1 csi_handler.go:586] Saved detach error to "csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8"
I0515 11:18:14.329626       1 csi_handler.go:219] Error processing "csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8": failed to detach: rpc error: code = Internal desc = Failed to find VirtualMachine for node:"kubedev-worker-2bcf0b684616.kubernetes.mycorp.test". Error: node wasn't found
E0515 11:18:14.322833       1 manager.go:129] Node not found with nodeName kubedev-worker-2bcf0b684616.kubernetes.mycorp.test
E0515 11:18:14.322963       1 controller.go:312] Failed to find VirtualMachine for node:"kubedev-worker-2bcf0b684616.kubernetes.mycorp.test". Error: node wasn't found

There is no Node, CSINode or VM with that name anymore.
In the details of that PVC in the vSphere Container Volumes view, no VM is shown.

What you expected to happen:
The volume gets detached from the deleted node so it can be attached to the new node and the pod can start.

How to reproduce it (as minimally and precisely as possible):
Not sure; maybe the app didn't shut down correctly and the volume couldn't be detached before the node and VM were deleted.

Anything else we need to know?:
Where is this state stored, and can the PV be detached manually somehow?
I tried recreating the PVC to reattach the same PV and restarting the controller pod, but it didn't help.
Is there something I can do during the VM drain/shutdown to avoid this?

Environment:

  • csi-vsphere version: 1.0.2
  • vsphere-cloud-controller-manager version: 1.0.2
  • Kubernetes version: v1.17.5
  • vSphere version: 6.7u3
  • OS (e.g. from /etc/os-release): Ubuntu 18.04.4
  • Kernel (e.g. uname -a): 5.3.0-51-generic
  • Install tools: -
  • Others: -
k8s-ci-robot added the kind/bug label May 15, 2020

yvespp commented May 28, 2020

I have now deleted the PVC and it disappeared from the CNS UI in vSphere. The controller still tried to find the deleted node and detach the volume:

I0528 09:18:06.580392       1 csi_handler.go:209] CSIHandler: processing VA "csi-b3ba252d84e0fbcf3960ccc3a15ea762ac3a9264a83c1f01302812b390b61c03"
I0528 09:18:06.580558       1 csi_handler.go:225] CSIHandler: finished processing "csi-3705767c02cf5776301c78cda7caea782de1de082e3ec6a9ffeeed01195d2eab"
I0528 09:18:06.580550       1 csi_handler.go:699] Can't get CSINode kubedev-worker-2bcf0b684616.kubernetes.mycorp.test: csinode.storage.k8s.io "kubedev-worker-2bcf0b684616.kubernetes.mycorp.test" not found
I0528 09:18:06.580565       1 csi_handler.go:231] "csi-b3ba252d84e0fbcf3960ccc3a15ea762ac3a9264a83c1f01302812b390b61c03" is already attached
I0528 09:18:06.580577       1 csi_handler.go:225] CSIHandler: finished processing "csi-b3ba252d84e0fbcf3960ccc3a15ea762ac3a9264a83c1f01302812b390b61c03"
I0528 09:18:06.580531       1 csi_handler.go:231] "csi-3ba1436638c8e8002ab30b3443a84a1742ff10d452be74166f13342819893dd2" is already attached
I0528 09:18:06.580610       1 csi_handler.go:225] CSIHandler: finished processing "csi-3ba1436638c8e8002ab30b3443a84a1742ff10d452be74166f13342819893dd2"
I0528 09:18:06.580339       1 controller.go:198] Started VA processing "csi-d8d1c36986029288c0a1b9a223967f15017776782f6383953a65e4908d383b3f"
I0528 09:18:06.580633       1 csi_handler.go:209] CSIHandler: processing VA "csi-d8d1c36986029288c0a1b9a223967f15017776782f6383953a65e4908d383b3f"
I0528 09:18:06.580650       1 csi_handler.go:231] "csi-d8d1c36986029288c0a1b9a223967f15017776782f6383953a65e4908d383b3f" is already attached
I0528 09:18:06.580662       1 csi_handler.go:225] CSIHandler: finished processing "csi-d8d1c36986029288c0a1b9a223967f15017776782f6383953a65e4908d383b3f"
I0528 09:18:06.581737       1 csi_handler.go:575] Saving detach error to "csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8"
I0528 09:18:06.587437       1 controller.go:158] Ignoring VolumeAttachment "csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8" change
I0528 09:18:06.587889       1 csi_handler.go:586] Saved detach error to "csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8"
I0528 09:18:06.587933       1 csi_handler.go:219] Error processing "csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8": failed to detach: rpc error: code = Internal desc = Failed to find VirtualMachine for node:"kubedev-worker-2bcf0b684616.kubernetes.mycorp.test". Error: node wasn't found

I then removed the finalizer from the PV, but now the controller can't find the PV anymore:

I0528 09:26:18.361216       1 controller.go:198] Started VA processing "csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8"
I0528 09:26:18.361284       1 csi_handler.go:209] CSIHandler: processing VA "csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8"
I0528 09:26:18.361297       1 csi_handler.go:260] Starting detach operation for "csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8"
I0528 09:26:18.361393       1 csi_handler.go:267] Detaching "csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8"
I0528 09:26:18.361422       1 csi_handler.go:575] Saving detach error to "csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8"
I0528 09:26:18.367969       1 controller.go:158] Ignoring VolumeAttachment "csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8" change
I0528 09:26:18.368734       1 csi_handler.go:586] Saved detach error to "csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8"
I0528 09:26:18.368769       1 csi_handler.go:219] Error processing "csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8": failed to detach: persistentvolume "pvc-bd39a603-ce12-4f42-b41b-77f31ad407b1" not found

Where is this information still stored? Can I clean this up somehow?


yvespp commented May 28, 2020

Every two minutes the kube-controller-manager logs this:

W0528 09:43:31.762117       1 reconciler.go:206] attacherDetacher.DetachVolume started for volume "pvc-bd39a603-ce12-4f42-b41b-77f31ad407b1" (UniqueName: "kubernetes.io/csi/csi.vsphere.vmware.com^fd853c26-0d37-4c7b-b927-ab05319194db") on node "kubedev-worker-2bcf0b684616.kubernetes.mycorp.test" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching
E0528 09:43:31.771641       1 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/csi.vsphere.vmware.com^fd853c26-0d37-4c7b-b927-ab05319194db podName: nodeName:}" failed. No retries permitted until 2020-05-28 09:45:33.771608868 +0000 UTC m=+1362976.079051531 (durationBeforeRetry 2m2s). Error: "DetachVolume.Detach failed for volume \"pvc-bd39a603-ce12-4f42-b41b-77f31ad407b1\" (UniqueName: \"kubernetes.io/csi/csi.vsphere.vmware.com^fd853c26-0d37-4c7b-b927-ab05319194db\") on node \"kubedev-worker-2bcf0b684616.kubernetes.mycorp.test\" : persistentvolume \"pvc-bd39a603-ce12-4f42-b41b-77f31ad407b1\" not found"

However, these log messages stopped after the kube-controller-manager was restarted. The vsphere-csi-controller still logs the same errors.

@RaunakShah

@yvespp Can you use kubectl to delete the volumeattachment objects that have been left over? Like csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8 from the logs you posted above.
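
Something along these lines should do it (the attachment name is just the one from your logs above; check the list for anything else that is left over):

# list the VolumeAttachment objects and delete the stale one
kubectl get volumeattachments
kubectl delete volumeattachment csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8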


yvespp commented May 29, 2020

Didn't know about volumeattachments, thanks!

I deleted the volumeattachment and now the errors are gone. I had to remove the finalizer from the resource first, or else Kubernetes wouldn't delete it:

  finalizers:
  - external-attacher/csi-vsphere-vmware-com
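
For reference, a patch along these lines should be equivalent to editing the finalizer out by hand (the attachment name is the one from the logs above):

# clear the finalizers so Kubernetes can actually delete the object, then delete it
kubectl patch volumeattachment csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8 --type=merge -p '{"metadata":{"finalizers":null}}'
kubectl delete volumeattachment csi-ad8ca603308ed422a89d11ed4c09a448fc724c3998b0a134fe9a17100a5045c8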


yvespp commented May 29, 2020

Summary so far:

  • Deleting the VM without the vSphere CSI controller having time to detach the disk leaves vSphere and/or the vSphere CSI controller confused.
    • The controller tries to detach the volume from a node that no longer exists and fails.
    • Today I had another case where vSphere itself was in a weird state and detaching/attaching was prevented permanently:
      E0529 06:18:21.088909       1 vsphereutil.go:133] Failed to attach disk 915aa7d3-1be1-4cb4-83c6-7dfcde2ec62f with err CnsFault error: CNS: Failed to attach disk when calling AttachDisk:Fault cause: vim.fault.InvalidState
      
    • There seems to be no way, automatically or manually, to recover from that situation.
  • The controller doesn't clean up volumeattachments to PVs that no longer exist; they have to be cleaned up manually.


RaunakShah commented May 29, 2020

@yvespp are your workloads deployed as StatefulSets?

When you delete a VM, Kubernetes will not attempt to delete the Pod. If the Pod was part of a StatefulSet, then you have to run some manual steps to have it rescheduled on another node. This is being tracked in upstream Kubernetes here: kubernetes/kubernetes#65392

From your original description:

Logs of the controller show that it can't detach the volume from the deleted vm (kubedev-worker-2bcf0b684616):

The logs that you have pasted are actually external-attacher logs. Since the detach can't succeed (the node doesn't exist), you need to manually delete the volumeattachments, as you did in #221 (comment).


yvespp commented Jun 2, 2020

@yvespp are your workloads deployed as StatefulSets?

When you delete a VM, Kubernetes will not attempt to delete the Pod. If the Pod was part of a StatefulSet, then you have to run some manual steps to have it rescheduled on another node. This is being tracked in upstream Kubernetes here: kubernetes/kubernetes#65392

Yes, it's a StatefulSet. I did drain the node before it was shut down and deleted, but somehow the pod was not terminated before the node was shut down, and the volume was never released.
IMHO this can always happen, and Kubernetes or vSphere should not end up in an unrecoverable state because of it (though I understand this project can't address all Kubernetes issues).

On the Kubernetes side, deleting the volumeattachment seems to work, but vSphere itself can get stuck too: for Failed to attach disk when calling AttachDisk:Fault cause: vim.fault.InvalidState we opened a ticket with VMware support (20128330005).


yvespp commented Jun 2, 2020

Would it help if I deleted the volumeattachment manually before finally shutting down the node? What happens if the volume is still attached to the node when the volumeattachment is deleted? Probably nothing, as the CSI driver can't detach the volume while it's in use?

Maybe I should just wait until all volumeattachments of a node are removed...
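
A rough sketch of such a wait step (untested; the node name is just the one from the logs above, and the 10-second interval is arbitrary):

# block until no VolumeAttachment references the node that is about to be shut down
NODE=kubedev-worker-2bcf0b684616.kubernetes.mycorp.test
while kubectl get volumeattachments -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | grep -qx "$NODE"; do
  echo "still waiting for volumes to be detached from $NODE"
  sleep 10
done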


RaunakShah commented Jun 2, 2020

Can you try following these steps, once the node is in the NotReady state:

  1. Wait 6 mins for the Pod to enter the Terminating state.
  2. Force delete the Pod using the following command:
     kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force
     This should create another Pod, which will be in the ContainerCreating state.
  3. Wait 6 more mins for k8s to delete the volumeattachment object(s). If your Pod is claiming "n" volumes, there will be "n" volumeattachments. Kubernetes should also create new volumeattachment object(s) for these volumes.
  4. Now, the Pod that was in the ContainerCreating state for the last 6 mins should shortly enter the Running state.
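
To check progress in between those steps, looking at the VolumeAttachment objects and the Pod should be enough (<pod-name> and <namespace> are placeholders, as in the command in step 2):

# the stale attachment for the old node should disappear and a new one should show up
kubectl get volumeattachments
kubectl get pod <pod-name> -n <namespace> -w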


yvespp commented Jun 3, 2020

The node is never in the NotReady state in this case. Our rollout works like this:

  1. A new node/VM is created and joined to Kubernetes
  2. The old node is drained (kubectl drain) with a 3 minute timeout
  3. The old node is deleted from Kubernetes (kubectl delete node)
  4. The old node's VM is shut down and deleted

The problem happens when the drain times out and there are still pods with volumes on the old node when the node and VM get deleted.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Sep 1, 2020

ebuildy commented Sep 7, 2020

I believe this is related to kubernetes-csi/external-attacher#215

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Oct 7, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


marratj commented Dec 8, 2020

@yvespp any update from VMware in the ticket you opened with them?

We have the same issue: the disk/VMDK itself in vSphere is somehow stuck in this unrecoverable state. We cannot even attach the backing VMDK manually to any VM in vSphere after it got stuck in this state.


yvespp commented Dec 8, 2020

@marratj in our case the problem was that Changed Block Tracking (CBT) had been enabled by our backup tool in the background, but CBT was not enabled on the VM, so the disk could not be attached again.
To enable CBT, see: https://kb.vmware.com/s/article/1031873


marratj commented Dec 9, 2020

@yvespp thanks a lot! That was exactly it also in our case.

@mitchellmaler

Still ran into this running Kubernetes v1.21.3. The node was deleted but the VolumeAttachment was still around, referencing the old node name. We had to remove the finalizers on each leftover resource so it could be deleted and the pods could start up.


tgelter commented Aug 9, 2021

Still ran into this running Kubernetes v1.21.3. The node was deleted but the VolumeAttachment was still around, referencing the old node name. We had to remove the finalizers on each leftover resource so it could be deleted and the pods could start up.

Hello @mitchellmaler, since this issue is closed, I'd suggest you take a look and comment here: #359.

@mitchellmaler

@tgelter thanks!
