eksctl anywhere upgrade leaves drivers/controllers in broken state #896
Comments
It seems that what's causing the issue is the non-graceful/aggressive shutdown of the existing worker nodes before their volumes can be detached: kubernetes/enhancements#1116. The trade-off, though, would be a longer cluster upgrade time.
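For illustration, the graceful path is to drain a worker before its machine is deleted, so the CSI driver has a chance to detach its volumes (a sketch only; this is not part of the eksctl anywhere upgrade flow, and the node name is just the one from this issue):

```
# Evict pods so their volumes detach cleanly before the node is removed
kubectl drain dev-md-0-6f5f5c955-rghzs \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --timeout=5m
```

(On older kubectl versions the flag is --delete-local-data rather than --delete-emptydir-data.)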
A workaround for the stale VolumeAttachments: here's how to delete them without using the CSI driver:

```
% kubectl get VolumeAttachments | grep dev-md-0-6f5f5c955-rghzs
csi-17ea8261329e1e8e85160cad351f24294ce52a59efd1a222a464cc73b4184518   csi.vsphere.vmware.com   pvc-4143115b-dbd2-4a19-a114-1e871e9b851e   dev-md-0-6f5f5c955-rghzs   true   30d
csi-7f6a00400ffb52429987f76ade13938a8e3f558214764d03a921e4c5b6b2bb30   csi.vsphere.vmware.com   pvc-200e1a3c-8974-4bca-882c-92ca1a0aa0e0   dev-md-0-6f5f5c955-rghzs   true   47d
csi-99d8a57b51da4169a51c75454411c51d8abd1ebc8f8f9d912f117b4f64338c32   csi.vsphere.vmware.com   pvc-a00a5328-93e1-4eaf-b7b6-ad2e5a362ee0   dev-md-0-6f5f5c955-rghzs   true   30d
csi-bfafbc70d94ab32121f4ba1374462c689d6e8546191410a034af5f5f35f4d076   csi.vsphere.vmware.com   pvc-2227fc30-f155-4458-89db-9dcbb81a5927   dev-md-0-6f5f5c955-rghzs   true   43h

% kubectl edit VolumeAttachment csi-99d8a57b51da4169a51c75454411c51d8abd1ebc8f8f9d912f117b4f64338c32
```

In the editor, delete the finalizers so the object can be garbage-collected:

```yaml
apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  annotations:
    csi.alpha.kubernetes.io/node-id: dev-md-0-6f5f5c955-rghzs
  creationTimestamp: "2021-11-29T20:54:10Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2021-12-28T22:45:04Z"
  finalizers:                                  # DELETE this line
  - external-attacher/csi-vsphere-vmware-com   # DELETE this line
  name: csi-99d8a57b51da4169a51c75454411c51d8abd1ebc8f8f9d912f117b4f64338c32
```

After cleaning up the stale VolumeAttachments, the controllers stopped having leadership-election issues.
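If you'd rather not do this interactively, the same finalizer removal can be done with a patch (a sketch using the stale attachment name from above; make sure the node is really gone and the volume is detached before forcing this):

```
# Clear the external-attacher finalizer so the already-terminating
# VolumeAttachment object can actually be deleted
kubectl patch volumeattachment \
  csi-99d8a57b51da4169a51c75454411c51d8abd1ebc8f8f9d912f117b4f64338c32 \
  --type=merge -p '{"metadata":{"finalizers":null}}'
```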
Looking into this. Thanks @smarsh-tim!
We haven't gotten to this yet. I'm putting it in the backlog for further investigation.
This issue is becoming more serious. I'm finding that pods randomly cannot be created: when upgrading deployment images I'm also getting a "volume is currently in use" error on container creation.
This is getting worse each day. Every time a pod gets restarted, there's a 50/50 chance it can't create the container because the volume shows as in use. It's getting to the point where EKS-A is not usable in a production state. When will this be fixed?
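For anyone else hitting the "volume is in use" error, this is one way to find the VolumeAttachment pinning a pod's volume (a sketch; `<claim-name>` and `<namespace>` are placeholders for the affected PVC):

```
# Resolve the PV bound to the claim, then find the attachment holding it
pv=$(kubectl get pvc <claim-name> -n <namespace> -o jsonpath='{.spec.volumeName}')
kubectl get volumeattachments | grep "$pv"
```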
I'll also note that the workaround in this ticket works only if I also detach the CNS volume using the vCenter MOB API, but that is a very lengthy process that itself only works some of the time.
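For what it's worth, govc may be a faster route than the MOB for the detach step (a hedged sketch, assuming a govc build that includes the first-class-disk commands; `<datastore>`, `<vm-name>`, and `<disk-id>` are placeholders, where the disk ID is the CNS volume ID shown in vCenter):

```
# List first-class disks (CNS volumes) on the datastore to find the stale one
govc disk.ls -ds <datastore>
# Detach it from the VM it is still registered against
govc disk.detach -vm <vm-name> <disk-id>
```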
What happened:
After running a cluster upgrade command (where in my case the worker nodes were all replaced), the vsphere-csi-controller has started throwing errors:
However, I can confirm that the PersistentVolumes successfully re-attached to the new worker nodes. My persistent workloads have no data loss post-upgrade.
Same with capi-controller-manager:
This seems to result in a number of services experiencing errors like:
Seeing this with:
It seems that the old worker nodes are still being referenced.
What you expected to happen:
References to old worker nodes should be cleaned up, including VolumeAttachments.
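A quick cross-check for this (a sketch; it simply compares attachment node names against live nodes):

```
# Print nodes that VolumeAttachments still reference but that no longer exist
for node in $(kubectl get volumeattachments \
    -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort -u); do
  kubectl get node "$node" >/dev/null 2>&1 \
    || echo "stale VolumeAttachments reference node: $node"
done
```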
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
This seems to be similar to kubernetes-csi/external-attacher#215
Environment: