eksctl anywhere upgrade leaves drivers/controllers in broken state #896
Comments
It seems that what's causing the issue is the non-graceful/aggressive shutdown of the existing worker nodes before their volumes can be detached: kubernetes/enhancements#1116. The trade-off, though, would be a longer cluster upgrade time.
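For illustration, the graceful path is to drain a worker before its machine is deleted, so the CSI driver has a chance to detach its volumes (a sketch only; this is not part of the eksctl anywhere upgrade flow, and the node name is just the one from this issue):

```
# Evict pods so their volumes detach cleanly before the node is removed
kubectl drain dev-md-0-6f5f5c955-rghzs \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --timeout=5m
```

(On older kubectl versions the flag is --delete-local-data rather than --delete-emptydir-data.)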
A workaround for the stale VolumeAttachments: here's how to delete them without using the CSI driver:

```
% kubectl get VolumeAttachments | grep dev-md-0-6f5f5c955-rghzs
csi-17ea8261329e1e8e85160cad351f24294ce52a59efd1a222a464cc73b4184518   csi.vsphere.vmware.com   pvc-4143115b-dbd2-4a19-a114-1e871e9b851e   dev-md-0-6f5f5c955-rghzs   true   30d
csi-7f6a00400ffb52429987f76ade13938a8e3f558214764d03a921e4c5b6b2bb30   csi.vsphere.vmware.com   pvc-200e1a3c-8974-4bca-882c-92ca1a0aa0e0   dev-md-0-6f5f5c955-rghzs   true   47d
csi-99d8a57b51da4169a51c75454411c51d8abd1ebc8f8f9d912f117b4f64338c32   csi.vsphere.vmware.com   pvc-a00a5328-93e1-4eaf-b7b6-ad2e5a362ee0   dev-md-0-6f5f5c955-rghzs   true   30d
csi-bfafbc70d94ab32121f4ba1374462c689d6e8546191410a034af5f5f35f4d076   csi.vsphere.vmware.com   pvc-2227fc30-f155-4458-89db-9dcbb81a5927   dev-md-0-6f5f5c955-rghzs   true   43h

% kubectl edit VolumeAttachment csi-99d8a57b51da4169a51c75454411c51d8abd1ebc8f8f9d912f117b4f64338c32
```

In the editor, delete the finalizers so the object can be garbage-collected:

```yaml
apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  annotations:
    csi.alpha.kubernetes.io/node-id: dev-md-0-6f5f5c955-rghzs
  creationTimestamp: "2021-11-29T20:54:10Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2021-12-28T22:45:04Z"
  finalizers:                                  # DELETE this line
  - external-attacher/csi-vsphere-vmware-com   # DELETE this line
  name: csi-99d8a57b51da4169a51c75454411c51d8abd1ebc8f8f9d912f117b4f64338c32
```

After cleaning up the stale VolumeAttachments, the controllers stopped having leadership-election issues.
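If you'd rather not do this interactively, the same finalizer removal can be done with a patch (a sketch using the stale attachment name from above; make sure the node is really gone and the volume is detached before forcing this):

```
# Clear the external-attacher finalizer so the already-terminating
# VolumeAttachment object can actually be deleted
kubectl patch volumeattachment \
  csi-99d8a57b51da4169a51c75454411c51d8abd1ebc8f8f9d912f117b4f64338c32 \
  --type=merge -p '{"metadata":{"finalizers":null}}'
```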
Looking into this. Thanks @smarsh-tim!
We haven't gotten to this yet. I'm putting it in the backlog for further investigation.
This issue is becoming more serious. I'm finding that pods randomly cannot be created: when upgrading deployment images I'm also getting a "volume is currently in use" error on container creation.
This is getting worse each day. Every time a pod gets restarted, there's a 50/50 chance it can't create the container because the volume shows as in use. It's getting to the point where EKS-A is not usable in a production state. When will this be fixed?
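For anyone else hitting the "volume is in use" error, this is one way to find the VolumeAttachment pinning a pod's volume (a sketch; `<claim-name>` and `<namespace>` are placeholders for the affected PVC):

```
# Resolve the PV bound to the claim, then find the attachment holding it
pv=$(kubectl get pvc <claim-name> -n <namespace> -o jsonpath='{.spec.volumeName}')
kubectl get volumeattachments | grep "$pv"
```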
I'll also note that the workaround in this ticket works only if I also detach the CNS volume using the vCenter MOB API, but that is a very lengthy process that itself only works some of the time.
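For what it's worth, govc may be a faster route than the MOB for the detach step (a hedged sketch, assuming a govc build that includes the first-class-disk commands; `<datastore>`, `<vm-name>`, and `<disk-id>` are placeholders, where the disk ID is the CNS volume ID shown in vCenter):

```
# List first-class disks (CNS volumes) on the datastore to find the stale one
govc disk.ls -ds <datastore>
# Detach it from the VM it is still registered against
govc disk.detach -vm <vm-name> <disk-id>
```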
What happened:
After running a cluster upgrade command (where in my case the worker nodes were all replaced), the vsphere-csi-controller has started throwing errors:
However, I can confirm that the PersistentVolumes successfully re-attached to the new worker nodes. My persistent workloads have no data loss post-upgrade.
Same with capi-controller-manager:
This seems to result in a number of services experiencing errors like:
Seeing this with:
It seems that the old worker nodes are still being referenced.
What you expected to happen:
References to old worker nodes should be cleaned up, including VolumeAttachments.
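A quick cross-check for this (a sketch; it simply compares attachment node names against live nodes):

```
# Print nodes that VolumeAttachments still reference but that no longer exist
for node in $(kubectl get volumeattachments \
    -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort -u); do
  kubectl get node "$node" >/dev/null 2>&1 \
    || echo "stale VolumeAttachments reference node: $node"
done
```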
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
This seems to be similar to kubernetes-csi/external-attacher#215
Environment: