
eksctl anywhere upgrade leaves drivers/controllers in broken state #896

Open
smarsh-tim opened this issue Dec 30, 2021 · 7 comments
Labels: external (An issue, bug or feature request filed from outside the AWS org)

@smarsh-tim

What happened:
After running a cluster upgrade command (where in my case the worker nodes were all replaced), the vsphere-csi-controller has started throwing errors:

Error processing "csi-7f6a00400ffb52421987f76ade13928a8e3f5582144chd03a921e4b5b6b2bb30": failed to detach: rpc error: code = Internal desc = failed to find VirtualMachine for node:"dev-md-0-6f5f5c955-rghzs". Error: node wasn't fo

However, I can confirm that the PersistentVolumes successfully re-attached to the new worker nodes. My persistent workloads have no data loss post-upgrade.

Same with capi-controller-manager:

E1230 16:46:14.917830       1 machine_controller.go:685] controllers/Machine "msg"="Unable to retrieve machine from node" "error"="no matching Machine"  "node"="dev-md-0-5b6bd949cd-qzpxs"
E1230 16:46:14.917864       1 machine_controller.go:685] controllers/Machine "msg"="Unable to retrieve machine from node" "error"="no matching Machine"  "node"="dev-md-0-5b6bd949cd-qzpxs"
E1230 16:46:55.067653       1 controller.go:257] controller-runtime/controller "msg"="Reconciler error" "error"="could not find infrastructure.cluster.x-k8s.io/v1alpha3, Kind=VSphereMachine \"dev-worker-node-template-1640730669596-qsnqj\" for Machine \"dev-md-0-5b6bd949cd-xwnbv\" in namespace \"eksa-system\", requeuing: requeue in 30s" "controller"="machine" "name"="dev-md-0-5b6bd949cd-xwnbv" "namespace"="eksa-system"
E1230 16:47:10.156694       1 leaderelection.go:331] error retrieving resource lock capi-system/controller-leader-election-capi: etcdserver: leader changed
E1230 16:47:10.345013       1 controller.go:257] controller-runtime/controller "msg"="Reconciler error" "error"="could not find infrastructure.cluster.x-k8s.io/v1alpha3, Kind=VSphereMachine \"dev-worker-node-template-1640730669596-5rzbc\" for Machine \"dev-md-0-5b6bd949cd-bcwxw\" in namespace \"eksa-system\", requeuing: requeue in 30s" "controller"="machine" "name"="dev-md-0-5b6bd949cd-bcwxw" "namespace"="eksa-system"
I1230 16:47:11.602700 

This seems to result in a number of services experiencing errors like:

Error: error running manager: leader election lost

Seeing this with:

  • kube-vip-cloud-provider
  • etcdadm-bootstrap-provider-controller
  • capi-kubeadm-control-plane-controller
  • eks-controller-manager
  • etc. (it looks like anything that does leader election)

It seems that the old worker nodes are still being referenced.
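
For what it's worth, the resource lock named in the capi log above can be inspected directly; depending on the controller-runtime version the lock is stored as a ConfigMap or a Lease (a rough check, not a fix):

% kubectl -n capi-system get configmap controller-leader-election-capi -o yaml   # lock named in the log above
% kubectl -n capi-system get lease                                               # newer controller-runtime releases use Lease objects instead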

What you expected to happen:
References to the old worker nodes to be cleaned up, including their VolumeAttachments.

How to reproduce it (as minimally and precisely as possible):

  1. Create a cluster
  2. Assign a PersistentVolume to a workload
  3. Update the EKS Anywhere cluster spec to a new template (or anything else that would trigger a node replacement); see the sketch after this list
  4. Post upgrade - observe the errors in vsphere-csi-driver
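
For reference, a minimal sketch of the kind of spec change that rolls the worker nodes; the machine config name and template path here are made up, the point is only that bumping spec.template (or similar) replaces the worker VMs, after which eksctl anywhere applies it:

# hypothetical VSphereMachineConfig excerpt from the cluster spec
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  name: dev-worker                                              # hypothetical name
spec:
  template: /Datacenter/vm/Templates/ubuntu-2004-kube-v1.21.5   # point at the new VM template

% eksctl anywhere upgrade cluster -f dev-cluster.yaml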

Anything else we need to know?:
This seems to be similar to kubernetes-csi/external-attacher#215

Environment:

  • EKS Anywhere Release: unknown (not sure where to pull this from; my CLI version is 0.66.0)
  • EKS Distro Release: v1.21.2-eks-1-21-5
@smarsh-tim (Author)

It seems what's causing the issue is the non-graceful/aggressive shutdown of the existing worker nodes before the volumes are able to detach: kubernetes/enhancements#1116

But then the trade-off would be a longer cluster upgrade time.
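
For illustration only, manually draining the outgoing node before it is deleted would give the attacher time to release the volumes (node name taken from the errors above), at the cost of a slower upgrade:

% kubectl drain dev-md-0-6f5f5c955-rghzs --ignore-daemonsets --delete-emptydir-data --timeout=5m
% kubectl get volumeattachments | grep dev-md-0-6f5f5c955-rghzs   # wait for this to come back empty before the machine is removed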

@smarsh-tim (Author) commented Dec 30, 2021

As a work-around for the stale VolumeAttachments, here's how to delete them without going through the CSI driver:

% kubectl get VolumeAttachments | grep dev-md-0-6f5f5c955-rghzs                                                            
csi-17ea8261329e1e8e85160cad351f24294ce52a59efd1a222a464cc73b4184518   csi.vsphere.vmware.com   pvc-4143115b-dbd2-4a19-a114-1e871e9b851e  dev-md-0-6f5f5c955-rghzs    true       30d
csi-7f6a00400ffb52429987f76ade13938a8e3f558214764d03a921e4c5b6b2bb30   csi.vsphere.vmware.com   pvc-200e1a3c-8974-4bca-882c-92ca1a0aa0e0  dev-md-0-6f5f5c955-rghzs    true       47d
csi-99d8a57b51da4169a51c75454411c51d8abd1ebc8f8f9d912f117b4f64338c32   csi.vsphere.vmware.com   pvc-a00a5328-93e1-4eaf-b7b6-ad2e5a362ee0  dev-md-0-6f5f5c955-rghzs    true       30d
csi-bfafbc70d94ab32121f4ba1374462c689d6e8546191410a034af5f5f35f4d076   csi.vsphere.vmware.com   pvc-2227fc30-f155-4458-89db-9dcbb81a5927  dev-md-0-6f5f5c955-rghzs    true       43h
% kubectl edit VolumeAttachment csi-99d8a57b51da4169a51c75454411c51d8abd1ebc8f8f9d912f117b4f64338c32
apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  annotations:
    csi.alpha.kubernetes.io/node-id: dev-md-0-6f5f5c955-rghzs
  creationTimestamp: "2021-11-29T20:54:10Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2021-12-28T22:45:04Z"
  finalizers:                                  # delete this line
  - external-attacher/csi-vsphere-vmware-com   # and this line
  name: csi-99d8a57b51da4169a51c75454411c51d8abd1ebc8f8f9d912f117b4f64338c32

After cleaning up the stale Volume Attachments, the controllers stopped having leadership election issues.
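
If you want to skip the interactive edit, the same finalizer removal can be done with a merge patch (same VolumeAttachment name as above):

% kubectl patch volumeattachment csi-99d8a57b51da4169a51c75454411c51d8abd1ebc8f8f9d912f117b4f64338c32 \
    --type=merge -p '{"metadata":{"finalizers":null}}'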

g-gaston self-assigned this on Jan 3, 2022
@g-gaston (Member) commented Jan 3, 2022

Looking into this. Thanks @smarsh-tim!

@g-gaston (Member)

We haven't got to this yet. I'm putting it in the backlog for further investigation.

g-gaston removed their assignment on Jan 19, 2022
g-gaston added this to the backlog milestone on Jan 19, 2022
jaxesn added the external label on Apr 25, 2022
@echel0n commented Oct 10, 2022

This issue is becoming more serious. I'm finding that pods randomly cannot be created because I get a "volume is in use" error on container creation when upgrading deployment images as well.

Warning FailedAttachVolume 17m attachdetach-controller AttachVolume.Attach failed for volume "pvc-e1610bdf-5420-444e-99bd-a274642b66e7" : rpc error: code = Internal desc = failed to attach disk: "6b284a9a-582d-4ecc-bb94-bf49017e587a" with node: "192.168.22.161" err failed to attach cns volume: "6b284a9a-582d-4ecc-bb94-bf49017e587a" to node vm: "VirtualMachine:vm-5190 [VirtualCenterHost: vcenter.vsphere.xxx.xxx UUID: 421d6be9-800f-38ca-557d-4fa14b7895a9, Datacenter: Datacenter [Datacenter: Datacenter:datacenter-3, VirtualCenterHost: vcenter.vsphere.xxx.xxx]]". fault: "(*types.LocalizedMethodFault)(0xc000c4f8e0)({\n DynamicData: (types.DynamicData) {\n },\n Fault: (*types.ResourceInUse)(0xc000c6d140)({\n VimFault: (types.VimFault) {\n MethodFault: (types.MethodFault) {\n FaultCause: (*types.LocalizedMethodFault)(<nil>),\n FaultMessage: ([]types.LocalizableMessage) <nil>\n }\n },\n Type: (string) \"\",\n Name: (string) (len=6) \"volume\"\n }),\n LocalizedMessage: (string) (len=32) \"The resource 'volume' is in use.\"\n})\n". opId: "101159e7"
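
In case it helps triage, a quick way to see which node each stuck attachment still points at (these are just the standard VolumeAttachment fields):

% kubectl get volumeattachments -o custom-columns=NAME:.metadata.name,PV:.spec.source.persistentVolumeName,NODE:.spec.nodeName,ATTACHED:.status.attached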

@echel0n commented Oct 13, 2022

This is getting worse every day. Every time a pod gets restarted there's a 50/50 chance that it can't create the container because the volume shows as in use. It's getting to the point where EKS-A is not usable in a production state. When will this be fixed?

@echel0n commented Oct 13, 2022

I'll also note that the workaround in this ticket only works if I also detach the CNS volume using the vCenter MOB API, and that is a very lengthy process that also only works some of the time.
