Document steps to upgrade management and workload clusters with new thumbprint #8042

Closed

sp1999 opened this issue Apr 24, 2024 · 0 comments

sp1999 commented Apr 24, 2024

Problem Statement:
A customer recently rotated the certificates for their vSphere datacenter, which were about to expire. This updated the datacenter with a new thumbprint. They want to know how to update their clusters to use the new thumbprint in the vSphere datacenter config.
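
For reference, the new SHA-1 thumbprint of the rotated certificate can be read directly from the vCenter endpoint, for example with openssl (a minimal sketch; substitute your actual vCenter URL for {vcenter-url}):
echo | openssl s_client -connect {vcenter-url}:443 2>/dev/null | openssl x509 -fingerprint -sha1 -noout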

Issue with running the eksctl anywhere upgrade command:
When we run the upgrade command on the management cluster with the new thumbprint, the EKS-A controller updates the management cluster's VSphereDatacenterConfig object with the new thumbprint, but the workload cluster's VSphereDatacenterConfig object still has the old thumbprint, so the vSphere datacenter reconciler throws the following thumbprint mismatch error in the EKS-A controller logs.

Error message:

{
    "ts":1713293290685.779,
    "caller":"controllers/vsphere_datacenter_controller.go:93",
    "msg":"Failed to validate VsphereDatacenterConfig",
    "controller":"vspheredatacenterconfig",
    "controllerGroup":"anywhere.eks.amazonaws.com",
    "controllerKind":"VSphereDatacenterConfig",
    "VSphereDatacenterConfig":{
        "name":"w01",
        "namespace":"default"
    },
    "namespace":"default",
    "name":"w01",
    "reconcileID":"f44c1366-80c9-4c41-82a0-e819c5109fc0",
    "err":"thumbprint mismatch detected, expected: {old-thumbprint}, actual: {new-thumbprint}"
}
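
To see which VSphereDatacenterConfig objects still carry the old thumbprint, one hedged check is to list the thumbprint of every datacenter config on the management cluster:
kubectl get vspheredatacenterconfigs.anywhere.eks.amazonaws.com -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,THUMBPRINT:.spec.thumbprint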

Solution:
From EKS-A v0.17.0 onwards, the vSphere datacenter config thumbprint field has been made mutable for upgrades. However, KinD-less upgrades were only introduced in v0.18.0, so the procedure differs by version. To update all the existing clusters with the new thumbprint, follow the steps below that match your cluster's EKS-A version:

For all EKS-A versions:

Before starting the upgrade process, take a backup of your management cluster as well as all workload clusters by following the steps documented here.
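
For reference, one way to snapshot the Cluster API objects is clusterctl move with the --to-directory flag (a sketch, assuming clusterctl is installed; the documented backup steps remain the source of truth):
clusterctl move --to-directory ./cluster-backup --kubeconfig mgmt/mgmt-eks-a-cluster.kubeconfig --namespace eksa-system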

For EKS-A v0.18.0 and above:

  1. Pause the eks-a cluster controller for the workload clusters only (not the management cluster):
export KUBECONFIG=mgmt/mgmt-eks-a-cluster.kubeconfig
kubectl annotate cluster w01 anywhere.eks.amazonaws.com/paused=true
kubectl annotate cluster w02 anywhere.eks.amazonaws.com/paused=true
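
Optionally, confirm that the paused annotation took effect before proceeding (a hedged check using the workload cluster names above; each command should print true):
kubectl get cluster w01 -o jsonpath='{.metadata.annotations.anywhere\.eks\.amazonaws\.com/paused}'
kubectl get cluster w02 -o jsonpath='{.metadata.annotations.anywhere\.eks\.amazonaws\.com/paused}'
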
  2. Create a manifest file containing the exact same VSphereDatacenterConfig objects for the existing management cluster as well as all the workload clusters, except with the thumbprint field set to the new thumbprint:
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereDatacenterConfig
metadata:
  name: mgmt
spec:
  datacenter: datacenter1
  insecure: false
  network: "/datacenter1/network/network1"
  server: {vcenter-url}
  thumbprint: {new-thumbprint}

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereDatacenterConfig
metadata:
  name: w01
spec:
  datacenter: datacenter1
  insecure: false
  network: "/datacenter1/network/network1"
  server: {vcenter-url}
  thumbprint: {new-thumbprint}

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereDatacenterConfig
metadata:
  name: w02
spec:
  datacenter: datacenter1
  insecure: false
  network: "/datacenter1/network/network1"
  server: {vcenter-url}
  thumbprint: {new-thumbprint}
  3. Apply the above manifest file to the management cluster:
kubectl apply -f {manifest-file-name}.yaml
  4. Verify that the new machines are being rolled out only for the management cluster:
kubectl get machines -A -w
  5. If any machine is stuck in the Provisioning phase, restart the capv controller manager pod (a name-agnostic alternative is noted after the command):
kubectl delete --force -n capv-system pod capv-controller-manager-84bdf678db-kdvx8
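
The pod name suffix above is specific to one environment; if yours differs, restarting the controller through its deployment is an equivalent, hedged alternative:
kubectl rollout restart deployment -n capv-system capv-controller-manager
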
  6. Once all the machines are updated, verify that the thumbprint field is updated in the config maps and objects for the management cluster:
kubectl get cm -n kube-system vsphere-cloud-config -o yaml
kubectl get cm -n eksa-system mgmt-cpi-manifests -o yaml
kubectl get vspheredatacenterconfigs.anywhere.eks.amazonaws.com mgmt -o yaml
kubectl get vsphereclusters.infrastructure.cluster.x-k8s.io -n eksa-system mgmt -oyaml
kubectl get vspherevms.infrastructure.cluster.x-k8s.io -n eksa-system mgmt-65cv5 -oyaml
  7. Verify that the vSphere machine templates and VMs are created with the new thumbprint for the management cluster:
kubectl get vspheremachinetemplates.infrastructure.cluster.x-k8s.io -n eksa-system
kubectl get vspheremachines.infrastructure.cluster.x-k8s.io -n eksa-system
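
To confirm the thumbprint those objects carry, a simple hedged check is to grep the rendered YAML:
kubectl get vspheremachinetemplates.infrastructure.cluster.x-k8s.io -n eksa-system -o yaml | grep -i thumbprint
kubectl get vspheremachines.infrastructure.cluster.x-k8s.io -n eksa-system -o yaml | grep -i thumbprint
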
  8. Unpause the eks-a cluster controller for each of the workload clusters one by one:
kubectl annotate cluster w01 anywhere.eks.amazonaws.com/paused-

Repeat steps 4, 5, 6, and 7 for w01.

kubectl annotate cluster w02 anywhere.eks.amazonaws.com/paused-

Repeat steps 4, 5, 6, and 7 for w02.

For EKS-A v0.17.x:

  1. Pause the eks-a cluster controller for the management cluster as well as all the workload clusters:
export KUBECONFIG=mgmt/mgmt-eks-a-cluster.kubeconfig
kubectl annotate cluster mgmt anywhere.eks.amazonaws.com/paused=true
kubectl annotate cluster w01 anywhere.eks.amazonaws.com/paused=true
kubectl annotate cluster w02 anywhere.eks.amazonaws.com/paused=true
  2. Create a manifest file containing the exact same VSphereDatacenterConfig objects for the existing management cluster as well as all the workload clusters, except with the thumbprint field set to the new thumbprint:
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereDatacenterConfig
metadata:
  name: mgmt
spec:
  datacenter: datacenter1
  insecure: false
  network: "/datacenter1/network/network1"
  server: {vcenter-url}
  thumbprint: {new-thumbprint}

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereDatacenterConfig
metadata:
  name: w01
spec:
  datacenter: datacenter1
  insecure: false
  network: "/datacenter1/network/network1"
  server: {vcenter-url}
  thumbprint: {new-thumbprint}

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereDatacenterConfig
metadata:
  name: w02
spec:
  datacenter: datacenter1
  insecure: false
  network: "/datacenter1/network/network1"
  server: {vcenter-url}
  thumbprint: {new-thumbprint}
  3. Apply the above manifest file to the management cluster:
kubectl apply -f {manifest-file-name}.yaml
  4. Update the vsphere-cloud-config config map with the new thumbprint:
kubectl edit configmaps -n kube-system vsphere-cloud-config
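
To locate the line to change before editing, you can grep the config map for the current value (a hedged check; the thumbprint sits inside the embedded cloud provider configuration):
kubectl get configmaps -n kube-system vsphere-cloud-config -o yaml | grep -i thumbprint
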
  5. Create VSphereMachineTemplate manifest files with the new thumbprint for the etcd, control plane, and worker machines (a tip for finding the most recent template number follows the commands):
kubectl get vspheremachinetemplates.infrastructure.cluster.x-k8s.io -n eksa-system mgmt-etcd-template-{most-recent-template-number} -oyaml > etcd-template.yaml
kubectl get vspheremachinetemplates.infrastructure.cluster.x-k8s.io -n eksa-system mgmt-control-plane-template-{most-recent-template-number} -oyaml > cp-template.yaml
kubectl get vspheremachinetemplates.infrastructure.cluster.x-k8s.io -n eksa-system mgmt-md-0-{most-recent-template-number} -oyaml > md-template.yaml
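
If you are unsure which template number is the most recent, listing the templates sorted by creation time is one hedged way to find it:
kubectl get vspheremachinetemplates.infrastructure.cluster.x-k8s.io -n eksa-system --sort-by=.metadata.creationTimestamp
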
  6. In the above manifest files, update the spec.template.spec.thumbprint field with the new thumbprint and update the metadata.name field with the new template number:
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereMachineTemplate
metadata:
  ...
  name: mgmt-control-plane-template-{previous-template-number-plus-one}
  namespace: eksa-system
  ...
spec:
  template:
    spec:
      ...
      thumbprint: {new-thumbprint}
      ...
  7. Apply the above manifest files to the management cluster:
kubectl apply -f etcd-template.yaml
kubectl apply -f cp-template.yaml
kubectl apply -f md-template.yaml
  8. Modify the spec.infrastructureTemplate.name field in the EtcdadmCluster object to point to the new etcd-template for the management cluster:
kubectl edit etcdadmclusters.etcdcluster.cluster.x-k8s.io -n eksa-system mgmt-etcd
  9. Modify the spec.machineTemplate.infrastructureRef.name field in the KubeadmControlPlane object to point to the new cp-template for the management cluster (a non-interactive alternative is sketched after the command):
kubectl edit kubeadmcontrolplanes.controlplane.cluster.x-k8s.io -n eksa-system mgmt
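
If you prefer a non-interactive edit, a merge patch along these lines performs the same update (a sketch; substitute the actual new control plane template name):
kubectl patch kubeadmcontrolplanes.controlplane.cluster.x-k8s.io -n eksa-system mgmt --type merge -p '{"spec":{"machineTemplate":{"infrastructureRef":{"name":"mgmt-control-plane-template-{previous-template-number-plus-one}"}}}}'
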
  10. Modify the spec.template.spec.infrastructureRef.name field in the MachineDeployment object to point to the new md-template for the management cluster:
kubectl edit machinedeployments.cluster.x-k8s.io -n eksa-system mgmt-md-0
  11. Update the spec.thumbprint field in the VSphereCluster object with the new thumbprint (an equivalent patch is shown after the command):
kubectl edit vsphereclusters.infrastructure.cluster.x-k8s.io -n eksa-system mgmt
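
Equivalently, the VSphereCluster thumbprint can be updated with a merge patch (a sketch; replace {new-thumbprint} with the actual value):
kubectl patch vsphereclusters.infrastructure.cluster.x-k8s.io -n eksa-system mgmt --type merge -p '{"spec":{"thumbprint":"{new-thumbprint}"}}'
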
  12. Verify that the new machines are being rolled out only for the management cluster:
kubectl get machines -A -w
  13. If any machine is stuck in the Provisioning phase, restart the capv controller manager pod:
kubectl delete --force -n capv-system pod capv-controller-manager-84bdf678db-kdvx8
  14. Unpause the eks-a cluster controller for the management cluster:
kubectl annotate cluster mgmt anywhere.eks.amazonaws.com/paused-
  15. Finally, unpause the eks-a cluster controller for each of the workload clusters one by one:
kubectl annotate cluster w01 anywhere.eks.amazonaws.com/paused-

Wait until all the new machines are rolled out with the new thumbprint for w01.

kubectl annotate cluster w02 anywhere.eks.amazonaws.com/paused-

Wait until all the new machines are rolled out with the new thumbprint for w02.
