Document steps to upgrade management and workload clusters with new thumbprint #8042

Closed

sp1999 opened this issue Apr 24, 2024 · 0 comments

sp1999 commented Apr 24, 2024

Problem Statement:
A customer recently rotated the certificates for their vSphere datacenter, which were about to expire. This updated the datacenter with a new thumbprint. They want to know how to update their clusters to use the new thumbprint in the vSphere datacenter config.
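
For reference, the new SHA-1 thumbprint of the rotated certificate can be read directly from the vCenter endpoint, for example with openssl (a minimal sketch; substitute your actual vCenter URL for {vcenter-url}):
echo | openssl s_client -connect {vcenter-url}:443 2>/dev/null | openssl x509 -fingerprint -sha1 -noout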

Issue with running the eksctl anywhere upgrade command:
When we run the upgrade command on the management cluster with the new thumbprint, the EKS-A controller updates the management cluster's VSphereDatacenterConfig object with the new thumbprint, but the workload cluster's VSphereDatacenterConfig object still has the old thumbprint, so the vSphere datacenter reconciler throws the following thumbprint mismatch error in the EKS-A controller logs.

Error message:

{
    "ts":1713293290685.779,
    "caller":"controllers/vsphere_datacenter_controller.go:93",
    "msg":"Failed to validate VsphereDatacenterConfig",
    "controller":"vspheredatacenterconfig",
    "controllerGroup":"anywhere.eks.amazonaws.com",
    "controllerKind":"VSphereDatacenterConfig",
    "VSphereDatacenterConfig":{
        "name":"w01",
        "namespace":"default"
    },
    "namespace":"default",
    "name":"w01",
    "reconcileID":"f44c1366-80c9-4c41-82a0-e819c5109fc0",
    "err":"thumbprint mismatch detected, expected: {old-thumbprint}, actual: {new-thumbprint}"
}
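
To see which VSphereDatacenterConfig objects still carry the old thumbprint, one hedged check is to list the thumbprint of every datacenter config on the management cluster:
kubectl get vspheredatacenterconfigs.anywhere.eks.amazonaws.com -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,THUMBPRINT:.spec.thumbprint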

Solution:
From EKS-A v0.17.0 onwards, the vSphere datacenter config thumbprint field has been made mutable for upgrades. However, KinD-less upgrades were only introduced in v0.18.0, so the procedure differs by version. To update all the existing clusters with the new thumbprint, follow the steps below that match your cluster's EKS-A version:

For all EKS-A versions:

Before starting the upgrade process, take a backup of your management cluster as well as all workload clusters by following the steps documented here.
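
For reference, one way to snapshot the Cluster API objects is clusterctl move with the --to-directory flag (a sketch, assuming clusterctl is installed; the documented backup steps remain the source of truth):
clusterctl move --to-directory ./cluster-backup --kubeconfig mgmt/mgmt-eks-a-cluster.kubeconfig --namespace eksa-system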

For EKS-A v0.18.0 and above:

  1. Pause the eks-a cluster controller for the workload clusters only (not the management cluster):
export KUBECONFIG=mgmt/mgmt-eks-a-cluster.kubeconfig
kubectl annotate cluster w01 anywhere.eks.amazonaws.com/paused=true
kubectl annotate cluster w02 anywhere.eks.amazonaws.com/paused=true
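
Optionally, confirm that the paused annotation took effect before proceeding (a hedged check using the workload cluster names above; each command should print true):
kubectl get cluster w01 -o jsonpath='{.metadata.annotations.anywhere\.eks\.amazonaws\.com/paused}'
kubectl get cluster w02 -o jsonpath='{.metadata.annotations.anywhere\.eks\.amazonaws\.com/paused}'
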
  2. Create a manifest file containing the exact same VSphereDatacenterConfig objects for the existing management cluster as well as all the workload clusters, except with the thumbprint field set to the new thumbprint:
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereDatacenterConfig
metadata:
  name: mgmt
spec:
  datacenter: datacenter1
  insecure: false
  network: "/datacenter1/network/network1"
  server: {vcenter-url}
  thumbprint: {new-thumbprint}

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereDatacenterConfig
metadata:
  name: w01
spec:
  datacenter: datacenter1
  insecure: false
  network: "/datacenter1/network/network1"
  server: {vcenter-url}
  thumbprint: {new-thumbprint}

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereDatacenterConfig
metadata:
  name: w02
spec:
  datacenter: datacenter1
  insecure: false
  network: "/datacenter1/network/network1"
  server: {vcenter-url}
  thumbprint: {new-thumbprint}
  3. Apply the above manifest file to the management cluster:
kubectl apply -f {manifest-file-name}.yaml
  4. Verify that the new machines are being rolled out only for the management cluster:
kubectl get machines -A -w
  5. If any machine is stuck in the Provisioning phase, restart the capv controller manager pod (a name-agnostic alternative is noted after the command):
kubectl delete --force -n capv-system pod capv-controller-manager-84bdf678db-kdvx8
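
The pod name suffix above is specific to one environment; if yours differs, restarting the controller through its deployment is an equivalent, hedged alternative:
kubectl rollout restart deployment -n capv-system capv-controller-manager
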
  6. Once all the machines are updated, verify that the thumbprint field is updated in the config maps and objects for the management cluster:
kubectl get cm -n kube-system vsphere-cloud-config -o yaml
kubectl get cm -n eksa-system mgmt-cpi-manifests -o yaml
kubectl get vspheredatacenterconfigs.anywhere.eks.amazonaws.com mgmt -o yaml
kubectl get vsphereclusters.infrastructure.cluster.x-k8s.io -n eksa-system mgmt -oyaml
kubectl get vspherevms.infrastructure.cluster.x-k8s.io -n eksa-system mgmt-65cv5 -oyaml
  7. Verify that the vSphere machine templates and VMs are created with the new thumbprint for the management cluster:
kubectl get vspheremachinetemplates.infrastructure.cluster.x-k8s.io -n eksa-system
kubectl get vspheremachines.infrastructure.cluster.x-k8s.io -n eksa-system
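
To confirm the thumbprint those objects carry, a simple hedged check is to grep the rendered YAML:
kubectl get vspheremachinetemplates.infrastructure.cluster.x-k8s.io -n eksa-system -o yaml | grep -i thumbprint
kubectl get vspheremachines.infrastructure.cluster.x-k8s.io -n eksa-system -o yaml | grep -i thumbprint
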
  8. Unpause the eks-a cluster controller for each of the workload clusters one by one:
kubectl annotate cluster w01 anywhere.eks.amazonaws.com/paused-

Repeat steps 4, 5, 6, and 7 for w01.

kubectl annotate cluster w02 anywhere.eks.amazonaws.com/paused-

Repeat steps 4, 5, 6, and 7 for w02.

For EKS-A v0.17.x:

  1. Pause the eks-a cluster controller for the management cluster as well as all the workload clusters:
export KUBECONFIG=mgmt/mgmt-eks-a-cluster.kubeconfig
kubectl annotate cluster mgmt anywhere.eks.amazonaws.com/paused=true
kubectl annotate cluster w01 anywhere.eks.amazonaws.com/paused=true
kubectl annotate cluster w02 anywhere.eks.amazonaws.com/paused=true
  2. Create a manifest file containing the exact same VSphereDatacenterConfig objects for the existing management cluster as well as all the workload clusters, except with the thumbprint field set to the new thumbprint:
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereDatacenterConfig
metadata:
  name: mgmt
spec:
  datacenter: datacenter1
  insecure: false
  network: "/datacenter1/network/network1"
  server: {vcenter-url}
  thumbprint: {new-thumbprint}

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereDatacenterConfig
metadata:
  name: w01
spec:
  datacenter: datacenter1
  insecure: false
  network: "/datacenter1/network/network1"
  server: {vcenter-url}
  thumbprint: {new-thumbprint}

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereDatacenterConfig
metadata:
  name: w02
spec:
  datacenter: datacenter1
  insecure: false
  network: "/datacenter1/network/network1"
  server: {vcenter-url}
  thumbprint: {new-thumbprint}
  3. Apply the above manifest file to the management cluster:
kubectl apply -f {manifest-file-name}.yaml
  4. Update the vsphere-cloud-config config map with the new thumbprint:
kubectl edit configmaps -n kube-system vsphere-cloud-config
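
To locate the line to change before editing, you can grep the config map for the current value (a hedged check; the thumbprint sits inside the embedded cloud provider configuration):
kubectl get configmaps -n kube-system vsphere-cloud-config -o yaml | grep -i thumbprint
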
  5. Create VSphereMachineTemplate manifest files with the new thumbprint for the etcd, control plane, and worker machines (a tip for finding the most recent template number follows the commands):
kubectl get vspheremachinetemplates.infrastructure.cluster.x-k8s.io -n eksa-system mgmt-etcd-template-{most-recent-template-number} -oyaml > etcd-template.yaml
kubectl get vspheremachinetemplates.infrastructure.cluster.x-k8s.io -n eksa-system mgmt-control-plane-template-{most-recent-template-number} -oyaml > cp-template.yaml
kubectl get vspheremachinetemplates.infrastructure.cluster.x-k8s.io -n eksa-system mgmt-md-0-{most-recent-template-number} -oyaml > md-template.yaml
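
If you are unsure which template number is the most recent, listing the templates sorted by creation time is one hedged way to find it:
kubectl get vspheremachinetemplates.infrastructure.cluster.x-k8s.io -n eksa-system --sort-by=.metadata.creationTimestamp
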
  6. In the above manifest files, update the spec.template.spec.thumbprint field with the new thumbprint and update the metadata.name field with the new template number:
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereMachineTemplate
metadata:
  ...
  name: mgmt-control-plane-template-{previous-template-number-plus-one}
  namespace: eksa-system
  ...
spec:
  template:
    spec:
      ...
      thumbprint: {new-thumbprint}
      ...
  7. Apply the above manifest files to the management cluster:
kubectl apply -f etcd-template.yaml
kubectl apply -f cp-template.yaml
kubectl apply -f md-template.yaml
  8. Modify the spec.infrastructureTemplate.name field in the EtcdadmCluster object to point to the new etcd-template for the management cluster:
kubectl edit etcdadmclusters.etcdcluster.cluster.x-k8s.io -n eksa-system mgmt-etcd
  9. Modify the spec.machineTemplate.infrastructureRef.name field in the KubeadmControlPlane object to point to the new cp-template for the management cluster (a non-interactive alternative is sketched after the command):
kubectl edit kubeadmcontrolplanes.controlplane.cluster.x-k8s.io -n eksa-system mgmt
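
If you prefer a non-interactive edit, a merge patch along these lines performs the same update (a sketch; substitute the actual new control plane template name):
kubectl patch kubeadmcontrolplanes.controlplane.cluster.x-k8s.io -n eksa-system mgmt --type merge -p '{"spec":{"machineTemplate":{"infrastructureRef":{"name":"mgmt-control-plane-template-{previous-template-number-plus-one}"}}}}'
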
  10. Modify the spec.template.spec.infrastructureRef.name field in the MachineDeployment object to point to the new md-template for the management cluster:
kubectl edit machinedeployments.cluster.x-k8s.io -n eksa-system mgmt-md-0
  11. Update the spec.thumbprint field in the VSphereCluster object with the new thumbprint (an equivalent patch is shown after the command):
kubectl edit vsphereclusters.infrastructure.cluster.x-k8s.io -n eksa-system mgmt
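
Equivalently, the VSphereCluster thumbprint can be updated with a merge patch (a sketch; replace {new-thumbprint} with the actual value):
kubectl patch vsphereclusters.infrastructure.cluster.x-k8s.io -n eksa-system mgmt --type merge -p '{"spec":{"thumbprint":"{new-thumbprint}"}}'
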
  12. Verify that the new machines are being rolled out only for the management cluster:
kubectl get machines -A -w
  13. If any machine is stuck in the Provisioning phase, restart the capv controller manager pod:
kubectl delete --force -n capv-system pod capv-controller-manager-84bdf678db-kdvx8
  14. Unpause the eks-a cluster controller for the management cluster:
kubectl annotate cluster mgmt anywhere.eks.amazonaws.com/paused-
  15. Finally, unpause the eks-a cluster controller for each of the workload clusters one by one:
kubectl annotate cluster w01 anywhere.eks.amazonaws.com/paused-

Wait until all the new machines are rolled out with the new thumbprint for w01.

kubectl annotate cluster w02 anywhere.eks.amazonaws.com/paused-

Wait until all the new machines are rolled out with the new thumbprint for w02.
