During an upgrade, the CLI attempts to force apply (`kubectl apply --force`) the Cluster spec at the "Applying eksa yaml resources to cluster" step. If there is a validation error in the cluster webhook at this point, the CLI stalls until it eventually times out. Normally, a force apply would delete the resource and recreate it, but instead the CLI gets stuck. This situation has primarily been observed when using Flux.
At this stage, the eksa-controller-manager outputs the following logs:
{"ts":1690488485167.4885,"caller":"controllers/cluster_controller.go:217","msg":"Reconciling cluster","v":0,"controller":"cluster","controllerGroup":"anywhere.eks.amazonaws.com","controllerKind":"Cluster","Cluster":{"name":"eksa-test-ae88bec","namespace":"default"},"namespace":"default","name":"eksa-test-ae88bec","reconcileID":"199cb160-1384-479f-b9f7-32dabcd6129e"}
{"ts":1690488485167.728,"caller":"controllers/cluster_controller.go:418","msg":"Updating cluster status","v":0,"controller":"cluster","controllerGroup":"anywhere.eks.amazonaws.com","controllerKind":"Cluster","Cluster":{"name":"eksa-test-ae88bec","namespace":"default"},"namespace":"default","name":"eksa-test-ae88bec","reconcileID":"199cb160-1384-479f-b9f7-32dabcd6129e"}
{"ts":1690488485168.4016,"caller":"controller/controller.go:329","msg":"Reconciler error","controller":"cluster","controllerGroup":"anywhere.eks.amazonaws.com","controllerKind":"Cluster","Cluster":{"name":"eksa-test-ae88bec","namespace":"default"},"namespace":"default","name":"eksa-test-ae88bec","reconcileID":"199cb160-1384-479f-b9f7-32dabcd6129e","err":"deleting self-managed clusters is not supported","errVerbose":"deleting self-managed clusters is not supported\ngithub.com/aws/eks-anywhere/controllers.(*ClusterReconciler).reconcileDelete\n\tgithub.com/aws/eks-anywhere/controllers/cluster_controller.go:444\ngithub.com/aws/eks-anywhere/controllers.(*ClusterReconciler).Reconcile\n\tgithub.com/aws/eks-anywhere/controllers/cluster_controller.go:262\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\tsigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:122\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:323\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tsigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235\nruntime.goexit\n\truntime/asm_amd64.s:1598"}
Example Scenario
When upgrading from the latest minor release to v0.17.0, the CLI stalled while "Applying eksa yaml resources to cluster". In this case, both the newly added `eksaVersion` field and the `bundlesRef` field were populated in the spec submitted to the kube-apiserver. A webhook validation does not allow both to be set at the same time, so the kube-apiserver pod was logging an error. This issue was roughly patched by creating `ClusterSpecGenerate` so that we could generate a yaml where `bundlesRef` is explicitly set to null, so that the field would not be omitted when submitted to the kube-apiserver.
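A sketch of the kind of manifest the `ClusterSpecGenerate` patch aims to produce (the values here are illustrative, not taken from the actual cluster): the explicit `null` forces the field to appear in the serialized spec so the apply clears any existing server-side value, instead of the field being dropped and the old value surviving.

```yaml
# Illustrative fragment only; names and versions are placeholders.
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: eksa-test-ae88bec
spec:
  eksaVersion: v0.17.0
  bundlesRef: null   # explicitly null, not omitted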
kube-apiserver pod log:
W0727 21:34:57.770016 1 dispatcher.go:216] rejected by webhook "validation.cluster.anywhere.amazonaws.com": &errors.StatusError{ErrStatus:v1.Status{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ListMeta:v1.ListMeta{SelfLink:"", ResourceVersion:"", Continue:"", RemainingItemCount:(*int64)(nil)}, Status:"Failure", Message:"admission webhook \"validation.cluster.anywhere.amazonaws.com\" denied the request: Cluster.anywhere.eks.amazonaws.com \"eksa-test-ae88bec\" is invalid: spec: Invalid value: v1alpha1.ClusterSpec{KubernetesVersion:\"1.27\", ControlPlaneConfiguration:v1alpha1.ControlPlaneConfiguration{Count:1, Endpoint:(*v1alpha1.Endpoint)(nil), MachineGroupRef:(*v1alpha1.Ref)(nil), Taints:[]v1.Taint(nil), Labels:map[string]string(nil), UpgradeRolloutStrategy:(*v1alpha1.ControlPlaneUpgradeRolloutStrategy)(nil), SkipLoadBalancerDeployment:false}, WorkerNodeGroupConfigurations:[]v1alpha1.WorkerNodeGroupConfiguration{v1alpha1.WorkerNodeGroupConfiguration{Name:\"md-0\", Count:(*int)(0xc000f8ddc0), AutoScalingConfiguration:(*v1alpha1.AutoScalingConfiguration)(nil), MachineGroupRef:(*v1alpha1.Ref)(nil), Taints:[]v1.Taint(nil), Labels:map[string]string(nil), UpgradeRolloutStrategy:(*v1alpha1.WorkerNodesUpgradeRolloutStrategy)(nil), KubernetesVersion:(*v1alpha1.KubernetesVersion)(nil)}}, DatacenterRef:v1alpha1.Ref{Kind:\"DockerDatacenterConfig\", Name:\"eksa-test-ae88bec\"}, IdentityProviderRefs:[]v1alpha1.Ref(nil), GitOpsRef:(*v1alpha1.Ref)(0xc000a6e040), ClusterNetwork:v1alpha1.ClusterNetwork{Pods:v1alpha1.Pods{CidrBlocks:[]string{\"192.168.0.0/16\"}}, Services:v1alpha1.Services{CidrBlocks:[]string{\"10.96.0.0/12\"}}, CNI:\"\", CNIConfig:(*v1alpha1.CNIConfig)(0xc000fc7a20), DNS:v1alpha1.DNS{ResolvConf:(*v1alpha1.ResolvConf)(nil)}, Nodes:(*v1alpha1.Nodes)(nil)}, ExternalEtcdConfiguration:(*v1alpha1.ExternalEtcdConfiguration)(0xc000fc7a40), ProxyConfiguration:(*v1alpha1.ProxyConfiguration)(nil), RegistryMirrorConfiguration:(*v1alpha1.RegistryMirrorConfiguration)(nil), 
ManagementCluster:v1alpha1.ManagementCluster{Name:\"eksa-test-ae88bec\"}, PodIAMConfig:(*v1alpha1.PodIAMConfig)(nil), Packages:(*v1alpha1.PackageConfiguration)(nil), BundlesRef:(*v1alpha1.BundlesRef)(0xc000fce960), EksaVersion:(*v1alpha1.EksaVersion)(0xc000fc7a30)}: cannot pass both bundlesRef and eksaVersion. New clusters should use eksaVersion instead of bundlesRef", Reason:"Invalid", Details:(*v1.StatusDetails)(0xc016d9d260), Code:422}}
This is strange because, in the original submission of the Cluster resource yaml, bundlesRef was nil.