CLI stalls on upgrade with Flux Gitops when applying eksa yaml resources to cluster during upgrade #6453

Closed
cxbrowne1207 opened this issue Aug 9, 2023 · 0 comments
cxbrowne1207 commented Aug 9, 2023

During upgrade, the CLI attempts to force apply (kubectl apply --force) the Cluster spec at the "Applying eksa yaml resources to cluster" step. If the cluster webhook returns a validation error at this point, the CLI stalls until it eventually times out. Normally, the forced apply would delete the resource and recreate it, but instead the CLI gets stuck. This situation has primarily been observed when using Flux.

At this stage, the eksa-controller-manager outputs the following logs:

{"ts":1690488485167.4885,"caller":"controllers/cluster_controller.go:217","msg":"Reconciling cluster","v":0,"controller":"cluster","controllerGroup":"anywhere.eks.amazonaws.com","controllerKind":"Cluster","Cluster":{"name":"eksa-test-ae88bec","namespace":"default"},"namespace":"default","name":"eksa-test-ae88bec","reconcileID":"199cb160-1384-479f-b9f7-32dabcd6129e"}
{"ts":1690488485167.728,"caller":"controllers/cluster_controller.go:418","msg":"Updating cluster status","v":0,"controller":"cluster","controllerGroup":"anywhere.eks.amazonaws.com","controllerKind":"Cluster","Cluster":{"name":"eksa-test-ae88bec","namespace":"default"},"namespace":"default","name":"eksa-test-ae88bec","reconcileID":"199cb160-1384-479f-b9f7-32dabcd6129e"}
{"ts":1690488485168.4016,"caller":"controller/controller.go:329","msg":"Reconciler error","controller":"cluster","controllerGroup":"anywhere.eks.amazonaws.com","controllerKind":"Cluster","Cluster":{"name":"eksa-test-ae88bec","namespace":"default"},"namespace":"default","name":"eksa-test-ae88bec","reconcileID":"199cb160-1384-479f-b9f7-32dabcd6129e","err":"deleting self-managed clusters is not supported","errVerbose":"deleting self-managed clusters is not supported\ngithub.com/aws/eks-anywhere/controllers.(*ClusterReconciler).reconcileDelete\n\tgithub.com/aws/eks-anywhere/controllers/cluster_controller.go:444\ngithub.com/aws/eks-anywhere/controllers.(*ClusterReconciler).Reconcile\n\tgithub.com/aws/eks-anywhere/controllers/cluster_controller.go:262\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\tsigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:122\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:323\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tsigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235\nruntime.goexit\n\truntime/asm_amd64.s:1598"}
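One way to read the reconcileDelete error above: if the forced apply falls back to deleting the Cluster object, that deletion can never finish, because the controller refuses to delete a self-managed cluster, so the CLI waits until its timeout. The Python sketch below models that sequence under this assumption. It is a simplified illustration, not the kubectl or EKS Anywhere implementation, and every name in it (FakeCluster, reconcile_delete, force_apply, the finalizer string) is hypothetical.

```python
from dataclasses import dataclass, field


class ReconcileError(Exception):
    """Raised when the modelled controller cannot make progress."""


@dataclass
class FakeCluster:
    name: str
    self_managed: bool
    # Hypothetical finalizer; stands in for whatever blocks object removal.
    finalizers: list = field(default_factory=lambda: ["hypothetical/cluster-finalizer"])
    deleting: bool = False  # stands in for metadata.deletionTimestamp being set


def reconcile_delete(cluster):
    """Model of a delete reconcile: the finalizer is removed only if deletion is allowed."""
    if cluster.self_managed:
        # Same message as in the controller log above.
        raise ReconcileError("deleting self-managed clusters is not supported")
    cluster.finalizers.clear()


def force_apply(cluster, patch_is_valid, max_attempts=5):
    """Model of a forced apply: patch first, fall back to delete-and-recreate.

    max_attempts stands in for the CLI timeout.
    """
    if patch_is_valid:
        return "patched"
    cluster.deleting = True  # the fallback delete marks the object for deletion
    for _ in range(max_attempts):
        try:
            reconcile_delete(cluster)
        except ReconcileError:
            continue  # finalizer still present, object still exists, keep waiting
        if not cluster.finalizers:
            return "recreated"
    return "timed out"
```

In this model, a rejected patch against a self-managed cluster ends in "timed out", while the same rejected patch against a cluster whose deletion is permitted ends in "recreated".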

Example Scenario

When upgrading from the latest minor release to v0.17.0, the CLI stalled while "Applying eksa yaml resources to cluster". In this case, both the newly added eksaVersion field and bundlesRef were populated in the request submitted to the kube-apiserver. However, a validation webhook does not allow both to be set at the same time, so the request was rejected with an error that shows up in the kube-apiserver pod logs. This issue was roughly patched by creating ClusterSpecGenerate, which generates a yaml in which bundlesRef is explicitly null, so that the field is not omitted when submitted to the kube-apiserver.
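The mutual-exclusion rule described above can be sketched in a few lines of Python. This is an illustration of the rule only, not the webhook's actual code: the function name validate_cluster_spec and the dict-based spec are assumptions, and the error text is the message the webhook reported in this incident.

```python
def validate_cluster_spec(spec):
    """Return an error message if both fields are set, otherwise None.

    `spec` is a plain dict standing in for the Cluster spec (an assumption
    made for this sketch; the real webhook works on typed objects).
    """
    has_bundles_ref = spec.get("bundlesRef") is not None
    has_eksa_version = spec.get("eksaVersion") is not None
    if has_bundles_ref and has_eksa_version:
        return (
            "cannot pass both bundlesRef and eksaVersion. "
            "New clusters should use eksaVersion instead of bundlesRef"
        )
    return None
```

Note that a bundlesRef that is present but null counts as unset here, which matches the workaround of submitting bundlesRef as an explicit null.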

kube-apiserver pod

W0727 21:34:57.770016       1 dispatcher.go:216] rejected by webhook "validation.cluster.anywhere.amazonaws.com": &errors.StatusError{ErrStatus:v1.Status{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ListMeta:v1.ListMeta{SelfLink:"", ResourceVersion:"", Continue:"", RemainingItemCount:(*int64)(nil)}, Status:"Failure", Message:"admission webhook \"validation.cluster.anywhere.amazonaws.com\" denied the request: Cluster.anywhere.eks.amazonaws.com \"eksa-test-ae88bec\" is invalid: spec: Invalid value: v1alpha1.ClusterSpec{KubernetesVersion:\"1.27\", ControlPlaneConfiguration:v1alpha1.ControlPlaneConfiguration{Count:1, Endpoint:(*v1alpha1.Endpoint)(nil), MachineGroupRef:(*v1alpha1.Ref)(nil), Taints:[]v1.Taint(nil), Labels:map[string]string(nil), UpgradeRolloutStrategy:(*v1alpha1.ControlPlaneUpgradeRolloutStrategy)(nil), SkipLoadBalancerDeployment:false}, WorkerNodeGroupConfigurations:[]v1alpha1.WorkerNodeGroupConfiguration{v1alpha1.WorkerNodeGroupConfiguration{Name:\"md-0\", Count:(*int)(0xc000f8ddc0), AutoScalingConfiguration:(*v1alpha1.AutoScalingConfiguration)(nil), MachineGroupRef:(*v1alpha1.Ref)(nil), Taints:[]v1.Taint(nil), Labels:map[string]string(nil), UpgradeRolloutStrategy:(*v1alpha1.WorkerNodesUpgradeRolloutStrategy)(nil), KubernetesVersion:(*v1alpha1.KubernetesVersion)(nil)}}, DatacenterRef:v1alpha1.Ref{Kind:\"DockerDatacenterConfig\", Name:\"eksa-test-ae88bec\"}, IdentityProviderRefs:[]v1alpha1.Ref(nil), GitOpsRef:(*v1alpha1.Ref)(0xc000a6e040), ClusterNetwork:v1alpha1.ClusterNetwork{Pods:v1alpha1.Pods{CidrBlocks:[]string{\"192.168.0.0/16\"}}, Services:v1alpha1.Services{CidrBlocks:[]string{\"10.96.0.0/12\"}}, CNI:\"\", CNIConfig:(*v1alpha1.CNIConfig)(0xc000fc7a20), DNS:v1alpha1.DNS{ResolvConf:(*v1alpha1.ResolvConf)(nil)}, Nodes:(*v1alpha1.Nodes)(nil)}, ExternalEtcdConfiguration:(*v1alpha1.ExternalEtcdConfiguration)(0xc000fc7a40), ProxyConfiguration:(*v1alpha1.ProxyConfiguration)(nil), RegistryMirrorConfiguration:(*v1alpha1.RegistryMirrorConfiguration)(nil), 
ManagementCluster:v1alpha1.ManagementCluster{Name:\"eksa-test-ae88bec\"}, PodIAMConfig:(*v1alpha1.PodIAMConfig)(nil), Packages:(*v1alpha1.PackageConfiguration)(nil), BundlesRef:(*v1alpha1.BundlesRef)(0xc000fce960), EksaVersion:(*v1alpha1.EksaVersion)(0xc000fc7a30)}: cannot pass both bundlesRef and eksaVersion. New clusters should use eksaVersion instead of bundlesRef", Reason:"Invalid", Details:(*v1.StatusDetails)(0xc016d9d260), Code:422}}

This is strange because, in the original submission of the Cluster resource yaml, bundlesRef was nil.

cluster.yaml

apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  annotations:
    anywhere.eks.amazonaws.com/paused: "true"
  name: eksa-test-ae88bec
  namespace: default
spec:
  clusterNetwork:
    cniConfig:
      kindnetd: {}
    pods:
      cidrBlocks:
      - 192.168.0.0/16
    services:
      cidrBlocks:
      - 10.96.0.0/12
  controlPlaneConfiguration:
    count: 1
  datacenterRef:
    kind: DockerDatacenterConfig
    name: eksa-test-ae88bec
  eksaVersion: v0.0.0-dev-release-0.17+build.47
  externalEtcdConfiguration:
    count: 1
  gitOpsRef:
    kind: FluxConfig
    name: eksa-test-rhari
  kubernetesVersion: "1.27"
  managementCluster:
    name: eksa-test-ae88bec
  workerNodeGroupConfigurations:
  - count: 1
    name: md-0

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: DockerDatacenterConfig
metadata:
  annotations:
    anywhere.eks.amazonaws.com/paused: "true"
  name: eksa-test-ae88bec
  namespace: default
spec: {}

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: FluxConfig
metadata:
  name: eksa-test-rhari
  namespace: default
spec:
  branch: main
  clusterConfigPath: path2
  github:
    owner: cxbrowne1207
    personal: true
    repository: eksa-test-gitops-flux-test
  systemNamespace: default

---
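One possible explanation for the server seeing both fields even though the submitted yaml has no bundlesRef: under merge-patch semantics, a key that is absent from the patch leaves the stored value untouched, while a key set to null removes it. The sketch below implements JSON Merge Patch (RFC 7386) in Python to show the difference. Whether this is the exact mechanism behind this issue is an assumption (kubectl's apply also consults the last-applied configuration), and the object contents, including the bundlesRef name, are made up for illustration.

```python
def json_merge_patch(target, patch):
    """Apply a JSON Merge Patch (RFC 7386) and return the result as a new object."""
    if not isinstance(patch, dict):
        return patch  # non-object patches replace the target wholesale
    if not isinstance(target, dict):
        target = {}
    result = dict(target)
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)  # explicit null removes the key
        else:
            result[key] = json_merge_patch(result.get(key), value)
    return result


# Hypothetical stored object that still carries bundlesRef from before the upgrade.
live = {"spec": {"kubernetesVersion": "1.27", "bundlesRef": {"name": "bundles-1"}}}

# bundlesRef omitted from the patch: the stored value survives next to eksaVersion.
omitted = json_merge_patch(live, {"spec": {"eksaVersion": "v0.17.0"}})

# bundlesRef sent as an explicit null: the stored value is removed.
cleared = json_merge_patch(live, {"spec": {"eksaVersion": "v0.17.0", "bundlesRef": None}})

print(sorted(omitted["spec"]))
print(sorted(cleared["spec"]))
```

The first result contains both bundlesRef and eksaVersion, which is the combination the webhook rejects. The second contains only eksaVersion, which is consistent with the workaround of generating a yaml where bundlesRef is explicitly null.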