
CloudStack upgrade e2e test is failing with timeout on move #1888

Closed
maxdrib opened this issue Apr 20, 2022 · 11 comments
Comments


maxdrib commented Apr 20, 2022

What happened:
Error message displayed below:

Error: failed to upgrade cluster: moving CAPI management from source to target: failed moving management cluster: Performing move...
Discovering Cluster API objects
Moving Cluster API objects Clusters=1
Moving Cluster API objects ClusterClasses=0
Creating objects in the target cluster
Error: action failed after 10 attempts: error creating "cluster.x-k8s.io/v1beta1, Kind=Cluster" eksa-system/eksa-test-2ffeb29: Internal error occurred: failed calling webhook "default.cluster.cluster.x-k8s.io": Post "https://capi-webhook-service.capi-system.svc:443/mutate-cluster-x-k8s-io-v1beta1-cluster?timeout=10s": context deadline exceeded

What you expected to happen:
Upgrade should succeed

How to reproduce it (as minimally and precisely as possible):
Execute the TestCloudStackKubernetes120RedhatTo121Upgrade e2e test against the code in main.

Anything else we need to know?:

Environment:

  • EKS Anywhere Release:
  • EKS Distro Release:

cluster.yaml.zip

maxdrib changed the title from "CloudStack upgrade breaks when dns networkConfig is set" to "CloudStack upgrade e2e test is failing with timeout on move" on Apr 20, 2022

maxdrib commented Apr 20, 2022

I was able to successfully perform an upgrade when using an older cluster spec. Comparing the cluster specs, I noticed a difference in the cluster.spec.clusterNetwork attributes, where the working spec had:

  clusterNetwork:
    cni: cilium
    pods: ...
    services: ...

and the broken spec had:

  clusterNetwork:
    cniConfig:
      cilium: {}
    dns: {}
    pods: ...
    services: ...

I continued trying to isolate the problem and was able to reproduce the issue by adding/removing the dns: {} attribute in the cluster spec. When present, it caused the cluster to hang on upgrade. Unfortunately, each test run takes ~45 minutes, so it is tedious to test changes.

The dns attribute in clusterConfig is not present by default when running generate clusterconfig, as it is explicitly removed for cleanliness here. However, with the marshalling and unmarshalling done by the e2e tests, it is re-added and appears to be causing failures.
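
A quick way to check whether a given spec carries the empty key before kicking off an upgrade (the file path below is from this particular test run; purely illustrative):

grep -A5 'clusterNetwork:' eksa-test-2ffeb29-config/cluster.yaml
# look for an explicit dns: {} line under clusterNetwork; in the runs above,
# removing that line (or regenerating the spec with eksctl anywhere generate
# clusterconfig, which strips it) avoided the hang, while adding it back
# reproduced it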


maxdrib commented Apr 20, 2022

Additional logs from running clusterctl with -v=9 during the move:

2022-04-20T14:56:50.779-0400	V4	Task start	{"task_name": "capi-management-move-to-workload"}
2022-04-20T14:56:50.779-0400	V0	Moving cluster management from bootstrap to workload cluster
2022-04-20T14:56:50.779-0400	V3	Waiting for management machines to be ready before move
2022-04-20T14:56:51.266-0400	V4	Nodes ready	{"total": 3}
2022-04-20T14:57:45.161-0400	V4	Task finished	{"task_name": "capi-management-move-to-workload", "duration": "54.379619357s"}
2022-04-20T14:57:45.161-0400	V4	----------------------------------
2022-04-20T14:57:45.161-0400	V4	Task start	{"task_name": "collect-cluster-diagnostics"}
2022-04-20T14:57:45.161-0400	V0	collecting cluster diagnostics
2022-04-20T14:57:45.161-0400	V0	collecting management cluster diagnostics
2022-04-20T14:57:45.170-0400	V3	bundle config written	{"path": "eksa-test-2ffeb29/generated/bootstrap-cluster-2022-04-20T14:57:45-04:00-bundle.yaml"}
2022-04-20T14:57:45.170-0400	V1	creating temporary namespace for diagnostic collector	{"namespace": "eksa-diagnostics"}
2022-04-20T14:57:45.603-0400	V1	creating temporary ClusterRole and RoleBinding for diagnostic collector
2022-04-20T14:57:46.582-0400	V0	⏳ Collecting support bundle from cluster, this can take a while	{"cluster": "bootstrap-cluster", "bundle": "eksa-test-2ffeb29/generated/bootstrap-cluster-2022-04-20T14:57:45-04:00-bundle.yaml", "since": "2022-04-20T13:57:45.170-0400", "kubeconfig": "eksa-test-2ffeb29/generated/eksa-test-2ffeb29.kind.kubeconfig"}
2022-04-20T14:59:47.731-0400	V0	Support bundle archive created	{"path": "support-bundle-2022-04-20T18_57_47.tar.gz"}
2022-04-20T14:59:47.731-0400	V0	Analyzing support bundle	{"bundle": "eksa-test-2ffeb29/generated/bootstrap-cluster-2022-04-20T14:57:45-04:00-bundle.yaml", "archive": "support-bundle-2022-04-20T18_57_47.tar.gz"}
2022-04-20T14:59:48.389-0400	V0	Analysis output generated	{"path": "eksa-test-2ffeb29/generated/bootstrap-cluster-2022-04-20T14:59:48-04:00-analysis.yaml"}
2022-04-20T14:59:48.389-0400	V1	cleaning up temporary roles for diagnostic collectors
2022-04-20T14:59:48.789-0400	V1	cleaning up temporary namespace  for diagnostic collectors	{"namespace": "eksa-diagnostics"}
2022-04-20T14:59:54.921-0400	V0	collecting workload cluster diagnostics
2022-04-20T14:59:54.925-0400	V3	bundle config written	{"path": "eksa-test-2ffeb29/generated/eksa-test-2ffeb29-2022-04-20T14:59:54-04:00-bundle.yaml"}
2022-04-20T14:59:54.925-0400	V1	creating temporary namespace for diagnostic collector	{"namespace": "eksa-diagnostics"}
2022-04-20T14:59:55.359-0400	V1	creating temporary ClusterRole and RoleBinding for diagnostic collector
2022-04-20T14:59:57.627-0400	V0	⏳ Collecting support bundle from cluster, this can take a while	{"cluster": "eksa-test-2ffeb29", "bundle": "eksa-test-2ffeb29/generated/eksa-test-2ffeb29-2022-04-20T14:59:54-04:00-bundle.yaml", "since": "2022-04-20T13:59:54.925-0400", "kubeconfig": "eksa-test-2ffeb29/eksa-test-2ffeb29-eks-a-cluster.kubeconfig"}
2022-04-20T15:01:59.961-0400	V0	Support bundle archive created	{"path": "support-bundle-2022-04-20T18_59_58.tar.gz"}
2022-04-20T15:01:59.961-0400	V0	Analyzing support bundle	{"bundle": "eksa-test-2ffeb29/generated/eksa-test-2ffeb29-2022-04-20T14:59:54-04:00-bundle.yaml", "archive": "support-bundle-2022-04-20T18_59_58.tar.gz"}
2022-04-20T15:02:00.656-0400	V0	Analysis output generated	{"path": "eksa-test-2ffeb29/generated/eksa-test-2ffeb29-2022-04-20T15:02:00-04:00-analysis.yaml"}
2022-04-20T15:02:00.656-0400	V1	cleaning up temporary roles for diagnostic collectors
2022-04-20T15:02:01.184-0400	V1	cleaning up temporary namespace  for diagnostic collectors	{"namespace": "eksa-diagnostics"}
2022-04-20T15:02:06.931-0400	V4	Task finished	{"task_name": "collect-cluster-diagnostics", "duration": "4m21.762248472s"}
2022-04-20T15:02:06.931-0400	V4	----------------------------------
2022-04-20T15:02:06.931-0400	V4	Tasks completed	{"duration": "33m53.267851981s"}
2022-04-20T15:02:06.931-0400	V3	Cleaning up long running container	{"name": "eksa_1650479292198787000"}
Error: failed to upgrade cluster: moving CAPI management from source to target: failed moving management cluster: No default config file available
Performing move...
Discovering Cluster API objects
KubeadmConfig Count=3
KubeadmControlPlane Count=1
CloudStackMachineTemplate Count=4
MachineHealthCheck Count=2
KubeadmConfigTemplate Count=1
MachineSet Count=2
ConfigMap Count=1
CloudStackCluster Count=1
MachineDeployment Count=1
Machine Count=3
Secret Count=9
CloudStackMachine Count=3
Cluster Count=1
Total objects Count=32
Excluding secret from move (not linked with any Cluster) name="default-token-89k7k"
Moving Cluster API objects Clusters=1
Moving Cluster API objects ClusterClasses=0
Pausing the source cluster
Set Cluster.Spec.Paused Paused=true Cluster="eksa-test-2ffeb29" Namespace="eksa-system"
Pausing the source cluster classes
Creating target namespaces, if missing
Creating objects in the target cluster
Creating Cluster="eksa-test-2ffeb29" Namespace="eksa-system"
Retrying with backoff Cause="error creating \"cluster.x-k8s.io/v1beta1, Kind=Cluster\" eksa-system/eksa-test-2ffeb29: Internal error occurred: failed calling webhook \"default.cluster.cluster.x-k8s.io\": Post \"https://capi-webhook-service.capi-system.svc:443/mutate-cluster-x-k8s-io-v1beta1-cluster?timeout=10s\": read tcp 172.16.0.48:47812-\u003e10.108.124.180:443: read: connection reset by peer"
Creating Cluster="eksa-test-2ffeb29" Namespace="eksa-system"
Retrying with backoff Cause="error creating \"cluster.x-k8s.io/v1beta1, Kind=Cluster\" eksa-system/eksa-test-2ffeb29: Internal error occurred: failed calling webhook \"default.cluster.cluster.x-k8s.io\": Post \"https://capi-webhook-service.capi-system.svc:443/mutate-cluster-x-k8s-io-v1beta1-cluster?timeout=10s\": read tcp 172.16.0.48:47832-\u003e10.108.124.180:443: read: connection reset by peer"
Creating Cluster="eksa-test-2ffeb29" Namespace="eksa-system"
Retrying with backoff Cause="error creating \"cluster.x-k8s.io/v1beta1, Kind=Cluster\" eksa-system/eksa-test-2ffeb29: Internal error occurred: failed calling webhook \"default.cluster.cluster.x-k8s.io\": Post \"https://capi-webhook-service.capi-system.svc:443/mutate-cluster-x-k8s-io-v1beta1-cluster?timeout=10s\": read tcp 172.16.0.48:47858-\u003e10.108.124.180:443: read: connection reset by peer"
Creating Cluster="eksa-test-2ffeb29" Namespace="eksa-system"
Retrying with backoff Cause="error creating \"cluster.x-k8s.io/v1beta1, Kind=Cluster\" eksa-system/eksa-test-2ffeb29: Internal error occurred: failed calling webhook \"default.cluster.cluster.x-k8s.io\": Post \"https://capi-webhook-service.capi-system.svc:443/mutate-cluster-x-k8s-io-v1beta1-cluster?timeout=10s\": read tcp 172.16.0.48:47890-\u003e10.108.124.180:443: read: connection reset by peer"
Creating Cluster="eksa-test-2ffeb29" Namespace="eksa-system"
Retrying with backoff Cause="error creating \"cluster.x-k8s.io/v1beta1, Kind=Cluster\" eksa-system/eksa-test-2ffeb29: Internal error occurred: failed calling webhook \"default.cluster.cluster.x-k8s.io\": Post \"https://capi-webhook-service.capi-system.svc:443/mutate-cluster-x-k8s-io-v1beta1-cluster?timeout=10s\": read tcp 172.16.0.48:47934-\u003e10.108.124.180:443: read: connection reset by peer"
Creating Cluster="eksa-test-2ffeb29" Namespace="eksa-system"
Retrying with backoff Cause="error creating \"cluster.x-k8s.io/v1beta1, Kind=Cluster\" eksa-system/eksa-test-2ffeb29: Internal error occurred: failed calling webhook \"default.cluster.cluster.x-k8s.io\": Post \"https://capi-webhook-service.capi-system.svc:443/mutate-cluster-x-k8s-io-v1beta1-cluster?timeout=10s\": read tcp 172.16.0.48:48010-\u003e10.108.124.180:443: read: connection reset by peer"
Creating Cluster="eksa-test-2ffeb29" Namespace="eksa-system"
Retrying with backoff Cause="error creating \"cluster.x-k8s.io/v1beta1, Kind=Cluster\" eksa-system/eksa-test-2ffeb29: Internal error occurred: failed calling webhook \"default.cluster.cluster.x-k8s.io\": Post \"https://capi-webhook-service.capi-system.svc:443/mutate-cluster-x-k8s-io-v1beta1-cluster?timeout=10s\": read tcp 172.16.0.48:48098-\u003e10.108.124.180:443: read: connection reset by peer"
Creating Cluster="eksa-test-2ffeb29" Namespace="eksa-system"
Retrying with backoff Cause="error creating \"cluster.x-k8s.io/v1beta1, Kind=Cluster\" eksa-system/eksa-test-2ffeb29: Internal error occurred: failed calling webhook \"default.cluster.cluster.x-k8s.io\": Post \"https://capi-webhook-service.capi-system.svc:443/mutate-cluster-x-k8s-io-v1beta1-cluster?timeout=10s\": read tcp 172.16.0.48:48238-\u003e10.108.124.180:443: read: connection reset by peer"
Creating Cluster="eksa-test-2ffeb29" Namespace="eksa-system"
Retrying with backoff Cause="error creating \"cluster.x-k8s.io/v1beta1, Kind=Cluster\" eksa-system/eksa-test-2ffeb29: Internal error occurred: failed calling webhook \"default.cluster.cluster.x-k8s.io\": Post \"https://capi-webhook-service.capi-system.svc:443/mutate-cluster-x-k8s-io-v1beta1-cluster?timeout=10s\": read tcp 172.16.0.48:48462-\u003e10.108.124.180:443: read: connection reset by peer"
Creating Cluster="eksa-test-2ffeb29" Namespace="eksa-system"
Error: action failed after 10 attempts: error creating "cluster.x-k8s.io/v1beta1, Kind=Cluster" eksa-system/eksa-test-2ffeb29: Internal error occurred: failed calling webhook "default.cluster.cluster.x-k8s.io": Post "https://capi-webhook-service.capi-system.svc:443/mutate-cluster-x-k8s-io-v1beta1-cluster?timeout=10s": read tcp 172.16.0.48:48772->10.108.124.180:443: read: connection reset by peer

    cluster.go:529: Command eksctl anywhere [upgrade cluster -f eksa-test-2ffeb29-config/cluster.yaml -v 4 --bundles-override bin/local-bundle-release.yaml] failed with error: exit status 255: Error: action failed after 10 attempts: error creating "cluster.x-k8s.io/v1beta1, Kind=Cluster" eksa-system/eksa-test-2ffeb29: Internal error occurred: failed calling webhook "default.cluster.cluster.x-k8s.io": Post "https://capi-webhook-service.capi-system.svc:443/mutate-cluster-x-k8s-io-v1beta1-cluster?timeout=10s": read tcp 172.16.0.48:48772->10.108.124.180:443: read: connection reset by peer
2022-04-20T15:02:07.327-0400	V3	e2e	Cleaning up long running container	{"name": "eksa_1650478924208997000"}
--- FAIL: TestCloudStackKubernetes120RedhatTo121Upgrade (2403.40s)
FAIL
{"error":"missing_field_value","ok":false,"response_metadata":{"messages":["[ERROR] empty required field: 'args'"]}}curl: (3) unmatched close brace/bracket in URL position 3:
0"}


maxdrib commented Apr 20, 2022

Logs and resources on the cluster here: support-bundle-2022-04-19T18_28_43.zip

After the move failed, I tried applying the eksa cluster spec manually on the workload cluster, and was able to reproduce the connection reset error:

➜  eks-anywhere git:(main) ✗ k apply -f eksa-test-2ffeb29/eksa-test-2ffeb29-eks-a-cluster.yaml --kubeconfig eksa-test-2ffeb29/eksa-test-2ffeb29-eks-a-cluster.kubeconfig
cloudstackmachineconfig.anywhere.eks.amazonaws.com/eksa-test-2ffeb29-cp unchanged
cloudstackmachineconfig.anywhere.eks.amazonaws.com/eksa-test-2ffeb29 unchanged
Error from server (InternalError): error when applying patch:
{"metadata":{"annotations":{"anywhere.eks.amazonaws.com/paused":null,"kubectl.kubernetes.io/last-applied-configuration":"{\"apiVersion\":\"anywhere.eks.amazonaws.com/v1alpha1\",\"kind\":\"Cluster\",\"metadata\":{\"annotations\":{},\"name\":\"eksa-test-2ffeb29\",\"namespace\":\"default\"},\"spec\":{\"clusterNetwork\":{\"cniConfig\":{\"cilium\":{}},\"pods\":{\"cidrBlocks\":[\"192.169.0.0/16\"]},\"services\":{\"cidrBlocks\":[\"10.96.0.0/12\"]}},\"controlPlaneConfiguration\":{\"count\":1,\"endpoint\":{\"host\":\"172.16.0.31:6443\"},\"machineGroupRef\":{\"kind\":\"CloudStackMachineConfig\",\"name\":\"eksa-test-2ffeb29-cp\"}},\"datacenterRef\":{\"kind\":\"CloudStackDatacenterConfig\",\"name\":\"eksa-test-2ffeb29\"},\"kubernetesVersion\":\"1.20\",\"managementCluster\":{\"name\":\"eksa-test-2ffeb29\"},\"workerNodeGroupConfigurations\":[{\"count\":2,\"machineGroupRef\":{\"kind\":\"CloudStackMachineConfig\",\"name\":\"eksa-test-2ffeb29\"},\"name\":\"md-0\"}]}}\n"}}}
to:
Resource: "anywhere.eks.amazonaws.com/v1alpha1, Resource=clusters", GroupVersionKind: "anywhere.eks.amazonaws.com/v1alpha1, Kind=Cluster"
Name: "eksa-test-2ffeb29", Namespace: "default"
for: "eksa-test-2ffeb29/eksa-test-2ffeb29-eks-a-cluster.yaml": Internal error occurred: failed calling webhook "validation.cluster.anywhere.amazonaws.com": Post "https://eksa-webhook-service.eksa-system.svc:443/validate-anywhere-eks-amazonaws-com-v1alpha1-cluster?timeout=10s": read tcp 172.16.0.48:52988->10.103.244.98:443: read: connection reset by peer
Error from server (InternalError): error when applying patch:
{"metadata":{"annotations":{"anywhere.eks.amazonaws.com/paused":null,"kubectl.kubernetes.io/last-applied-configuration":"{\"apiVersion\":\"anywhere.eks.amazonaws.com/v1alpha1\",\"kind\":\"CloudStackDatacenterConfig\",\"metadata\":{\"annotations\":{},\"name\":\"eksa-test-2ffeb29\",\"namespace\":\"default\"},\"spec\":{\"account\":\"admin\",\"domain\":\"ROOT\",\"managementApiEndpoint\":\"http://172.16.0.1:8080/client/api\",\"zones\":[{\"name\":\"zone1\",\"network\":{\"name\":\"Shared1\"}}]}}\n"}}}
to:
Resource: "anywhere.eks.amazonaws.com/v1alpha1, Resource=cloudstackdatacenterconfigs", GroupVersionKind: "anywhere.eks.amazonaws.com/v1alpha1, Kind=CloudStackDatacenterConfig"
Name: "eksa-test-2ffeb29", Namespace: "default"
for: "eksa-test-2ffeb29/eksa-test-2ffeb29-eks-a-cluster.yaml": Internal error occurred: failed calling webhook "validation.cloudstackdatacenterconfig.anywhere.amazonaws.com": Post "https://eksa-webhook-service.eksa-system.svc:443/validate-anywhere-eks-amazonaws-com-v1alpha1-cloudstackdatacenterconfig?timeout=10s": read tcp 172.16.0.48:52992->10.103.244.98:443: read: connection reset by peer


maxdrib commented Apr 20, 2022

After merging from main to include the latest changes, the test suddenly passed. Will rerun.

--- PASS: TestCloudStackKubernetes120RedhatTo121Upgrade (1919.60s)
PASS

Edit: The rerun failed with the same CAPI webhook error on the move to the workload cluster, so the failure seems to be nondeterministic, regardless of the dns: {} observations above.


maxdrib commented Apr 21, 2022

Another failure was observed, this time during the move of CAPI management from the workload cluster to the bootstrap cluster:

2022-04-21T09:15:44.846-0400	V4	----------------------------------
2022-04-21T09:15:44.846-0400	V4	Task start	{"task_name": "capi-management-move-to-bootstrap"}
2022-04-21T09:15:44.846-0400	V0	Moving cluster management from workload to bootstrap cluster
2022-04-21T09:15:44.846-0400	V3	Waiting for management machines to be ready before move
2022-04-21T09:15:45.323-0400	V4	Nodes ready	{"total": 3}
2022-04-21T09:16:02.710-0400	V3	Waiting for control planes to be ready after move
2022-04-21T09:16:10.256-0400	V3	Waiting for workload cluster control plane replicas to be ready after move
2022-04-21T09:16:10.940-0400	V3	Waiting for workload cluster machine deployment replicas to be ready after move

2022-04-21T10:23:29.774-0400	V4	Task finished	{"task_name": "capi-management-move-to-bootstrap", "duration": "30m27.257060618s"}
2022-04-21T10:23:29.775-0400	V4	----------------------------------
2022-04-21T10:23:29.775-0400	V4	Task start	{"task_name": "collect-cluster-diagnostics"}
2022-04-21T10:23:29.775-0400	V0	collecting cluster diagnostics
2022-04-21T10:23:29.775-0400	V0	collecting management cluster diagnostics
2022-04-21T10:23:29.789-0400	V3	bundle config written	{"path": "eksa-test-2ffeb29/generated/bootstrap-cluster-2022-04-21T10:23:29-04:00-bundle.yaml"}
2022-04-21T10:23:29.789-0400	V1	creating temporary namespace for diagnostic collector	{"namespace": "eksa-diagnostics"}
2022-04-21T10:23:30.184-0400	V1	creating temporary ClusterRole and RoleBinding for diagnostic collector
2022-04-21T10:23:34.643-0400	V0	⏳ Collecting support bundle from cluster, this can take a while	{"cluster": "bootstrap-cluster", "bundle": "eksa-test-2ffeb29/generated/bootstrap-cluster-2022-04-21T10:23:29-04:00-bundle.yaml", "since": "2022-04-21T09:23:29.789-0400", "kubeconfig": "eksa-test-2ffeb29/generated/eksa-test-2ffeb29.kind.kubeconfig"}
2022-04-21T10:25:20.036-0400	V0	Support bundle archive created	{"path": "support-bundle-2022-04-21T14_23_35.tar.gz"}
2022-04-21T10:25:20.037-0400	V0	Analyzing support bundle	{"bundle": "eksa-test-2ffeb29/generated/bootstrap-cluster-2022-04-21T10:23:29-04:00-bundle.yaml", "archive": "support-bundle-2022-04-21T14_23_35.tar.gz"}
2022-04-21T10:25:20.891-0400	V0	Analysis output generated	{"path": "eksa-test-2ffeb29/generated/bootstrap-cluster-2022-04-21T10:25:20-04:00-analysis.yaml"}
2022-04-21T10:25:20.891-0400	V1	cleaning up temporary roles for diagnostic collectors
2022-04-21T10:25:21.454-0400	V1	cleaning up temporary namespace  for diagnostic collectors	{"namespace": "eksa-diagnostics"}
2022-04-21T10:25:28.151-0400	V0	collecting workload cluster diagnostics
2022-04-21T10:25:28.165-0400	V3	bundle config written	{"path": "eksa-test-2ffeb29/generated/eksa-test-2ffeb29-2022-04-21T10:25:28-04:00-bundle.yaml"}
2022-04-21T10:25:28.165-0400	V1	creating temporary namespace for diagnostic collector	{"namespace": "eksa-diagnostics"}
2022-04-21T10:28:20.358-0400	V0	WARNING: failed to create eksa-diagnostics namespace. Some collectors may fail to run.	{"err": "creating namespace eksa-diagnostics: Unable to connect to the server: dial tcp 172.16.0.31:6443: i/o timeout\n"}
2022-04-21T10:28:20.359-0400	V1	creating temporary ClusterRole and RoleBinding for diagnostic collector
2022-04-21T10:31:12.367-0400	V0	WARNING: failed to create roles for eksa-diagnostic-collector. Some collectors may fail to run.	{"err": "executing apply: Unable to connect to the server: dial tcp 172.16.0.31:6443: i/o timeout\n"}
2022-04-21T10:31:12.367-0400	V0	⏳ Collecting support bundle from cluster, this can take a while	{"cluster": "eksa-test-2ffeb29", "bundle": "eksa-test-2ffeb29/generated/eksa-test-2ffeb29-2022-04-21T10:25:28-04:00-bundle.yaml", "since": "2022-04-21T09:25:28.165-0400", "kubeconfig": "eksa-test-2ffeb29/eksa-test-2ffeb29-eks-a-cluster.kubeconfig"}
2022-04-21T10:31:42.836-0400	V4	Task finished	{"task_name": "collect-cluster-diagnostics", "duration": "8m13.073853654s"}
2022-04-21T10:31:42.837-0400	V4	----------------------------------
2022-04-21T10:31:42.837-0400	V4	Tasks completed	{"duration": "40m51.572074229s"}
2022-04-21T10:31:42.838-0400	V3	Cleaning up long running container	{"name": "eksa_1650546812263508000"}
Error: failed to upgrade cluster: waiting for workload cluster machinedeployment replicas to be ready: retries exhausted waiting for machinedeployment replicas to be ready: machine deployment is in  phase
    cluster.go:529: Command eksctl anywhere [upgrade cluster -f eksa-test-2ffeb29-config/cluster.yaml -v 4 --bundles-override bin/local-bundle-release.yaml] failed with error: exit status 255: Error: failed to upgrade cluster: waiting for workload cluster machinedeployment replicas to be ready: retries exhausted waiting for machinedeployment replicas to be ready: machine deployment is in  phase
2022-04-21T10:31:44.090-0400	V3	e2e	Cleaning up long running container	{"name": "eksa_1650546394748099000"}
--- FAIL: TestCloudStackKubernetes120RedhatTo121Upgrade (2872.14s)
FAIL

And the CAPI logs say the CloudStackMachineTemplate referenced by the MachineSet "eksa-test-2ffeb29-md-0-6b77c5758b" can't be found:

E0421 14:20:57.281244       1 controller.go:317] controller/machineset "msg"="Reconciler error" "error"="failed to retrieve CloudStackMachineTemplate external object \"eksa-system\"/\"eksa-test-2ffeb29-md-0-1650546540971\": cloudstackmachinetemplates.infrastructure.cluster.x-k8s.io \"eksa-test-2ffeb29-md-0-1650546540971\" not found" "name"="eksa-test-2ffeb29-md-0-6b77c5758b" "namespace"="eksa-system" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="MachineSet" 

Full support bundle
support-bundle-2022-04-21T14_23_35.zip

maxdrib added and removed the external (An issue, bug or feature request filed from outside the AWS org) label on Apr 21, 2022

maxdrib commented Apr 21, 2022

Guillermo's theory: it could be a race condition after the upgrade where the worker nodes are still being rolled out. In that case, there is a chance the CAPI webhooks (which run in a controller on the worker nodes) are not ready yet.


maxdrib commented Apr 25, 2022

We were able to fix the broken cluster by deleting and reapplying the cilium daemonset.
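
For reference, the manual workaround amounts to rolling the Cilium daemonset on the workload cluster. A minimal sketch, assuming the default EKS-A install (daemonset named cilium in kube-system; the kubeconfig path is the one from this test run):

# roll the daemonset so every node gets a fresh cilium pod
kubectl --kubeconfig eksa-test-2ffeb29/eksa-test-2ffeb29-eks-a-cluster.kubeconfig \
  -n kube-system rollout restart daemonset/cilium

# or, more surgically, delete just the cilium pod on the broken CP node
kubectl --kubeconfig eksa-test-2ffeb29/eksa-test-2ffeb29-eks-a-cluster.kubeconfig \
  -n kube-system delete pod -l k8s-app=cilium \
  --field-selector spec.nodeName=<cp-node-name>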

maxdrib added this to the next milestone on Apr 27, 2022

maxdrib commented Apr 28, 2022

Looking in the kube-proxy logs, I saw some errors like the ones described in kubernetes/kubernetes#107482.


maxdrib commented Apr 28, 2022

Also, when I try running the cilium connectivity test, it often fails to start, with logs very similar to cilium/cilium-cli#342.
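
For context, that check is the upstream cilium-cli connectivity suite run against the workload cluster; roughly (kubeconfig path is from this run, and cilium-cli picks it up via the standard kubeconfig resolution):

export KUBECONFIG=eksa-test-2ffeb29/eksa-test-2ffeb29-eks-a-cluster.kubeconfig
cilium connectivity test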


maxdrib commented Apr 29, 2022

Experiment: query the cert-manager-webhook in different ways from different pods on the CP VM.

TL;DR: Queries from the kube-apiserver to the cert-manager-webhook succeed when using the webhook's pod endpoint, but are reset by peer when using the ClusterIP endpoint. DNS appears to be working fine.
This leads me to suspect an issue with kube-proxy, which could be mapping the ClusterIP address to an incorrect pod IP.
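
Roughly what the experiment looked like; the service name, namespace, and ports are assumptions based on a default cert-manager install, and the IPs are illustrative placeholders:

# find the ClusterIP and the backing pod endpoint for the webhook service
kubectl -n cert-manager get svc cert-manager-webhook -o wide
kubectl -n cert-manager get endpoints cert-manager-webhook

# from a shell on the CP VM (or a pod running on it): the pod endpoint answers
curl -vk https://<webhook-pod-ip>:10250/

# from the same place: the ClusterIP gets "connection reset by peer"
curl -vk https://<webhook-cluster-ip>:443/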

We tried restarting kubelet, deleting the kube-proxy pod, comparing /etc/hosts contents between pods, and running cilium status on the cilium pods. It seems like the issue is limited to static pods running on the CP node.

Eventually, deleting the cilium pod running on the CP VM allowed the cluster to return to a stable state.


maxdrib commented May 2, 2022

Latest update: comparing iptables between the CP node and the worker node, we observe that some Cilium-related rules are missing. We suspect that reinstalling cilium will re-add these rules and fix the broken cluster.
Screen Shot 2022-05-02 at 3 28 22 PM
broken_iptables.log
normal_iptables.log
broken_cilium_logs.log
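
The comparison above is straightforward to reproduce on the nodes; a minimal sketch (node names are placeholders, run over SSH):

# dump the Cilium-related rules on each node and diff them
ssh <cp-node> 'sudo iptables-save | grep -i cilium | sort' > cp_cilium_rules.txt
ssh <worker-node> 'sudo iptables-save | grep -i cilium | sort' > worker_cilium_rules.txt
diff cp_cilium_rules.txt worker_cilium_rules.txt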

mitalipaygude modified the milestones: next → next+1 on May 5, 2022
maxdrib added a commit to maxdrib/eks-anywhere that referenced this issue May 6, 2022
vivek-koppuru pushed a commit that referenced this issue May 7, 2022
…sue (#2057)

* Restarting cilium daemonset on upgrade workflow to address #1888

* moving cilium restart to cluster manager

* Fixing unit tests

* Improving unit test coverage

* Renaming daemonset restart method to be generic

* Renaming tests to decouple from cilium
eks-distro-pr-bot pushed a commit to eks-distro-pr-bot/eks-anywhere that referenced this issue May 9, 2022
eks-distro-bot pushed a commit that referenced this issue May 9, 2022
…ess upgrade issue (#2068)

* Restarting cilium daemonset on upgrade workflow to address #1888

* moving cilium restart to cluster manager

* Fixing unit tests

* Improving unit test coverage

* Renaming daemonset restart method to be generic

* Renaming tests to decouple from cilium

Co-authored-by: Max Dribinsky <[email protected]>
maxdrib closed this as completed on May 9, 2022