
CloudStack upgrade e2e test is failing with timeout on move #1888

Closed
maxdrib opened this issue Apr 20, 2022 · 11 comments
Comments


maxdrib commented Apr 20, 2022

What happened:
Error message displayed below:

Error: failed to upgrade cluster: moving CAPI management from source to target: failed moving management cluster: Performing move...
Discovering Cluster API objects
Moving Cluster API objects Clusters=1
Moving Cluster API objects ClusterClasses=0
Creating objects in the target cluster
Error: action failed after 10 attempts: error creating "cluster.x-k8s.io/v1beta1, Kind=Cluster" eksa-system/eksa-test-2ffeb29: Internal error occurred: failed calling webhook "default.cluster.cluster.x-k8s.io": Post "https://capi-webhook-service.capi-system.svc:443/mutate-cluster-x-k8s-io-v1beta1-cluster?timeout=10s": context deadline exceeded

What you expected to happen:
Upgrade should succeed

How to reproduce it (as minimally and precisely as possible):
Execute the TestCloudStackKubernetes120RedhatTo121Upgrade e2e test against the code in main.

Anything else we need to know?:

Environment:

  • EKS Anywhere Release:
  • EKS Distro Release:

cluster.yaml.zip

maxdrib changed the title from "CloudStack upgrade breaks when dns networkConfig is set" to "CloudStack upgrade e2e test is failing with timeout on move" on Apr 20, 2022

maxdrib commented Apr 20, 2022

I was able to successfully perform an upgrade when using an older cluster spec. Comparing the cluster specs, I noticed a difference in the cluster.spec.clusterNetwork attributes, where the working spec had:

  clusterNetwork:
    cni: cilium
    pods: ...
    services: ...

and the broken spec had:

  clusterNetwork:
    cniConfig:
      cilium: {}
    dns: {}
    pods: ...
    services: ...

I continued trying to isolate the problem and was able to reproduce the issue by adding/removing the dns: {} attribute in the cluster spec. When present, it caused the cluster to hang on upgrade. Unfortunately, each test run takes ~45 minutes, so it is tedious to test changes.

The dns attribute in clusterConfig is not present by default when running generate clusterconfig, as it is explicitly removed for cleanliness here. However, with the marshalling and unmarshalling done by the e2e tests, it is re-added and appears to be causing failures.
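
A quick way to check whether a given spec carries the empty key before kicking off an upgrade (the file path below is from this particular test run; purely illustrative):

grep -A5 'clusterNetwork:' eksa-test-2ffeb29-config/cluster.yaml
# look for an explicit dns: {} line under clusterNetwork; in the runs above,
# removing that line (or regenerating the spec with eksctl anywhere generate
# clusterconfig, which strips it) avoided the hang, while adding it back
# reproduced it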


maxdrib commented Apr 20, 2022

Additional logs from running clusterctl with -v=9 during the move:

2022-04-20T14:56:50.779-0400	V4	Task start	{"task_name": "capi-management-move-to-workload"}
2022-04-20T14:56:50.779-0400	V0	Moving cluster management from bootstrap to workload cluster
2022-04-20T14:56:50.779-0400	V3	Waiting for management machines to be ready before move
2022-04-20T14:56:51.266-0400	V4	Nodes ready	{"total": 3}
2022-04-20T14:57:45.161-0400	V4	Task finished	{"task_name": "capi-management-move-to-workload", "duration": "54.379619357s"}
2022-04-20T14:57:45.161-0400	V4	----------------------------------
2022-04-20T14:57:45.161-0400	V4	Task start	{"task_name": "collect-cluster-diagnostics"}
2022-04-20T14:57:45.161-0400	V0	collecting cluster diagnostics
2022-04-20T14:57:45.161-0400	V0	collecting management cluster diagnostics
2022-04-20T14:57:45.170-0400	V3	bundle config written	{"path": "eksa-test-2ffeb29/generated/bootstrap-cluster-2022-04-20T14:57:45-04:00-bundle.yaml"}
2022-04-20T14:57:45.170-0400	V1	creating temporary namespace for diagnostic collector	{"namespace": "eksa-diagnostics"}
2022-04-20T14:57:45.603-0400	V1	creating temporary ClusterRole and RoleBinding for diagnostic collector
2022-04-20T14:57:46.582-0400	V0	⏳ Collecting support bundle from cluster, this can take a while	{"cluster": "bootstrap-cluster", "bundle": "eksa-test-2ffeb29/generated/bootstrap-cluster-2022-04-20T14:57:45-04:00-bundle.yaml", "since": "2022-04-20T13:57:45.170-0400", "kubeconfig": "eksa-test-2ffeb29/generated/eksa-test-2ffeb29.kind.kubeconfig"}
2022-04-20T14:59:47.731-0400	V0	Support bundle archive created	{"path": "support-bundle-2022-04-20T18_57_47.tar.gz"}
2022-04-20T14:59:47.731-0400	V0	Analyzing support bundle	{"bundle": "eksa-test-2ffeb29/generated/bootstrap-cluster-2022-04-20T14:57:45-04:00-bundle.yaml", "archive": "support-bundle-2022-04-20T18_57_47.tar.gz"}
2022-04-20T14:59:48.389-0400	V0	Analysis output generated	{"path": "eksa-test-2ffeb29/generated/bootstrap-cluster-2022-04-20T14:59:48-04:00-analysis.yaml"}
2022-04-20T14:59:48.389-0400	V1	cleaning up temporary roles for diagnostic collectors
2022-04-20T14:59:48.789-0400	V1	cleaning up temporary namespace  for diagnostic collectors	{"namespace": "eksa-diagnostics"}
2022-04-20T14:59:54.921-0400	V0	collecting workload cluster diagnostics
2022-04-20T14:59:54.925-0400	V3	bundle config written	{"path": "eksa-test-2ffeb29/generated/eksa-test-2ffeb29-2022-04-20T14:59:54-04:00-bundle.yaml"}
2022-04-20T14:59:54.925-0400	V1	creating temporary namespace for diagnostic collector	{"namespace": "eksa-diagnostics"}
2022-04-20T14:59:55.359-0400	V1	creating temporary ClusterRole and RoleBinding for diagnostic collector
2022-04-20T14:59:57.627-0400	V0	⏳ Collecting support bundle from cluster, this can take a while	{"cluster": "eksa-test-2ffeb29", "bundle": "eksa-test-2ffeb29/generated/eksa-test-2ffeb29-2022-04-20T14:59:54-04:00-bundle.yaml", "since": "2022-04-20T13:59:54.925-0400", "kubeconfig": "eksa-test-2ffeb29/eksa-test-2ffeb29-eks-a-cluster.kubeconfig"}
2022-04-20T15:01:59.961-0400	V0	Support bundle archive created	{"path": "support-bundle-2022-04-20T18_59_58.tar.gz"}
2022-04-20T15:01:59.961-0400	V0	Analyzing support bundle	{"bundle": "eksa-test-2ffeb29/generated/eksa-test-2ffeb29-2022-04-20T14:59:54-04:00-bundle.yaml", "archive": "support-bundle-2022-04-20T18_59_58.tar.gz"}
2022-04-20T15:02:00.656-0400	V0	Analysis output generated	{"path": "eksa-test-2ffeb29/generated/eksa-test-2ffeb29-2022-04-20T15:02:00-04:00-analysis.yaml"}
2022-04-20T15:02:00.656-0400	V1	cleaning up temporary roles for diagnostic collectors
2022-04-20T15:02:01.184-0400	V1	cleaning up temporary namespace  for diagnostic collectors	{"namespace": "eksa-diagnostics"}
2022-04-20T15:02:06.931-0400	V4	Task finished	{"task_name": "collect-cluster-diagnostics", "duration": "4m21.762248472s"}
2022-04-20T15:02:06.931-0400	V4	----------------------------------
2022-04-20T15:02:06.931-0400	V4	Tasks completed	{"duration": "33m53.267851981s"}
2022-04-20T15:02:06.931-0400	V3	Cleaning up long running container	{"name": "eksa_1650479292198787000"}
Error: failed to upgrade cluster: moving CAPI management from source to target: failed moving management cluster: No default config file available
Performing move...
Discovering Cluster API objects
KubeadmConfig Count=3
KubeadmControlPlane Count=1
CloudStackMachineTemplate Count=4
MachineHealthCheck Count=2
KubeadmConfigTemplate Count=1
MachineSet Count=2
ConfigMap Count=1
CloudStackCluster Count=1
MachineDeployment Count=1
Machine Count=3
Secret Count=9
CloudStackMachine Count=3
Cluster Count=1
Total objects Count=32
Excluding secret from move (not linked with any Cluster) name="default-token-89k7k"
Moving Cluster API objects Clusters=1
Moving Cluster API objects ClusterClasses=0
Pausing the source cluster
Set Cluster.Spec.Paused Paused=true Cluster="eksa-test-2ffeb29" Namespace="eksa-system"
Pausing the source cluster classes
Creating target namespaces, if missing
Creating objects in the target cluster
Creating Cluster="eksa-test-2ffeb29" Namespace="eksa-system"
Retrying with backoff Cause="error creating \"cluster.x-k8s.io/v1beta1, Kind=Cluster\" eksa-system/eksa-test-2ffeb29: Internal error occurred: failed calling webhook \"default.cluster.cluster.x-k8s.io\": Post \"https://capi-webhook-service.capi-system.svc:443/mutate-cluster-x-k8s-io-v1beta1-cluster?timeout=10s\": read tcp 172.16.0.48:47812-\u003e10.108.124.180:443: read: connection reset by peer"
Creating Cluster="eksa-test-2ffeb29" Namespace="eksa-system"
Retrying with backoff Cause="error creating \"cluster.x-k8s.io/v1beta1, Kind=Cluster\" eksa-system/eksa-test-2ffeb29: Internal error occurred: failed calling webhook \"default.cluster.cluster.x-k8s.io\": Post \"https://capi-webhook-service.capi-system.svc:443/mutate-cluster-x-k8s-io-v1beta1-cluster?timeout=10s\": read tcp 172.16.0.48:47832-\u003e10.108.124.180:443: read: connection reset by peer"
Creating Cluster="eksa-test-2ffeb29" Namespace="eksa-system"
Retrying with backoff Cause="error creating \"cluster.x-k8s.io/v1beta1, Kind=Cluster\" eksa-system/eksa-test-2ffeb29: Internal error occurred: failed calling webhook \"default.cluster.cluster.x-k8s.io\": Post \"https://capi-webhook-service.capi-system.svc:443/mutate-cluster-x-k8s-io-v1beta1-cluster?timeout=10s\": read tcp 172.16.0.48:47858-\u003e10.108.124.180:443: read: connection reset by peer"
Creating Cluster="eksa-test-2ffeb29" Namespace="eksa-system"
Retrying with backoff Cause="error creating \"cluster.x-k8s.io/v1beta1, Kind=Cluster\" eksa-system/eksa-test-2ffeb29: Internal error occurred: failed calling webhook \"default.cluster.cluster.x-k8s.io\": Post \"https://capi-webhook-service.capi-system.svc:443/mutate-cluster-x-k8s-io-v1beta1-cluster?timeout=10s\": read tcp 172.16.0.48:47890-\u003e10.108.124.180:443: read: connection reset by peer"
Creating Cluster="eksa-test-2ffeb29" Namespace="eksa-system"
Retrying with backoff Cause="error creating \"cluster.x-k8s.io/v1beta1, Kind=Cluster\" eksa-system/eksa-test-2ffeb29: Internal error occurred: failed calling webhook \"default.cluster.cluster.x-k8s.io\": Post \"https://capi-webhook-service.capi-system.svc:443/mutate-cluster-x-k8s-io-v1beta1-cluster?timeout=10s\": read tcp 172.16.0.48:47934-\u003e10.108.124.180:443: read: connection reset by peer"
Creating Cluster="eksa-test-2ffeb29" Namespace="eksa-system"
Retrying with backoff Cause="error creating \"cluster.x-k8s.io/v1beta1, Kind=Cluster\" eksa-system/eksa-test-2ffeb29: Internal error occurred: failed calling webhook \"default.cluster.cluster.x-k8s.io\": Post \"https://capi-webhook-service.capi-system.svc:443/mutate-cluster-x-k8s-io-v1beta1-cluster?timeout=10s\": read tcp 172.16.0.48:48010-\u003e10.108.124.180:443: read: connection reset by peer"
Creating Cluster="eksa-test-2ffeb29" Namespace="eksa-system"
Retrying with backoff Cause="error creating \"cluster.x-k8s.io/v1beta1, Kind=Cluster\" eksa-system/eksa-test-2ffeb29: Internal error occurred: failed calling webhook \"default.cluster.cluster.x-k8s.io\": Post \"https://capi-webhook-service.capi-system.svc:443/mutate-cluster-x-k8s-io-v1beta1-cluster?timeout=10s\": read tcp 172.16.0.48:48098-\u003e10.108.124.180:443: read: connection reset by peer"
Creating Cluster="eksa-test-2ffeb29" Namespace="eksa-system"
Retrying with backoff Cause="error creating \"cluster.x-k8s.io/v1beta1, Kind=Cluster\" eksa-system/eksa-test-2ffeb29: Internal error occurred: failed calling webhook \"default.cluster.cluster.x-k8s.io\": Post \"https://capi-webhook-service.capi-system.svc:443/mutate-cluster-x-k8s-io-v1beta1-cluster?timeout=10s\": read tcp 172.16.0.48:48238-\u003e10.108.124.180:443: read: connection reset by peer"
Creating Cluster="eksa-test-2ffeb29" Namespace="eksa-system"
Retrying with backoff Cause="error creating \"cluster.x-k8s.io/v1beta1, Kind=Cluster\" eksa-system/eksa-test-2ffeb29: Internal error occurred: failed calling webhook \"default.cluster.cluster.x-k8s.io\": Post \"https://capi-webhook-service.capi-system.svc:443/mutate-cluster-x-k8s-io-v1beta1-cluster?timeout=10s\": read tcp 172.16.0.48:48462-\u003e10.108.124.180:443: read: connection reset by peer"
Creating Cluster="eksa-test-2ffeb29" Namespace="eksa-system"
Error: action failed after 10 attempts: error creating "cluster.x-k8s.io/v1beta1, Kind=Cluster" eksa-system/eksa-test-2ffeb29: Internal error occurred: failed calling webhook "default.cluster.cluster.x-k8s.io": Post "https://capi-webhook-service.capi-system.svc:443/mutate-cluster-x-k8s-io-v1beta1-cluster?timeout=10s": read tcp 172.16.0.48:48772->10.108.124.180:443: read: connection reset by peer

    cluster.go:529: Command eksctl anywhere [upgrade cluster -f eksa-test-2ffeb29-config/cluster.yaml -v 4 --bundles-override bin/local-bundle-release.yaml] failed with error: exit status 255: Error: action failed after 10 attempts: error creating "cluster.x-k8s.io/v1beta1, Kind=Cluster" eksa-system/eksa-test-2ffeb29: Internal error occurred: failed calling webhook "default.cluster.cluster.x-k8s.io": Post "https://capi-webhook-service.capi-system.svc:443/mutate-cluster-x-k8s-io-v1beta1-cluster?timeout=10s": read tcp 172.16.0.48:48772->10.108.124.180:443: read: connection reset by peer
2022-04-20T15:02:07.327-0400	V3	e2e	Cleaning up long running container	{"name": "eksa_1650478924208997000"}
--- FAIL: TestCloudStackKubernetes120RedhatTo121Upgrade (2403.40s)
FAIL
{"error":"missing_field_value","ok":false,"response_metadata":{"messages":["[ERROR] empty required field: 'args'"]}}curl: (3) unmatched close brace/bracket in URL position 3:
0"}


maxdrib commented Apr 20, 2022

Logs and resources on the cluster here: support-bundle-2022-04-19T18_28_43.zip

After the move failed, I tried applying the eksa cluster spec manually on the workload cluster, and was able to reproduce the connection reset error:

➜  eks-anywhere git:(main) ✗ k apply -f eksa-test-2ffeb29/eksa-test-2ffeb29-eks-a-cluster.yaml --kubeconfig eksa-test-2ffeb29/eksa-test-2ffeb29-eks-a-cluster.kubeconfig
cloudstackmachineconfig.anywhere.eks.amazonaws.com/eksa-test-2ffeb29-cp unchanged
cloudstackmachineconfig.anywhere.eks.amazonaws.com/eksa-test-2ffeb29 unchanged
Error from server (InternalError): error when applying patch:
{"metadata":{"annotations":{"anywhere.eks.amazonaws.com/paused":null,"kubectl.kubernetes.io/last-applied-configuration":"{\"apiVersion\":\"anywhere.eks.amazonaws.com/v1alpha1\",\"kind\":\"Cluster\",\"metadata\":{\"annotations\":{},\"name\":\"eksa-test-2ffeb29\",\"namespace\":\"default\"},\"spec\":{\"clusterNetwork\":{\"cniConfig\":{\"cilium\":{}},\"pods\":{\"cidrBlocks\":[\"192.169.0.0/16\"]},\"services\":{\"cidrBlocks\":[\"10.96.0.0/12\"]}},\"controlPlaneConfiguration\":{\"count\":1,\"endpoint\":{\"host\":\"172.16.0.31:6443\"},\"machineGroupRef\":{\"kind\":\"CloudStackMachineConfig\",\"name\":\"eksa-test-2ffeb29-cp\"}},\"datacenterRef\":{\"kind\":\"CloudStackDatacenterConfig\",\"name\":\"eksa-test-2ffeb29\"},\"kubernetesVersion\":\"1.20\",\"managementCluster\":{\"name\":\"eksa-test-2ffeb29\"},\"workerNodeGroupConfigurations\":[{\"count\":2,\"machineGroupRef\":{\"kind\":\"CloudStackMachineConfig\",\"name\":\"eksa-test-2ffeb29\"},\"name\":\"md-0\"}]}}\n"}}}
to:
Resource: "anywhere.eks.amazonaws.com/v1alpha1, Resource=clusters", GroupVersionKind: "anywhere.eks.amazonaws.com/v1alpha1, Kind=Cluster"
Name: "eksa-test-2ffeb29", Namespace: "default"
for: "eksa-test-2ffeb29/eksa-test-2ffeb29-eks-a-cluster.yaml": Internal error occurred: failed calling webhook "validation.cluster.anywhere.amazonaws.com": Post "https://eksa-webhook-service.eksa-system.svc:443/validate-anywhere-eks-amazonaws-com-v1alpha1-cluster?timeout=10s": read tcp 172.16.0.48:52988->10.103.244.98:443: read: connection reset by peer
Error from server (InternalError): error when applying patch:
{"metadata":{"annotations":{"anywhere.eks.amazonaws.com/paused":null,"kubectl.kubernetes.io/last-applied-configuration":"{\"apiVersion\":\"anywhere.eks.amazonaws.com/v1alpha1\",\"kind\":\"CloudStackDatacenterConfig\",\"metadata\":{\"annotations\":{},\"name\":\"eksa-test-2ffeb29\",\"namespace\":\"default\"},\"spec\":{\"account\":\"admin\",\"domain\":\"ROOT\",\"managementApiEndpoint\":\"http://172.16.0.1:8080/client/api\",\"zones\":[{\"name\":\"zone1\",\"network\":{\"name\":\"Shared1\"}}]}}\n"}}}
to:
Resource: "anywhere.eks.amazonaws.com/v1alpha1, Resource=cloudstackdatacenterconfigs", GroupVersionKind: "anywhere.eks.amazonaws.com/v1alpha1, Kind=CloudStackDatacenterConfig"
Name: "eksa-test-2ffeb29", Namespace: "default"
for: "eksa-test-2ffeb29/eksa-test-2ffeb29-eks-a-cluster.yaml": Internal error occurred: failed calling webhook "validation.cloudstackdatacenterconfig.anywhere.amazonaws.com": Post "https://eksa-webhook-service.eksa-system.svc:443/validate-anywhere-eks-amazonaws-com-v1alpha1-cloudstackdatacenterconfig?timeout=10s": read tcp 172.16.0.48:52992->10.103.244.98:443: read: connection reset by peer


maxdrib commented Apr 20, 2022

After merging from main to include the latest changes, the test suddenly passed. Will rerun.

--- PASS: TestCloudStackKubernetes120RedhatTo121Upgrade (1919.60s)
PASS

Edit: The rerun failed with the same CAPI webhook error on the move to the workload cluster, so the failure seems to be nondeterministic, regardless of the dns: {} observations above.


maxdrib commented Apr 21, 2022

Another failure was observed, this time during the move of CAPI management from the workload cluster to the bootstrap cluster:

2022-04-21T09:15:44.846-0400	V4	----------------------------------
2022-04-21T09:15:44.846-0400	V4	Task start	{"task_name": "capi-management-move-to-bootstrap"}
2022-04-21T09:15:44.846-0400	V0	Moving cluster management from workload to bootstrap cluster
2022-04-21T09:15:44.846-0400	V3	Waiting for management machines to be ready before move
2022-04-21T09:15:45.323-0400	V4	Nodes ready	{"total": 3}
2022-04-21T09:16:02.710-0400	V3	Waiting for control planes to be ready after move
2022-04-21T09:16:10.256-0400	V3	Waiting for workload cluster control plane replicas to be ready after move
2022-04-21T09:16:10.940-0400	V3	Waiting for workload cluster machine deployment replicas to be ready after move

2022-04-21T10:23:29.774-0400	V4	Task finished	{"task_name": "capi-management-move-to-bootstrap", "duration": "30m27.257060618s"}
2022-04-21T10:23:29.775-0400	V4	----------------------------------
2022-04-21T10:23:29.775-0400	V4	Task start	{"task_name": "collect-cluster-diagnostics"}
2022-04-21T10:23:29.775-0400	V0	collecting cluster diagnostics
2022-04-21T10:23:29.775-0400	V0	collecting management cluster diagnostics
2022-04-21T10:23:29.789-0400	V3	bundle config written	{"path": "eksa-test-2ffeb29/generated/bootstrap-cluster-2022-04-21T10:23:29-04:00-bundle.yaml"}
2022-04-21T10:23:29.789-0400	V1	creating temporary namespace for diagnostic collector	{"namespace": "eksa-diagnostics"}
2022-04-21T10:23:30.184-0400	V1	creating temporary ClusterRole and RoleBinding for diagnostic collector
2022-04-21T10:23:34.643-0400	V0	⏳ Collecting support bundle from cluster, this can take a while	{"cluster": "bootstrap-cluster", "bundle": "eksa-test-2ffeb29/generated/bootstrap-cluster-2022-04-21T10:23:29-04:00-bundle.yaml", "since": "2022-04-21T09:23:29.789-0400", "kubeconfig": "eksa-test-2ffeb29/generated/eksa-test-2ffeb29.kind.kubeconfig"}
2022-04-21T10:25:20.036-0400	V0	Support bundle archive created	{"path": "support-bundle-2022-04-21T14_23_35.tar.gz"}
2022-04-21T10:25:20.037-0400	V0	Analyzing support bundle	{"bundle": "eksa-test-2ffeb29/generated/bootstrap-cluster-2022-04-21T10:23:29-04:00-bundle.yaml", "archive": "support-bundle-2022-04-21T14_23_35.tar.gz"}
2022-04-21T10:25:20.891-0400	V0	Analysis output generated	{"path": "eksa-test-2ffeb29/generated/bootstrap-cluster-2022-04-21T10:25:20-04:00-analysis.yaml"}
2022-04-21T10:25:20.891-0400	V1	cleaning up temporary roles for diagnostic collectors
2022-04-21T10:25:21.454-0400	V1	cleaning up temporary namespace  for diagnostic collectors	{"namespace": "eksa-diagnostics"}
2022-04-21T10:25:28.151-0400	V0	collecting workload cluster diagnostics
2022-04-21T10:25:28.165-0400	V3	bundle config written	{"path": "eksa-test-2ffeb29/generated/eksa-test-2ffeb29-2022-04-21T10:25:28-04:00-bundle.yaml"}
2022-04-21T10:25:28.165-0400	V1	creating temporary namespace for diagnostic collector	{"namespace": "eksa-diagnostics"}
2022-04-21T10:28:20.358-0400	V0	WARNING: failed to create eksa-diagnostics namespace. Some collectors may fail to run.	{"err": "creating namespace eksa-diagnostics: Unable to connect to the server: dial tcp 172.16.0.31:6443: i/o timeout\n"}
2022-04-21T10:28:20.359-0400	V1	creating temporary ClusterRole and RoleBinding for diagnostic collector
2022-04-21T10:31:12.367-0400	V0	WARNING: failed to create roles for eksa-diagnostic-collector. Some collectors may fail to run.	{"err": "executing apply: Unable to connect to the server: dial tcp 172.16.0.31:6443: i/o timeout\n"}
2022-04-21T10:31:12.367-0400	V0	⏳ Collecting support bundle from cluster, this can take a while	{"cluster": "eksa-test-2ffeb29", "bundle": "eksa-test-2ffeb29/generated/eksa-test-2ffeb29-2022-04-21T10:25:28-04:00-bundle.yaml", "since": "2022-04-21T09:25:28.165-0400", "kubeconfig": "eksa-test-2ffeb29/eksa-test-2ffeb29-eks-a-cluster.kubeconfig"}
2022-04-21T10:31:42.836-0400	V4	Task finished	{"task_name": "collect-cluster-diagnostics", "duration": "8m13.073853654s"}
2022-04-21T10:31:42.837-0400	V4	----------------------------------
2022-04-21T10:31:42.837-0400	V4	Tasks completed	{"duration": "40m51.572074229s"}
2022-04-21T10:31:42.838-0400	V3	Cleaning up long running container	{"name": "eksa_1650546812263508000"}
Error: failed to upgrade cluster: waiting for workload cluster machinedeployment replicas to be ready: retries exhausted waiting for machinedeployment replicas to be ready: machine deployment is in  phase
    cluster.go:529: Command eksctl anywhere [upgrade cluster -f eksa-test-2ffeb29-config/cluster.yaml -v 4 --bundles-override bin/local-bundle-release.yaml] failed with error: exit status 255: Error: failed to upgrade cluster: waiting for workload cluster machinedeployment replicas to be ready: retries exhausted waiting for machinedeployment replicas to be ready: machine deployment is in  phase
2022-04-21T10:31:44.090-0400	V3	e2e	Cleaning up long running container	{"name": "eksa_1650546394748099000"}
--- FAIL: TestCloudStackKubernetes120RedhatTo121Upgrade (2872.14s)
FAIL

And the CAPI logs say the CloudStackMachineTemplate referenced by the MachineSet "eksa-test-2ffeb29-md-0-6b77c5758b" can't be found:

E0421 14:20:57.281244       1 controller.go:317] controller/machineset "msg"="Reconciler error" "error"="failed to retrieve CloudStackMachineTemplate external object \"eksa-system\"/\"eksa-test-2ffeb29-md-0-1650546540971\": cloudstackmachinetemplates.infrastructure.cluster.x-k8s.io \"eksa-test-2ffeb29-md-0-1650546540971\" not found" "name"="eksa-test-2ffeb29-md-0-6b77c5758b" "namespace"="eksa-system" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="MachineSet" 

Full support bundle
support-bundle-2022-04-21T14_23_35.zip

maxdrib added and removed the external (An issue, bug or feature request filed from outside the AWS org) label on Apr 21, 2022

maxdrib commented Apr 21, 2022

Guillermo's theory: it could be a race condition after the upgrade where the worker nodes are still being rolled out. In that case, there is a chance the CAPI webhooks (which run in a controller on the worker nodes) are not ready yet.


maxdrib commented Apr 25, 2022

We were able to fix the broken cluster by deleting and reapplying the cilium daemonset.
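
For reference, the manual workaround amounts to rolling the Cilium daemonset on the workload cluster. A minimal sketch, assuming the default EKS-A install (daemonset named cilium in kube-system; the kubeconfig path is the one from this test run):

# roll the daemonset so every node gets a fresh cilium pod
kubectl --kubeconfig eksa-test-2ffeb29/eksa-test-2ffeb29-eks-a-cluster.kubeconfig \
  -n kube-system rollout restart daemonset/cilium

# or, more surgically, delete just the cilium pod on the broken CP node
kubectl --kubeconfig eksa-test-2ffeb29/eksa-test-2ffeb29-eks-a-cluster.kubeconfig \
  -n kube-system delete pod -l k8s-app=cilium \
  --field-selector spec.nodeName=<cp-node-name>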

maxdrib added this to the next milestone on Apr 27, 2022

maxdrib commented Apr 28, 2022

Looking in the kube-proxy logs, I saw some errors like the ones described in kubernetes/kubernetes#107482.


maxdrib commented Apr 28, 2022

Also, when I try running the cilium connectivity test, it often fails to start, with logs very similar to cilium/cilium-cli#342.
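
For context, that check is the upstream cilium-cli connectivity suite run against the workload cluster; roughly (kubeconfig path is from this run, and cilium-cli picks it up via the standard kubeconfig resolution):

export KUBECONFIG=eksa-test-2ffeb29/eksa-test-2ffeb29-eks-a-cluster.kubeconfig
cilium connectivity test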


maxdrib commented Apr 29, 2022

Experiment: query the cert-manager-webhook in different ways from different pods on the CP VM.

TL;DR: Queries from the kube-apiserver to the cert-manager-webhook succeed when using the webhook's pod endpoint, but are reset by peer when using the ClusterIP endpoint. DNS appears to be working fine.
This leads me to suspect an issue with kube-proxy, which could be mapping the ClusterIP address to an incorrect pod IP.
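
Roughly what the experiment looked like; the service name, namespace, and ports are assumptions based on a default cert-manager install, and the IPs are illustrative placeholders:

# find the ClusterIP and the backing pod endpoint for the webhook service
kubectl -n cert-manager get svc cert-manager-webhook -o wide
kubectl -n cert-manager get endpoints cert-manager-webhook

# from a shell on the CP VM (or a pod running on it): the pod endpoint answers
curl -vk https://<webhook-pod-ip>:10250/

# from the same place: the ClusterIP gets "connection reset by peer"
curl -vk https://<webhook-cluster-ip>:443/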

We tried restarting kubelet, deleting the kube-proxy pod, comparing /etc/hosts contents between pods, and running cilium status on the cilium pods. It seems like the issue is limited to static pods running on the CP node.

Eventually, deleting the cilium pod running on the CP VM allowed the cluster to return to a stable state.


maxdrib commented May 2, 2022

Latest update: comparing iptables between the CP node and the worker node, we observe that some Cilium-related rules are missing. We suspect that reinstalling cilium will re-add these rules and fix the broken cluster.
Screen Shot 2022-05-02 at 3 28 22 PM
broken_iptables.log
normal_iptables.log
broken_cilium_logs.log
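
The comparison above is straightforward to reproduce on the nodes; a minimal sketch (node names are placeholders, run over SSH):

# dump the Cilium-related rules on each node and diff them
ssh <cp-node> 'sudo iptables-save | grep -i cilium | sort' > cp_cilium_rules.txt
ssh <worker-node> 'sudo iptables-save | grep -i cilium | sort' > worker_cilium_rules.txt
diff cp_cilium_rules.txt worker_cilium_rules.txt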

mitalipaygude modified the milestones: next → next+1 on May 5, 2022
maxdrib added a commit to maxdrib/eks-anywhere that referenced this issue May 6, 2022
vivek-koppuru pushed a commit that referenced this issue May 7, 2022
…sue (#2057)

* Restarting cilium daemonset on upgrade workflow to address #1888

* moving cilium restart to cluster manager

* Fixing unit tests

* Improving unit test coverage

* Renaming daemonset restart method to be generic

* Renaming tests to decouple from cilium
eks-distro-pr-bot pushed a commit to eks-distro-pr-bot/eks-anywhere that referenced this issue May 9, 2022
eks-distro-bot pushed a commit that referenced this issue May 9, 2022
…ess upgrade issue (#2068)

* Restarting cilium daemonset on upgrade workflow to address #1888

* moving cilium restart to cluster manager

* Fixing unit tests

* Improving unit test coverage

* Renaming daemonset restart method to be generic

* Renaming tests to decouple from cilium

Co-authored-by: Max Dribinsky <[email protected]>
maxdrib closed this as completed on May 9, 2022