CloudStack upgrade e2e test is failing with timeout on move #1888
I was able to successfully perform an upgrade when using an older cluster spec. Comparing the cluster specs, I noticed a difference in the cluster.spec.networkConfig attributes, where the working spec had:
and the broken spec had:
I continued trying to isolate the problem and was able to reproduce the issue by adding/removing that attribute.
Additional logs from running clusterctl with -v=9 on move:
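For reference, a sketch of the move invocation used to gather those logs; the kubeconfig paths are placeholders:

```sh
# Move CAPI objects from the bootstrap (kind) cluster to the workload cluster
# with maximum verbosity to capture the webhook call failures.
clusterctl move \
  --kubeconfig ./bootstrap-cluster.kubeconfig \
  --to-kubeconfig ./eksa-test.kubeconfig \
  -v=9
```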
Logs and resources on the cluster here: support-bundle-2022-04-19T18_28_43.zip

After the move failed, I tried applying the eksa cluster spec manually on the workload cluster, and was able to reproduce the connection reset error.
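A minimal sketch of that manual repro, assuming the workload cluster kubeconfig and spec file names below (both placeholders):

```sh
# Apply the EKS-A cluster spec directly against the workload cluster;
# the webhook call made by the API server is what fails with
# "connection reset by peer".
kubectl apply -f cluster.yaml --kubeconfig ./eksa-test.kubeconfig
```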
After merging from main to include the latest changes, the test suddenly passed. Will rerun.
Edit: The rerun failed with the same CAPI webhook error on move to the workload cluster, so it seems to be nondeterministic, regardless of that networkConfig attribute.
Another failure was observed during the move back to the workload cluster.
The CAPI logs say the CloudStackMachineTemplate "eksa-test-2ffeb29-md-0-6b77c5758b" can't be found (see the check sketched below).
Full support bundle:
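One way to confirm the missing template on the workload cluster is to list the CAPC machine templates; the kubeconfig path is a placeholder:

```sh
# List CloudStackMachineTemplates in all namespaces on the workload cluster;
# the "eksa-test-2ffeb29-md-0-6b77c5758b" template was absent in this failure.
kubectl get cloudstackmachinetemplates -A --kubeconfig ./eksa-test.kubeconfig
```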
Guillermo's theory: it could be a race condition after the upgrade where the worker nodes are still being rolled out.
We were able to fix the broken cluster by deleting and reapplying the cilium daemonset on the cluster.
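Roughly what that fix looked like, assuming Cilium is installed as the cilium DaemonSet in kube-system and the original manifest is still on hand (the manifest path is a placeholder):

```sh
# Delete the Cilium DaemonSet, then reapply it from the original manifest
# so all cilium pods are recreated with fresh state.
kubectl delete daemonset cilium -n kube-system
kubectl apply -f cilium.yaml
```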
Looking in the kube-proxy logs, I saw some errors like the ones described in kubernetes/kubernetes#107482.
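For anyone else checking, the kube-proxy logs can be pulled with something like this (label selector assumed to match the default kube-proxy DaemonSet):

```sh
# Dump kube-proxy logs from every node and look for conntrack/iptables errors
# similar to those reported in kubernetes/kubernetes#107482.
kubectl logs -n kube-system -l k8s-app=kube-proxy --tail=-1 | grep -i error
```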
Also, when I try running the cilium connectivity test, it often fails to start with logs very similar to cilium/cilium-cli#342.
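The connectivity test was run with the cilium CLI against the workload cluster; the kubeconfig path below is a placeholder:

```sh
# Run the Cilium connectivity test suite; in this broken state it frequently
# fails while deploying its test workloads, similar to cilium/cilium-cli#342.
KUBECONFIG=./eksa-test.kubeconfig cilium connectivity test
```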
Experiment: query the cert-manager-webhook in different ways from different pods on the CP VM.

TL;DR: Queries from kube-apiserver to the cert-manager-webhook succeed when using the webhook's pod endpoint, but are reset by peer when using the ClusterIP endpoint. DNS appears to be working fine.

We tried restarting kubelet, deleting the kube-proxy pod, comparing /etc/hosts contents between pods, and running … Eventually, deleting the cilium pod running on the CP VM allowed the cluster to return to a stable state.
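A sketch of the ClusterIP vs. pod-endpoint comparison, assuming the webhook Service lives at cert-manager/cert-manager-webhook with cert-manager's default ports (443 on the Service, 10250 on the pod); the IPs are placeholders read from the Service and Endpoints objects:

```sh
# Find the ClusterIP and the pod endpoint backing the webhook Service.
kubectl get svc cert-manager-webhook -n cert-manager -o wide
kubectl get endpoints cert-manager-webhook -n cert-manager -o wide

# Direct pod endpoint: the TLS handshake completes and a response comes back.
curl -vk https://<webhook-pod-ip>:10250/

# ClusterIP endpoint: fails with "connection reset by peer".
curl -vk https://<cluster-ip>:443/
```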
Latest update: comparing iptables between the CP node and the worker node, we observe that some rules relating to Cilium are missing. We suspect that reinstalling Cilium will re-add these rules and fix the broken cluster.
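The comparison was along these lines, run on each node (file names are placeholders):

```sh
# On each node: dump the current iptables rules and keep only Cilium chains.
sudo iptables-save | grep -i cilium > /tmp/cilium-rules-$(hostname).txt

# Copy both dumps to one machine and diff them to spot the missing rules.
diff /tmp/cilium-rules-<cp-node>.txt /tmp/cilium-rules-<worker-node>.txt
```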
…ess upgrade issue (#2068)
* Restarting cilium daemonset on upgrade workflow to address #1888
* Moving cilium restart to cluster manager
* Fixing unit tests
* Improving unit test coverage
* Renaming daemonset restart method to be generic
* Renaming tests to decouple from cilium

Co-authored-by: Max Dribinsky <[email protected]>
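The fix merged in #2068 restarts the Cilium DaemonSet during the upgrade workflow; the manual equivalent on an already-broken cluster would be roughly this (kubeconfig path is a placeholder):

```sh
# Trigger a rolling restart of the Cilium DaemonSet on the upgraded cluster,
# recreating the pods and restoring the missing iptables rules.
kubectl rollout restart daemonset cilium -n kube-system \
  --kubeconfig ./eksa-test.kubeconfig
```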
What happened:
Error message displayed below:
What you expected to happen:
Upgrade should succeed
How to reproduce it (as minimally and precisely as possible):
Execute the TestCloudStackKubernetes120RedhatTo121Upgrade e2e test against the code in main.
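A sketch of running that test locally, assuming the e2e build tag and the test/e2e package layout used in this repo, with the required CloudStack credentials and environment variables already exported (all of those details are assumptions):

```sh
# Run only the failing e2e test from the repository root.
go test -tags e2e ./test/e2e -run TestCloudStackKubernetes120RedhatTo121Upgrade -v -timeout 3h
```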
Anything else we need to know?:
Environment:
cluster.yaml.zip