CI failure: capi-e2e-release-1-1-1-23-1-24 is failing consistently #7768
A similar issue was reported for image pull failures in kubeadm: kubernetes/kubeadm#2761. Looks like the suggested solution was reverting imageRegistry. Any thoughts @fabriziopandini @killianmuldoon @ykakarap @CecileRobertMichon @oscr? |
Actually, #7410 was changing the registry from `k8s.gcr.io` to `registry.k8s.io`. |
Is that image available in the `k8s.gcr.io` registry? Running locally:
However it is available here:
|
It is not, but https://github.com/kubernetes-sigs/cluster-api/blob/release-1.1/controlplane/kubeadm/internal/workload_cluster_coredns.go#L220-L224 should handle the transformation automatically AFAIU, and that image is available:
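Roughly, the rewrite that code performs looks like this in the CoreDNS Deployment (a sketch based on the thread's examples, not the actual source or test data):

```yaml
# CoreDNS container image in the kube-system Deployment (illustrative):
spec:
  containers:
  - name: coredns
    # old flat path, whose tags stop before v1.8.6:
    # image: k8s.gcr.io/coredns:1.7.0
    # nested path introduced with CoreDNS v1.8.0, which the linked CAPI
    # code should produce for the v1.8.6 upgrade:
    image: k8s.gcr.io/coredns/coredns:v1.8.6
```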
|
Interesting, we are seeing the same error in CAPZ but with CAPI 1.3... https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-provider-azure#capi-periodic-upgrade-main-1-23-1-24 |
This is confusing - there shouldn't be a coreDNS upgrade from 1.23 to 1.24. If I've read it right, 1.23 uses coreDNS v1.8.6 by default, and the directive is to upgrade to v1.8.6, so it should be a no-op. I'm not able to replicate this locally, but I'm trying to figure out what's going on. |
In CAPA CSIMigration tests, we are upgrading from v1.22 to v1.23. |
Can you point me to the test spec for this? It seems like it may be a different issue from the one this issue covers. |
Here is the test: https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/main/test/e2e/suites/unmanaged/unmanaged_functional_test.go#L326 We are facing this issue only after bumping CAPI: kubernetes-sigs/cluster-api-provider-aws#3920 |
If the issue seems different, I can raise it separately. |
My best guess after trying to replicate this is that there is a clash between the CAPI code that fixes the coreDNS image name for older versions and the update to kubeadm which uses the new registry for coredns. The root cause is that the Kubernetes versions used in the e2e tests have been updated to kubeadm versions that include the registry fix, and there aren't yet kindest/node images available for those versions. I think this should be fixable by specifying the imageRepository in the clusterConfiguration field of the KubeadmControlPlane. I've got a version which does this for the failing 1.1 branch, but I'd like to find out if it can fix the issues on CAPZ or CAPA. Starting next week I'll build a kindest/node image with the new versions of 1.23 and 1.24 so I can properly test the fix with CAPD.
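A minimal sketch of that workaround (names and the pinned value are hypothetical, CAPD infrastructure used for illustration; the kubeadm field is `imageRepository`):

```yaml
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: example-control-plane        # hypothetical name
spec:
  replicas: 1
  version: v1.24.0
  machineTemplate:
    infrastructureRef:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
      kind: DockerMachineTemplate    # CAPD, for illustration only
      name: example-control-plane
  kubeadmConfigSpec:
    clusterConfiguration:
      # Pin the registry explicitly so the kubeadm binary in the node
      # image and CAPI's CoreDNS handling agree on a single source.
      imageRepository: registry.k8s.io
```
|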
I am not sure why #6917 was not backported to the `release-1.1` branch. |
/cc @chrischdi |
Hi @furkatgofurov7, it was not backported to release-1.1 because that branch was already out of support: #6917 (comment). |
Also: if no imageRepository is set, the kubeadm default is used. |
@chrischdi hi! The default in kubeadm points to "registry.k8s.io", does it not? |
The default of kubeadm depends heavily on the version of the kubeadm binary in use. In this case, the test seems to fail since roughly the release of Kubernetes v1.24.9. The changelog also notes an interesting kubeadm change:
The above test uses ClusterClass and makes use of the default value for the imageRepository of the ClusterClass, which is `k8s.gcr.io` ([src rendered] / [src clusterclass default]). It looks like this setting, together with kubeadm v1.24.9, breaks this case; see the sketch below.
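A sketch of where that default lives in the test ClusterClass's control-plane template (names hypothetical):

```yaml
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlaneTemplate
metadata:
  name: quick-start-control-plane    # hypothetical name
spec:
  template:
    spec:
      kubeadmConfigSpec:
        clusterConfiguration:
          # The pinned default; kubeadm v1.24.9+ no longer rewrites
          # coredns => coredns/coredns for this old registry.
          imageRepository: k8s.gcr.io
```
|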
outside of capi (kubeadm only) i saw a number of failures where users pinned to registry.k8s.io before it became the default in kubeadm version foo. that is not right; users should leave any registry fields blank unless they are pinning to a non-default, custom registry (not the old or new default) - see the sketch below. also, again, i greatly regret agreeing to making the coredns paths inconsistent depending on default / non-default registry. ironically it was capi maintainers that pushed for this.
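in ClusterConfiguration terms, a sketch of what that means (the mirror host is hypothetical):

```yaml
# recommended: leave imageRepository unset so the kubeadm binary's own
# default applies (k8s.gcr.io or registry.k8s.io, depending on version)
clusterConfiguration: {}

# only set it when pinning to a genuinely custom registry:
# clusterConfiguration:
#   imageRepository: mirror.example.com/k8s    # hypothetical mirror
```
|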
/kind failing-test |
I think the difference why the test started failing now is that previously kubeadm auto-migrated from `k8s.gcr.io/coredns` => `k8s.gcr.io/coredns/coredns` and now it doesn't do that for the old registry anymore. I think the solution for our release-1.1 tests might just be to stop pinning the registry in the e2e test. If I see correctly, that's also what we are doing on newer branches. I think for now it would be fine to make this change on release-1.1 even though it's out of support (it's only an e2e test change). I will bring up the point about dropping tests of out-of-support CAPI versions independently. I think the main point for users here is (as written by Lubomir):
Essentially this is what kubeadm supports, and as KCP is tightly coupled to kubeadm we should support the same. KCP is slightly more flexible as it auto-migrates coredns => coredns/coredns for both registries (in v1.3 and v1.2.8+), but apart from that I think we are just aligned with the kubeadm behavior. If I understood it correctly, there's nothing to do for CAPI main / 1.3.x / 1.2.x |
Interesting, I was not aware of this.
correct
I thought it is wider than that - have you checked #7768 (comment)? CAPZ is failing using CAPI v1.3, and CAPA #7768 (comment) (this I have not checked myself and am not sure if it is the same issue). Some logs from the CAPM3 provider; basically cloud-init is failing on the upgraded node:
|
@Ankitasw @CecileRobertMichon Can you please confirm which versions of CAPI you are seeing this issue with? As far as I can see, CAPA is on v1.2.7, and for CAPZ there seem to be contradictory statements:
vs.
|
Which Cluster API and Kubernetes versions exactly are used in this test? I would suspect it's a Kubernetes version which doesn't have the registry default change (not the latest patch version). Are you pinning the registry to `k8s.gcr.io`? |
We have a CAPI bump PR open in CAPA and we are facing this issue while running tests in that PR. We first bumped it to v1.3.0, and from there we observed CCM migration test failures with this image-not-found error. Also, we are using registry as |
Here are the details:
Edit: clusterctl is v1.2.8 but the go modules are at v1.2.6 |
@sbueringer so the upgraded Kubernetes version should be the latest patch release? That should be |
The following patch versions should have the change: v1.22.17, v1.23.15, v1.24.9, and everything from v1.25.0 onwards.
I can look into the different cases, but it will take a bit until I have time for that. EDIT: The patch versions I listed for v1.22, v1.23 and v1.24 were one too low; updated now. |
We don't pin the imageRepository in CAPA, and even after using the above patch versions, we are still getting this error in the CAPA CCM migration tests. |
@sbueringer that confirms what I'm seeing and why the CAPZ test mysteriously fixed itself (we automatically pick up new patches when the versions become available):
|
@CecileRobertMichon This would make sense if the test sets the imageRepository to `k8s.gcr.io`. Is the imageRepo (in KCP) set in this test? EDIT: Checked the resources in https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/periodic-cluster-api-provider-azure-e2e-workload-upgrade-1-23-1-24-main/1603466731035037696/artifacts/clusters/bootstrap/resources/ - imageRepository is not set in KCP. I wonder where this test run got `k8s.gcr.io` from. EDIT 2: Okay, I have a working theory of what happened:
If I'm correct, this essentially means that Cluster API v1.2.8 and v1.3.0 are only compatible with Kubernetes >= v1.22.17, >= v1.23.15, >= v1.24.9, and >= v1.25.0 (if the kubeadm providers are used). P.S. The patch versions in #7768 (comment) were initially one too low; updated now. |
To confirm my theory: @lentzi90 Would it be possible to check if the upgrade test works when upgrading to v1.24.9? @Ankitasw Would it be possible to check if the upgrade test works when upgrading to v1.23.15? @furkatgofurov7 I don't know which minor versions you are using in the test, but can you please also retry with the latest Kubernetes patch releases? |
The tests are passing with the updated k8s patch version, thanks a lot @sbueringer for the analysis 🙇‍♀️. Great work 👍 |
I will come back with the results after testing with the new k8s patch version |
It works! 🎉 Thanks for the help! |
@sbueringer you were right, tests are passing after uplifting k8s version 👍🏼 |
/triage accepted
It seems that we nailed down this problem across the board. Amazing team work! |
@fabriziopandini: Closing this issue. |
We still have an issue with kubeadm v1.22.x / v1.23.x / v1.24.x binaries that use the old registry. I've opened an issue to follow up on those cases: #7833 |
The release-1.1 branch job https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-1.1#capi-e2e-release-1-1-1-23-1-24 has been failing consistently since December 9th.
Logs from the corresponding prow job:
```
Dec 15 12:41:10.938: INFO: At 2022-12-15 12:31:23 +0000 UTC - event for coredns-74f7f66b6f-s6m5s: {kubelet k8s-upgrade-and-conformance-0v458z-md-0-lcjxn-69fd749f7f-5dlnz} Failed: Failed to pull image "k8s.gcr.io/coredns:v1.8.6": rpc error: code = NotFound desc = failed to pull and unpack image "k8s.gcr.io/coredns:v1.8.6": failed to resolve reference "k8s.gcr.io/coredns:v1.8.6": k8s.gcr.io/coredns:v1.8.6: not found
```
/kind bug
To sum up:
This is fixed in #7787, but I will keep this issue open for a few days to watch the CI signal and will close it after. In the CAPI release-1.1 branch e2e tests the registry was pinned to `k8s.gcr.io` by default; removing the registry pinning and leaving it empty solved the CI failure reported in this issue. Also, please check the comments/suggestions from @sbueringer here and here to further understand the root cause of the issue and the possible ways to fix it going forward in case you are seeing it.