cri-o 1.17.0 breaks ARM clusters #3233
We should update containers/image to v5.2.0, which relaxes the arch checks that were added with v5.1.0. |
Still the case with cri-o 1.17.3
I am really disappointed that this happened. Don't you have any tests in place for common use cases? |
It's hard to test against broken images and, without a doubt, pause:3.1 is broken. In the meantime, it has been fixed and released as pause:3.2. Can you set |
I understand the argument that pause is broken - but if that is deemed broken there are many many images that are built that way and worked fine with cri-o 1.16.1 and stopped working with cri-o 1.17.0. From my perspective as a user it feels like crio is a little ... arrogant to force the world to change the behavior just to please its model of a perfect world not taking into account the current state of affairs. Don't get me wrong, it was a conscious decision for me to use cri-o instead of another container runtime on the ARM cluster exactly to be able to provide feedback and help testing it out. But if it turns out that I am stuck with an old version of cri-o I won't be able to do that. |
Note that this regression is unintentional.
We are between a rock and a hard place, since ARM users especially needed support for reliably selecting the correct image/manifest from a manifest list (or OCI index), with extended support for selecting by variant. The error here is new. I believe we can account for cases where the list/index claims a different arch/os than the image's manifest (and use these as a fallback when selecting the manifests). @mtrmac @nalind WDYT?
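To make the fallback idea concrete, here is a minimal sketch of how I read it; it is not the containers/image code. The `Platform` and `Entry` types, the `choose` function, the `strict` flag and the digests are all hypothetical names made up for illustration. The list entry is matched first, and a disagreeing image is only rejected in strict mode.

```go
package main

import "fmt"

// Platform is a hypothetical, simplified OS/architecture pair.
type Platform struct {
	OS, Arch string
}

// Entry is a hypothetical manifest list entry. Listed is what the list or
// index claims; InImage is what the referenced image says about itself.
type Entry struct {
	Digest  string
	Listed  Platform
	InImage Platform
}

// choose picks the first entry whose listed platform equals want.
// With strict set, a disagreeing image is rejected; otherwise the list's
// claim is used as the fallback and the entry is accepted anyway.
func choose(entries []Entry, want Platform, strict bool) (string, error) {
	for _, e := range entries {
		if e.Listed != want {
			continue
		}
		if e.InImage != want && strict {
			return "", fmt.Errorf("image %s claims %s/%s, wanted %s/%s",
				e.Digest, e.InImage.OS, e.InImage.Arch, want.OS, want.Arch)
		}
		return e.Digest, nil
	}
	return "", fmt.Errorf("no entry for %s/%s in list", want.OS, want.Arch)
}

func main() {
	// Placeholder digests; the second entry mimics a cross-built image
	// listed as arm64 whose own metadata still says amd64.
	entries := []Entry{
		{Digest: "sha256:aaa", Listed: Platform{OS: "linux", Arch: "amd64"}, InImage: Platform{OS: "linux", Arch: "amd64"}},
		{Digest: "sha256:bbb", Listed: Platform{OS: "linux", Arch: "arm64"}, InImage: Platform{OS: "linux", Arch: "amd64"}},
	}
	want := Platform{OS: "linux", Arch: "arm64"}

	_, err := choose(entries, want, true)
	fmt.Println("strict:", err)

	digest, err := choose(entries, want, false)
	fmt.Println("fallback:", digest, err)
}
```

In strict mode the mislabelled arm64 entry is refused, which roughly mirrors the regression; with the fallback the list's claim wins and the entry is accepted.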
You can do that by pulling by digest.
Did you try that? |
That would be lovely! Resorting to an image hash would be indeed a solution, thanks for pointing that out, though it is quite inconvenient.
Yes, that worked, for the pause image. After about 10 minutes trying to (re)start pods crio crashes, but this could be totally unrelated and caused by the underlying hardware, but in case it is of any help:
I will attach the stack trace from the go process. I really do look forward to some sort of fallback solution when choosing the right container image like you drafted. Sounds like the best of both worlds. Thanks for taking care! |
In the cases where I've seen them disagree, the manifest list was correct while the image's manifest was not. Looking at |
@nalind, the "arm64" entry points to a manifest which claims to be for "amd64". The variant selection code will also match if the variant is empty. |
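A rough sketch of the variant behaviour described here, assuming that "empty" refers to the variant on the list entry. This is not the actual selection code; `Platform` and `matches` are hypothetical names used only for illustration.

```go
package main

import "fmt"

// Platform is a hypothetical, simplified platform description.
type Platform struct {
	OS, Arch, Variant string
}

// matches reports whether a list entry is acceptable for want.
// Assumption: an entry with no variant matches any wanted variant.
func matches(entry, want Platform) bool {
	if entry.OS != want.OS || entry.Arch != want.Arch {
		return false
	}
	return entry.Variant == "" || entry.Variant == want.Variant
}

func main() {
	want := Platform{OS: "linux", Arch: "arm64", Variant: "v8"}

	// Entry without a variant: accepted for arm64/v8.
	fmt.Println(matches(Platform{OS: "linux", Arch: "arm64"}, want))
	// Entry with the same variant: accepted.
	fmt.Println(matches(Platform{OS: "linux", Arch: "arm64", Variant: "v8"}, want))
	// Different architecture: rejected regardless of variant.
	fmt.Println(matches(Platform{OS: "linux", Arch: "amd64"}, want))
}
```

Under this assumption the list-level match succeeds for the "arm64" entry, so the mismatch only shows up later, when the manifest it points to claims "amd64".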
The error message in #3233 (comment) looks like it's coming from |
There was never any conscious decision to break images, even if for some bigger benefit. In the first instance, we were running into cases of users using mismatching architectures when they didn’t mean to and we weren’t aware of images that the users wanted to run that were malformed. This instance is a bug in a supposed improvement to ARM support (by differentiating the various ARM variants). I’m sorry that we didn’t notice this case — the code looked good, had tests, it is just a situation that was missed, apparently by ARM users who wanted this improvement as well. I think the primary thing that could help for all such similar cases is having ARM clusters in CI, if someone were interested in making that work and there were infrastructure available to run the tests there (and if it didn’t slow down the testing significantly). Or maybe just a sanity check to run on ARM before a release. Either way, I’m not aware of the current CI architecture having ARM hosts — but that may well just be my ignorance.
Yup. To reproduce on non-ARM, $ skopeo --override-os linux --override-arch arm --override-variant v7 copy docker://k8s.gcr.io/pause:3.1 dir:t |
I didn't want to sound rude in any way, sorry if that was the case. Thanks for explaining the situation! While I don't have personal experience with it, the Amazon EC2 Graviton instances might be worth a try since they are massively faster than raspberry or other SBCs. Amazon even offers credits for opensource projects: https://aws.amazon.com/blogs/opensource/aws-promotional-credits-open-source-projects/ |
True. I misread the code while skimming. |
#3626 was filed for the “new” variant of this issue, do we want to keep two issues for two variants of the problem (and close this one as fixed previously), or to keep a single issue (and close one of the two as a duplicate)? For the record, this is tracked in containers/image#898 . |
I agree that these two issues are related. The only thing setting them apart, which might be a reason to keep them separate, is that #3626 was introduced in 1.17.3. |
It looks to me like this issue was just backported to 1.16.6. Two days ago, an update came through the opensuse repo I'm using for cri-o (cri-o-1.16/unknown,now 1.16.6~1 arm64 [installed]) and since then, I'm getting "Error choosing an image from manifest list docker://k8s.gcr.io/pause:3.1: no image found in manifest list for architecture arm64, variant v8, OS linux" for every image on my Raspberry Pi K8s cluster. Can someone confirm? |
Hi, I'm getting a similar error:
See below my configuration:
I'm working on a Raspberry Pi 4. If you need more information, just let me know. |
I ran into this error while deploying kubespray on a Pine64 cluster running Fedora 32. I'd like to continue with kube 1.18, but the calico images will not pull with crictl because of this issue. Nodes are Pine64 Sopine modules, which report arm64 v8.
|
@therevoman AFAICT, this will be fixed in #3797. I will work on backporting the required PRs to 1.17 and 1.16. thanks for the patience folks 👍 |
Hi, is there any update? It still isn't working. |
what version @steled ? |
Hi @haircommander, see below a list of some information.
Hope this helps. |
sorry for the delay @steled this is fixed in cri-o 1.17.5, can you try that version? |
Hi @haircommander, no problem.
I also can't find that version at: http://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stable/Raspbian_10/ Am I missing a source? |
oops my mistake, we never actually released 1.17.5. I am working with the other maintainers on getting it out the door! |
Sounds great. Will wait for the new release. |
1.17.5 is in the OBS repo now: Also, please refer to install.md for newer installation instructions. I see you're downloading using the old strategy (from the main libcontainers:stable repo) rather than the new strategy (creating specific cri-o releases in a libcontainers:stable:cri-o repo). I've attempted to add Raspbian 10 and 9 to meet your needs :smiley: |
Hi @haircommander, thanks for your support and the tip with the new strategy. EDIT: EDIT 2: |
oh goodness, yes I did seem to break something. looking into it @steled |
okay @steled the issue should be fixed on Debian. there's something up with the Raspbian installation that I don't understand, that will have to wait for tomorrow, unless Debian installation meets your needs |
Good morning @haircommander, I'm using Debian as test environment and Raspbian as prod so Debian is fine for now. Thanks and best regards |
Hi @haircommander, just to let you know. Best regards and a nice weekend |
Hi @haircommander, I tested a clean install of
I can't find a |
interesting, |
ok I opened ticket #4177 ... I think that the |
it seems that the part
I added at the file without this I get the following error at pod creation:
|
that seems to be from the same cause as #4177, for some reason, the debian builds aren't picking up overridden defaults |
I am closing this, as 1.17.5 has been released. please reopen if you disagree |
sorry, but is it right that the builds for Raspbian failed? see: https://build.opensuse.org/package/live_build_log/devel:kubic:libcontainers:stable:cri-o:1.17/cri-o/Raspbian_10/armv7l It also seems that there isn't a new build at: https://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stable:/cri-o:/1.17/Raspbian_10/ btw, it seems that I can't reopen the issue. |
After updating cri-o to 1.17.0, my ARM cluster broke: the essential pause image can no longer be pulled, and pods do not start after the upgrade. The images are cross-built on amd64 for arm64, which results in an image that is in fact usable on arm64, but cri-o refuses to try it.
Trying to pull manually:
From the logfiles
The general problem was reported in containers/podman#4849, and a workaround for podman was implemented.
As long as cross-platform builds result in images tagged with the wrong architecture, I see it as a bug in cri-o that it cannot use images that worked in the past. AFAIK there is no easy fix for the bad tagging, especially with images built on Docker Hub, where there are limited options to influence the build process.
Current workaround: Downgrade to cri-o 1.16.1