calico-node 3.27.2 fails to start on arm: libpcap.so.0.8: cannot open shared object file #8541
Comments
Should be fixed by PR #8533, but unfortunately it was too late for v3.27.2 (was v3.27.1). |
@hjiawei is there any workaround using a newer container version for the node? If not, do you know if rolling back to 3.27.0 is safe? |
Also waiting on this fix. Do you have a timeline for the v3.27.3 release? |
Same error on Fedora 39 on ARM (only available packages: libcap-ng, libcap.so.2, libcap.so.2.48). |
While I'm just beating a dead horse - this is also broken on Bottlerocket on ARM hosts. |
Rolling back to 3.27.0 should be safe. 3.27.3 with the fix is expected in late March. |
adding myself to get notifications ^^ |
Waiting for a fix too, rolling back to 3.27.0 worked fine on Apple Silicon powered VM.
Spent hours trying to get Calico to work on ARM hosts, only to eventually find this and roll back to 3.27.0. This is a nasty bug! |
Nice this is in progress, adding myself for future notification too |
This is high impact. Hope to see a release this week as promised. Thank you! |
As an end user: I'm unlikely to adopt future builds aggressively and will be waiting for downstream projects to dogfood a release before adoption. It's unclear whether the devs released this knowing there was a defect, or whether the testing in place is insufficient to catch such an obvious problem. The latter is at least correctable; the former creates trust issues. As a developer: rushing is how mistakes are made, and I have no inclination to rush other developers. We get it when it's ready. 👍 Thanks to the devs for all their hard work on this project and related projects! 👏 |
I disagree. This has a huge impact on new users. When a new user inevitably grabs the latest version for their clusters, they will tear their hair out for hours wondering "why isn't my cluster working" on ARM machines. This isn't an issue with Calico alone; a lot of k8s ecosystem projects are like this. Currently, Rook's Helm charts aren't working, and that has cost me 2 weeks of troubleshooting (so far). Generally, a lot of these projects are half-baked and not mature. Coming from the Java ecosystem, where "everything works, even when always upgrading to latest", it just isn't the same with the k8s ecosystem. |
I strongly agree. It was pretty frustrating on my side to only detect the issue in the cluster after one week. Since this is a well-known issue, with several duplicate reports of the same problem, the documentation page should at least be updated to recommend installing version 3.27.0 and skipping the problematic version. |
Our internal testing of ARM64 builds in particular has been minimal and we have largely been relying on community support for verification and maintenance of these builds until now. I can say that we are working on adding test environments running on ARM as we speak in order to prevent these types of issues in the future. It takes time and resources to build out full e2e runs for the various combinations of installer, cloud, distro, architecture, etc. So this will be an ongoing project to fill out the matrix. Thanks all for your help and patience to-date! |
testing efficacy aside, the thing that would really help everyone is that when a defect is discovered in a release leaving the release unusable, i.e. no workaround other than to elect a different version, and a fix cannot be released within some reasonable time frame (you decide, but 24-48 hours seems fair IMO) the release page could be manually updated with a NOTE: Breaking change warning users, and if available, linking to the Issue(s) that discuss the break.
it's not an admittance of fault or a promise to fix, it's a respectable warning to the community that they may want to elect a different version if the Issue would affect them (or their downstream users.) it's better than taking a release down. it's better than combing through posts from frustrated users.
currently the release page for 3.27.2 makes no mention that it will not work with arm64 nodes, but this defect/Issue has been open for roughly a month. if the release notes for 3.27.2 linked to this issue myself and others could have saved time by electing a prior release and subscribing to this issue to wait for closure like normal people, and wouldn't have given it a second thought.
when i push new versions of critical components (and it doesn't get more critical than a CNI upgrade) i actually pull a couple nodes from my cluster (varying by OS version and platform architecture), reset them, and rejoin them, then verify logs and functionality. so i caught this defect literally within minutes of updating to 3.27.2 -- this is what good operators should be doing. even so, it still took me several hours before finding this github issue to know what i could safely roll back to.
customers that don't properly test or have unreasonable expectations that everything in the world is going to be problem free are likely going to be more frustrated than i was, as a matter of fact i wasn't frustrated by this at all, more "worried" that an irreversible change may have been made (not the case) and I would be doing a full cluster rebuild on my weekend (the "big yeet" that never happened, thankfully.) anyway, thanks for reading my wall of text. i'll not respond again, i just wanted to impress the value/importance of warning users on the release page to avoid some grief. EDIT: ++ @danudey since he looks uniquely-invested in release-related doco. |
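For anyone who wants to automate the "rejoin a couple of nodes, then verify" step described above, here is a minimal sketch (not an official Calico tool). It assumes you have already captured the JSON output of `kubectl get pods -o json` for the calico-node pods; the function name `crashlooping_pods` is hypothetical, and the only fields it relies on are the standard Kubernetes pod status fields.

```python
import json


def crashlooping_pods(pod_list: dict) -> list[tuple[str, str]]:
    """Return (pod name, node name) for every pod that has at least one
    container waiting with reason CrashLoopBackOff.

    `pod_list` is the parsed JSON of a Kubernetes PodList, e.g. the output
    of `kubectl get pods -o json` loaded with json.loads().
    """
    failing = []
    for pod in pod_list.get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for status in statuses:
            # Each container state has exactly one of: running, waiting, terminated.
            waiting = status.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                name = pod["metadata"]["name"]
                node = pod.get("spec", {}).get("nodeName", "")
                failing.append((name, node))
                break  # one failing container is enough to flag the pod
    return failing
```

To use it, save the pod list to a file (the namespace and label selector depend on how Calico was installed, so treat them as assumptions for your cluster), parse it with `json.load`, and pass the result to `crashlooping_pods`. An empty list after rejoining a node of each architecture is a reasonable first signal that the upgrade is safe to roll out further.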
IMHO since the new release wasn't done this week as promised, the release / home page must have a warning |
Calico v3.27.3 went out today after a minor delay. Thank you for your patience, and please let us know if this issue is resolved when you get a chance to test it out. |
Thanks @caseydavenport |
Fresh cluster installations with the new calico version work fine, too! Thanks a lot :) |
It works now, I just found it a bit weird that apparently the fix for this was available 1.5 months ago but even though it was a critical issue, it was done in a regular release instead of a hotfix. |
I understand the frustration with the time to release here, and I 100% want it to be quicker. But as an engineering team, especially one working on a free open-source project with a large number of different users and use cases, we have a variety of pressures on our time and priorities. We don't delay releases because we want to. |
We upgraded from calico 3.27.0 to 3.27.2 due to #8383. We upgraded by upgrading the tigera operator. Everything went smoothly except for calico-node on our arm servers. They go into CrashLoopBackOff with the following log entries:
Your Environment