
calico-node 3.27.2 fails to start on arm: libpcap.so.0.8: cannot open shared object file #8541

Closed
mzhaase opened this issue Feb 21, 2024 · 23 comments

Comments

@mzhaase

mzhaase commented Feb 21, 2024

We upgraded from Calico 3.27.0 to 3.27.2 because of #8383, by upgrading the tigera operator. Everything went smoothly except for calico-node on our ARM servers: those pods go into CrashLoopBackOff with the following log entries:

Calico-node: error while loading shared libraries: libpcap.so.0.8: cannot open shared object file: No such file or directory
Calico node failed to start
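
For anyone hitting the same thing, this is roughly how to confirm which calico-node pods are affected and pull the loader error out of the logs (a sketch assuming an operator-managed install, where the pods live in the calico-system namespace):

    # List the calico-node pods; the affected ones show CrashLoopBackOff
    kubectl get pods -n calico-system -l k8s-app=calico-node -o wide

    # Grab the logs of the previous (crashed) container instance;
    # <failing-pod> is a placeholder for one of the pod names listed above
    kubectl logs -n calico-system <failing-pod> -c calico-node --previous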

Your Environment

  • Calico 3.27.2
  • Kubernetes 1.29.1
@hjiawei
Contributor

hjiawei commented Feb 21, 2024

Should be fixed by PR #8533, but unfortunately too late for v3.27.2 (was v3.27.1).

@mzhaase
Author

mzhaase commented Feb 22, 2024

@hjiawei is there any workaround, e.g. using a newer container version for the node? If not, do you know if rolling back to 3.27.0 is safe?

@pmcgrath-mck

Also waiting on this fix; not sure if you have a timeline for the v3.27.3 release?

@RyrieNorth

@hjiawei is there any workaround, e.g. using a newer container version for the node? If not, do you know if rolling back to 3.27.0 is safe?

Try to use it?

@Tcharl

Tcharl commented Feb 23, 2024

Same error on Fedora 39 on ARM (the only available packages are libcap-ng, libcap.so.2, and libcap.so.2.48).
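
For what it's worth, the missing library is libpcap (not libcap), and the dynamic loader is looking for it inside the calico/node container image rather than on the host, so host packages won't change anything. A rough way to reproduce the loader error outside Kubernetes, assuming Docker on an arm64 machine and assuming the binary sits at /bin/calico-node inside the image:

    # Running the calico-node binary straight from the v3.27.2 image on arm64
    # should fail immediately with the same libpcap.so.0.8 loader error
    docker run --rm --platform linux/arm64 --entrypoint /bin/calico-node \
        docker.io/calico/node:v3.27.2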

@diranged

At the risk of beating a dead horse: this is also broken on Bottlerocket on ARM hosts.

@matthewdupre
Member

Rolling back to 3.27.0 should be safe. 3.27.3 with the fix is expected in late March.
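
For anyone who needs the rollback in the meantime, a rough sketch of the two common paths (the Helm release name "calico" and the "tigera-operator" namespace below are assumptions; adjust them to match your install):

    # Helm-based tigera-operator install: pin the chart back to v3.27.0
    helm upgrade calico projectcalico/tigera-operator \
        --version v3.27.0 --namespace tigera-operator

    # Manifest-based install: swap the operator manifest back to v3.27.0
    # (replace rather than apply, since the manifest is too large for the
    # last-applied-configuration annotation)
    kubectl replace -f https://raw.githubusercontent.com/projectcalico/calico/v3.27.0/manifests/tigera-operator.yaml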

@tibeer

tibeer commented Mar 5, 2024

adding myself to get notifications ^^

@tmtiwari

Waiting for a fix too; rolling back to 3.27.0 worked fine on an Apple Silicon powered VM.

Rolling back to 3.27.0 should be safe. 3.27.3 with the fix is expected in late March.

@fraserds

Spent hours trying to get Calico to work on ARM hosts, only to eventually find this issue and roll back to 3.27.0. This is a nasty bug!

@bensoille

Nice that this is in progress; adding myself for future notifications too.
Working on Ubuntu arm64 after downgrading to 3.27.0.

@lprimak

lprimak commented Mar 25, 2024

This is high impact. Hope to see a release this week as promised. Thank you!

@wilson0x4d

As an end user: pulling 3.27.0 into my cluster worked while waiting for the regular dev-test-release cycle to complete, so aside from the troubleshooting this has had zero impact for me. It would have been nice if the 3.27.2 release description had noted that there is a defect for arm64 and that it shouldn't be used with any arm64 clusters; I could have saved some time (along with others). The devs could consider updating release notes for defects like this as a matter of practice (i.e. a "Known Issues" section and a link to anything that is "breaking" in the release).

I'm unlikely to adopt future builds aggressively and will be waiting for downstream projects to dogfood a release before adopting it. It's unclear if the devs released this knowing there was a defect, or if the testing in place is insufficient to catch such an obvious problem. The latter is at least correctable; the former creates trust issues.

As a developer; Rushing is how mistakes are made, I have no inclination to rush other developers. We get it when it's ready. 👍 Thanks to the devs for all their hard work on this project and related projects! 👏

@lprimak

lprimak commented Mar 27, 2024

I disagree. This has a huge impact on new users. When a new user inevitably grabs the latest version for their cluster, they will tear their hair out for hours wondering "why isn't my cluster working?" on ARM machines, especially since this isn't documented in an appropriate place.

This isn't an issue with Calico alone; a lot of k8s ecosystem projects are like this. Currently, Rook's Helm charts aren't working, and it has cost me 2 weeks (so far) to troubleshoot that.

Generally, a lot of these projects are half-baked and not mature.
Just an observation, not playing the blame game here.

Coming from the Java ecosystem, where "everything works, even when always upgrading to latest," it just isn't the same in the k8s ecosystem.

@josephrodriguez

I disagree. This has a huge impact on new users. When a new user inevitably grabs the latest version for their cluster, they will tear their hair out for hours wondering "why isn't my cluster working?" on ARM machines, especially since this isn't documented in an appropriate place.

This isn't an issue with Calico alone; a lot of k8s ecosystem projects are like this. Currently, Rook's Helm charts aren't working, and it has cost me 2 weeks (so far) to troubleshoot that.

Generally, a lot of these projects are half-baked and not mature. Just an observation, not playing the blame game here.

Coming from the Java ecosystem, where "everything works, even when always upgrading to latest," it just isn't the same in the k8s ecosystem.

I strongly agree; it was pretty crazy on my side to only detect the issue in the cluster after a week.

Since this issue is so well known (there are several duplicate issues reporting the same problem), the documentation page should at least be updated to point at version 3.27.0 and skip the problematic version.

@caseydavenport
Member

Our internal testing of ARM64 builds in particular has been minimal and we have largely been relying on community support for verification and maintenance of these builds until now. I can say that we are working on adding test environments running on ARM as we speak in order to prevent these types of issues in the future. It takes time and resources to build out full e2e runs for the various combinations of installer, cloud, distro, architecture, etc. So this will be an ongoing project to fill out the matrix. Thanks all for your help and patience to-date!

@wilson0x4d

wilson0x4d commented Mar 31, 2024

Our internal testing of ARM64 builds in particular has been minimal and we have largely been relying on community support for verification and maintenance of these builds until now

Testing efficacy aside, the thing that would really help everyone is this: when a defect discovered in a release leaves the release unusable (i.e. there is no workaround other than to elect a different version) and a fix cannot be released within some reasonable time frame (you decide, but 24-48 hours seems fair IMO), the release page could be manually updated with a NOTE about the breaking change, warning users and, if available, linking to the issue(s) that discuss the break.

It's not an admission of fault or a promise to fix; it's a respectful warning to the community that they may want to elect a different version if the issue would affect them (or their downstream users). It's better than taking a release down, and it's better than combing through posts from frustrated users.

Currently the release page for 3.27.2 makes no mention that it will not work with arm64 nodes, even though this defect/issue has been open for roughly a month. If the release notes for 3.27.2 had linked to this issue, myself and others could have saved time by electing a prior release, subscribing to this issue to wait for closure like normal people, and never giving it a second thought.

When I push new versions of critical components (and it doesn't get more critical than a CNI upgrade), I actually pull a couple of nodes from my cluster (varying by OS version and platform architecture), reset them, and rejoin them, then verify logs and functionality. So I caught this defect literally within minutes of updating to 3.27.2; this is what good operators should be doing. Even so, it still took me several hours before finding this GitHub issue and learning what I could safely roll back to. Customers that don't properly test, or that have unreasonable expectations that everything in the world is going to be problem-free, are likely to be more frustrated than I was. As a matter of fact, I wasn't frustrated by this at all, more "worried" that an irreversible change may have been made (not the case) and that I would be doing a full cluster rebuild on my weekend (the "big yeet" that never happened, thankfully).
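
For reference, the post-upgrade checks on the rejoined nodes look roughly like this (a sketch assuming an operator-managed install with the calico-system namespace):

    # Wait for the calico-node daemonset to finish rolling out
    kubectl rollout status daemonset/calico-node -n calico-system

    # Confirm the pods on the rejoined nodes are Running and Ready, then
    # eyeball their logs; <pod-on-rejoined-node> is a placeholder
    kubectl get pods -n calico-system -l k8s-app=calico-node -o wide
    kubectl logs -n calico-system <pod-on-rejoined-node> -c calico-node | tail -n 50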

Anyway, thanks for reading my wall of text. I'll not respond again; I just wanted to impress the value and importance of warning users on the release page, to avoid some grief.

EDIT: ++ @danudey since he looks uniquely-invested in release-related doco.

@lprimak

lprimak commented Mar 31, 2024

IMHO, since the new release wasn't done this week as promised, the release / home page must have a warning.

@caseydavenport
Member

Calico v3.27.3 went out today after a minor delay. Thank you for your patience, and please let us know if this issue is resolved when you get a chance to test it out.

@lprimak

lprimak commented Apr 3, 2024

Thanks @caseydavenport
I just upgraded my cluster and it works.
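
In case it helps anyone else confirm the upgrade actually landed, a quick check (again assuming the operator-managed calico-system namespace):

    # Should report the v3.27.3 image, with all calico-node pods Running and Ready
    kubectl get daemonset calico-node -n calico-system \
        -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
    kubectl get pods -n calico-system -l k8s-app=calico-node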

@tibeer

tibeer commented Apr 3, 2024

Fresh cluster installations with the new calico version work fine, too! Thanks a lot :)

@mzhaase
Author

mzhaase commented Apr 3, 2024

It works now. I just found it a bit weird that the fix for this was apparently available 1.5 months ago, but even though it was a critical issue, it went out in a regular release instead of a hotfix.

@mzhaase mzhaase closed this as completed Apr 3, 2024
@caseydavenport
Member

I understand the frustration with the time to release here, and I 100% want it to be quicker. But as an engineering team, especially one working on a free open-source project with a large number of different users and use cases, we have a variety of pressures on our time and priorities. We don't delay releases because we want to.
