Remove NetworkManager-cloud-setup RPM if present #1136

Closed

@jdn5126 (Contributor) commented Dec 19, 2022

Issue #, if available:
N/A

Description of changes:
This is very similar to #368, which removed the ec2-net-utils package. A customer using this repo with a RHEL8 base image ran into pod networking issues caused by the presence of the NetworkManager-cloud-setup RPM. There is a RedHat support article on this: https://access.redhat.com/solutions/6319811

Though this only affected custom RHEL AMIs, this change is desired as we want to make sure this package never gets into EKS AMIs. Also, customers not using AL2 still depend on this repo.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Testing Done
Verified that the yum command does not cause any issues. The customer also verified the change in their environment.
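The diff itself is not shown in this thread, so the following is only a minimal sketch of the "remove if present" idea, not the PR's actual change. It assumes an RPM-based host where `rpm -q` exits non-zero for a package that is not installed; the function name `remove_if_present` and the injectable `runner` parameter are hypothetical and exist only to make the logic testable.

```python
import subprocess
from typing import Callable, Sequence

# Package named in the PR title.
PACKAGE = "NetworkManager-cloud-setup"


def run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code (no exception on failure)."""
    return subprocess.run(list(cmd), check=False).returncode


def remove_if_present(package: str = PACKAGE,
                      runner: Callable[[Sequence[str]], int] = run) -> bool:
    """Remove `package` with yum only when rpm reports it as installed.

    Returns True if a removal was performed, False if the package was absent.
    """
    # `rpm -q <name>` exits 0 only when the package is installed.
    if runner(["rpm", "-q", package]) != 0:
        return False
    # `-y` answers yes to the confirmation prompt so the step is non-interactive.
    if runner(["yum", "remove", "-y", package]) != 0:
        raise RuntimeError(f"yum failed to remove {package}")
    return True
```

Checking with `rpm -q` first keeps the step a no-op on images that never shipped the package, which is the case described for AL2 later in this thread.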

@cartermckinnon (Member)

> A customer using this repo with a RHEL8 base image ran into pod networking issues caused by the presence of NetworkManager-cloud-setup RPM.

Can you add details about what the issue was?

@jdn5126 (Contributor, Author) commented Dec 23, 2022

> > A customer using this repo with a RHEL8 base image ran into pod networking issues caused by the presence of NetworkManager-cloud-setup RPM.
>
> Can you add details about what the issue was?

Hey, yeah, the main issue is that the nm-cloud-setup service periodically deletes the ip rules installed by the VPC CNI, and those rules are needed to route packets from the container. The nm-cloud-setup utility installs a rule for every secondary IP configured on an ENI.

@cartermckinnon (Member)

Hm, makes sense to me. I don't know whether this is in AL2022 or not, but it seems fine to prevent it from surprising us in the future.

VPC CNI docs agree: https://github.com/aws/amazon-vpc-cni-k8s/blob/0fd2b308385ac4b1bcaa98cd23cd72b263c645d5/docs/troubleshooting.md#known-issues

> NetworkManager-cloud-setup - The nm-cloud-setup service is incompatible with AWS VPC CNI. This service overwrites and clears ip rules installed for pods, which breaks pod networking. This package may be present in RHEL8 AMIs. See here for a RedHat thread explaining the issue. The symptom for this issue is the presence of routing table 30200 or 30400.

Though, I'm wondering if the VPC CNI could disable this systemd unit itself?
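The symptom quoted above (routing table 30200 or 30400) can be checked mechanically. The sketch below is one possible way to do that, assuming the text comes from `ip rule show` and that each rule names its table after the word `lookup`; the function name `suspect_rules` is hypothetical.

```python
import re
from typing import List

# Table IDs named as the symptom in the VPC CNI troubleshooting doc.
SUSPECT_TABLES = {"30200", "30400"}


def suspect_rules(ip_rule_output: str) -> List[str]:
    """Return the rule lines whose `lookup` target is a suspect table."""
    hits = []
    for line in ip_rule_output.splitlines():
        # Each rule is expected to end with "lookup <table>".
        match = re.search(r"\blookup\s+(\S+)", line)
        if match and match.group(1) in SUSPECT_TABLES:
            hits.append(line.strip())
    return hits
```

On a node, the input could be captured with something like `subprocess.run(["ip", "rule", "show"], capture_output=True, text=True).stdout`. A non-empty result suggests nm-cloud-setup has been managing rules on that host.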

@jdn5126 (Contributor, Author) commented Dec 25, 2022

It's not currently in AL2, but I was worried it could sneak in for a future release, yeah. The reason I removed it here was to follow the same pattern as what was done for ec2-net-utils. And then we get the added benefit of a smaller image to distribute.

@jdn5126 (Contributor, Author) commented Jan 6, 2023

@cartermckinnon any further thoughts?

@jdn5126 (Contributor, Author) commented Jan 12, 2023

We discussed this offline, and it is unlikely that this package will ever get into AL2. There are also workarounds for when a customer actually wants to use this package (aws/amazon-vpc-cni-k8s#1514 (comment)), and VPC CNI documents the issue in its troubleshooting doc (https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/troubleshooting.md). For these reasons, I am dropping this PR.

@jdn5126 closed this Jan 12, 2023

@cartermckinnon (Member)

I discussed this with the team, and we're kind of lukewarm on merging it.

  1. That package isn't in AL2 or AL2022, and we have no reason to think it'd be added in either. This template only explicitly supports AL.
  2. If someone else is using a different base OS (like the RHEL8 customer is), they may want this package installed. Not all users of our template are VPC CNI users.
  3. Removing the package doesn't seem like the only solution to the NetworkManager interaction -- I think you can also re-configure it; see "Some pods don't get host -> pod route on newer Fedora CoreOS version", aws/amazon-vpc-cni-k8s#1514 (comment).
  4. The VPC CNI runs as a privileged pod on each node. It doesn't seem unreasonable for the VPC CNI (which may be used on different OSes) to handle this interaction itself (disable the systemd unit, etc.). This seems like a better approach than having individual distros handle it.

If we can get 4) done, that'd be ideal; users can forget about this. If not, we can add a note to the user guide.
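As a rough illustration of option 4, disabling the systemd units could look like the sketch below. This is not how the VPC CNI behaves today; the unit names `nm-cloud-setup.service` and `nm-cloud-setup.timer` are assumptions based on how the package is commonly shipped on RHEL, and the function names are hypothetical. Running this from a privileged pod would also need access to the host's systemd, which the sketch does not cover.

```python
import subprocess
from typing import List, Sequence

# Assumed unit names; verify them on the target OS before relying on this.
UNITS = ("nm-cloud-setup.service", "nm-cloud-setup.timer")


def disable_commands(units: Sequence[str] = UNITS) -> List[List[str]]:
    """Build one `systemctl disable --now` command per unit."""
    # `--now` stops the unit in addition to disabling it at boot.
    return [["systemctl", "disable", "--now", unit] for unit in units]


def disable_units(units: Sequence[str] = UNITS,
                  dry_run: bool = True) -> List[List[str]]:
    """Return the commands and, unless dry_run is set, execute them."""
    commands = disable_commands(units)
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError if systemctl fails.
            subprocess.run(cmd, check=True)
    return commands
```

The dry-run default makes it possible to log what would change before touching the host. Disabling the timer as well as the service matters because, as described earlier in the thread, the service runs periodically.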
