Networking issue between nodes #6307

Closed
jvnvenu opened this issue Jul 11, 2024 · 7 comments

@jvnvenu

jvnvenu commented Jul 11, 2024

Environmental Info:
RKE2 Version:
v1.30.2+rke2r1

Node(s) CPU architecture, OS, and Version:
3 Linux nodes (RHEL 8), 2 Windows nodes (Server 2019)
All of the above machines are VMware VMs

Cluster Configuration:
1 server (Linux)
4 agents (2 Linux and 2 Windows)
Flannel CNI

Describe the bug:
Except for the first node, none of the nodes can access the Kubernetes API or resolve DNS.
The nodes can reach each other directly without any issues.

Expected behavior:
Should not see any connectivity issues between nodes

Actual behavior:
Many connectivity issues

@brandond
Member

Except for the first node, none of the nodes can access the Kubernetes API or resolve DNS.

This indicates that VXLAN CNI traffic between nodes is being dropped. You should:

  1. Ensure that all of your nodes have the Flannel VXLAN port open to each other - see https://docs.rke2.io/install/requirements#inbound-network-rules
    UDP port 4789, from all RKE2 nodes to all RKE2 nodes (Flannel CNI with VXLAN)
  2. Ensure that your traffic is not being dropped due to a well-known bug in the VMware virtual ethernet driver that affects VXLAN traffic.
    You can work around this by running /usr/sbin/ethtool -K flannel.4096 tx-checksum-ip-generic off on the Linux nodes.
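
Both checks above can be scripted before changing anything on a node. The following is a minimal sketch, not part of RKE2: the function names are made up for illustration, and it assumes that ethtool and ip (iproute2) are installed and that the Flannel VXLAN device is named flannel.4096, as in the command above. The two parsers read command output from stdin, so they can be tried without touching a live interface.

```shell
# Print the state ("on" or "off") of tx-checksum-ip-generic,
# given the output of `ethtool -k <device>` on stdin.
checksum_offload_state() {
  awk '$1 == "tx-checksum-ip-generic:" { print $2; exit }'
}

# Print the UDP destination port of a VXLAN device,
# given the output of `ip -d link show <device>` on stdin.
vxlan_dstport() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "dstport") { print $(i + 1); exit } }'
}

# Hypothetical helper: disable TX checksum offload on the given device,
# but only if it is currently enabled. Needs root and a real interface.
disable_tx_checksum_offload() {
  dev="${1:-flannel.4096}"   # device name assumed from the thread
  state="$(ethtool -k "$dev" | checksum_offload_state)"
  if [ "$state" = "on" ]; then
    ethtool -K "$dev" tx-checksum-ip-generic off
  fi
}
```

On a Linux node, ip -d link show flannel.4096 | vxlan_dstport prints the port the device actually uses, which is the port that has to be open between nodes, and disable_tx_checksum_offload flannel.4096 applies the workaround only where offload is still enabled.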

@jvnvenu
Author

jvnvenu commented Jul 11, 2024

Great, it works. Thanks a lot. Will this be taken care of in a future version, so that the workaround is no longer needed?

@brandond
Member

brandond commented Jul 11, 2024

No. The workaround disables hardware checksum offload, which imposes a significant performance hit. It should only be used on nodes with buggy NIC drivers that fail to correctly calculate checksums for VXLAN packets. Preferably you would use a different virtual interface type, or seek help from VMware to update the driver to a version not affected by this bug.
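
Because the workaround costs performance, it can be gated on the NIC driver instead of being applied to every node. A rough sketch follows; the helper names are hypothetical, and both the uplink name ens192 and the driver name vmxnet3 are assumptions used for illustration (the thread does not say which driver is affected), so substitute whatever ethtool -i reports on your nodes.

```shell
# Print the kernel driver name, given the output of `ethtool -i <nic>` on stdin.
nic_driver() {
  awk '$1 == "driver:" { print $2; exit }'
}

# Hypothetical helper: apply the workaround only when the uplink NIC
# uses the suspect driver. All three defaults are assumed example values.
apply_workaround_if_affected() {
  uplink="${1:-ens192}"          # physical/virtual uplink NIC (assumed name)
  suspect="${2:-vmxnet3}"        # driver believed to be affected (assumption)
  vxlan_dev="${3:-flannel.4096}" # VXLAN device named in the thread
  driver="$(ethtool -i "$uplink" | nic_driver)"
  if [ "$driver" = "$suspect" ]; then
    ethtool -K "$vxlan_dev" tx-checksum-ip-generic off
  else
    echo "driver is '$driver'; leaving checksum offload enabled on $vxlan_dev"
  fi
}
```

Nodes whose uplink reports a different driver keep hardware checksum offload and avoid the performance penalty.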

@dfaltum

dfaltum commented Jul 12, 2024

@brandond So this issue is related to VMware (and some other drivers)? I thought this problem was caused by a kernel bug.

@brandond
Member

brandond commented Jul 12, 2024

That was one cause of it. IIRC there is another, more common bug in one of the virtual NIC drivers, which is why it's most commonly seen on VMware.

projectcalico/calico#4727 (comment)

@rabejens

rabejens commented Dec 3, 2024

I am having the same issue: DNS times out on my Windows nodes. I am using VMs on Proxmox. I disabled the Windows firewall, to no avail. Am I missing anything else? I just installed RKE2 as per the Quick Start.

@manuelbuil
Contributor

There is an issue on Windows machines that have the latest patches installed. Maybe you are affected by this? This is the workaround: microsoft/Windows-Containers#516 (comment)
