Ubuntu 21.04 - vxlan failing to route #4188

Closed
clemenko opened this issue Oct 11, 2021 · 26 comments

@clemenko

Environmental Info:
K3s Version:
1.21.5+k3s2

NAME       STATUS   ROLES                       AGE   VERSION        INTERNAL-IP      EXTERNAL-IP   OS-IMAGE       KERNEL-VERSION      CONTAINER-RUNTIME
k3s-85fc   Ready    control-plane,etcd,master   18m   v1.21.5+k3s2   134.122.5.183    <none>        Ubuntu 21.04   5.11.0-18-generic   containerd://1.4.11-k3s1
k3s-a648   Ready    <none>                      18m   v1.21.5+k3s2   104.131.58.161   <none>        Ubuntu 21.04   5.11.0-18-generic   containerd://1.4.11-k3s1
k3s-ad66   Ready    <none>                      18m   v1.21.5+k3s2   143.198.24.211   <none>        Ubuntu 21.04   5.11.0-18-generic   containerd://1.4.11-k3s1

Node(s) CPU architecture, OS, and Version:
Linux k3s-85fc 5.11.0-18-generic #19-Ubuntu SMP Fri May 7 14:22:03 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:
As per above. 3 nodes, 1 master and 2 workers. This is a fresh install of k3s on Ubuntu 21.04.

Describe the bug:
On a fresh install of k3s on Ubuntu 21.04, vxlan is not working: pods are not able to talk to each other across flannel. I have stopped and disabled ufw.

Steps To Reproduce:

  • Build three Ubuntu 21.04 VMs on DigitalOcean.
  • Use k3sup: k3sup install --ip $server --user $user --k3s-extra-args '--no-deploy traefik --debug' --cluster --k3s-channel $k3s_channel --local-path ~/.kube/config
  • Installed K3s:

Expected behavior:
pods talk / ping across the nodes.

Actual behavior:
no ping

Additional context / logs:
journalctl does not show any logs.
net.ipv4.ip_forward = 1 is enabled.

pings are not working.
root@k3s-8fea:~# ip a list cni0 | grep -w inet
    inet 10.42.0.1/24 brd 10.42.0.255 scope global cni0

root@k3s-aaaa:~# ping 10.42.0.1
PING 10.42.0.1 (10.42.0.1) 56(84) bytes of data.
^C
--- 10.42.0.1 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1019ms

Updated both servers

Backporting
No. This seems to be tied to Ubuntu 21.04. This works on 20.10.

@brandond
Member

brandond commented Oct 11, 2021

Can you try disabling IP checksum offload on your nodes? It's possible that the tx checksum offload bug hasn't been fixed in the kernel that Ubuntu is shipping in 21.04.

sudo ethtool -K cni0 tx-checksum-ip-generic off
sudo ethtool -K flannel.1 tx-checksum-ip-generic off
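
To confirm that the setting took effect, the current offload state can be read back with `ethtool -k` (lowercase `k`). The following is a minimal sketch; the helper name `offload_state` is made up for illustration, and it only parses the `feature: state` lines that `ethtool -k` prints.

```shell
# offload_state FEATURE
# Reads `ethtool -k <iface>` output on stdin and prints the state
# ("on" or "off") of the named feature, ignoring suffixes like "[fixed]".
offload_state() {
  awk -v feat="$1" -F':' '
    { name = $1; gsub(/^[ \t]+|[ \t]+$/, "", name) }
    name == feat {
      state = $2
      sub(/^[ \t]+/, "", state)
      split(state, parts, " ")
      print parts[1]
    }
  '
}

# Usage on a node (needs the interface to exist):
#   ethtool -k flannel.1 | offload_state tx-checksum-ip-generic
```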

@clemenko
Author

Nope same result.

root@k3s-8fea:~# ethtool -K cni0 tx-checksum-ip-generic off; ethtool -K flannel.1 tx-checksum-ip-generic off
Actual changes:
tx-checksum-ip-generic: off
tx-tcp-segmentation: off [not requested]
tx-tcp-ecn-segmentation: off [not requested]
tx-tcp-mangleid-segmentation: off [not requested]
tx-tcp6-segmentation: off [not requested]
root@k3s-8fea:~# !ping
ping  10.42.1.1
PING 10.42.1.1 (10.42.1.1) 56(84) bytes of data.
^C
--- 10.42.1.1 ping statistics ---
7 packets transmitted, 0 received, 100% packet loss, time 6129ms

root@k3s-8fea:~# ip a list cni0 | grep -w inet
    inet 10.42.0.1/24 brd 10.42.0.255 scope global cni0

@brandond
Member

brandond commented Oct 11, 2021

Hmm, it looks like I'm seeing the same thing on 21.10 as well. I see the responses come back to the originating host on the wire but they're dropped for some reason. Are you able to use host-gw or some other flannel backend instead, until this can be tracked down?

brandond added this to the v1.22.4+k3s1 milestone Oct 11, 2021
@clemenko
Author

--flannel-backend=host-gw is not working either. Granted, I am using k3sup. How can I validate which backend is active on the master node?
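
One way to check the active backend is to read the flannel configuration that k3s generates on the node. This is a sketch under assumptions: the helper name `flannel_backend_type` is hypothetical, and the file location used in the usage comment (`/var/lib/rancher/k3s/agent/etc/flannel/net-conf.json`) is an assumption about the default k3s layout that may differ between versions.

```shell
# flannel_backend_type
# Reads a flannel net-conf.json document on stdin and prints the value of
# Backend.Type (for example "vxlan" or "host-gw").
flannel_backend_type() {
  # Strip whitespace so the JSON is on one line, then pull out the Type field
  # that sits inside the "Backend" object.
  tr -d ' \t\n' | sed -n 's/.*"Backend":{[^}]*"Type":"\([^"]*\)".*/\1/p'
}

# Usage on a node (path is an assumption, see above):
#   flannel_backend_type < /var/lib/rancher/k3s/agent/etc/flannel/net-conf.json
```

The node annotation `flannel.alpha.coreos.com/backend-type` should carry the same information and can be read with `kubectl describe node`.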

@brandond
Member

brandond commented Oct 11, 2021

host-gw only works if all your nodes are on the same subnet, which from looking at your node addresses it appears they are not.
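
The same-subnet condition can be checked with a little arithmetic. This is a minimal sketch: the function names are made up for illustration, and the /24 prefix in the usage line is an assumption, since the thread does not show the nodes' netmasks.

```shell
# ip_to_int A.B.C.D: print the IPv4 address as a 32-bit integer.
ip_to_int() {
  local IFS=.
  set -- $1            # split the dotted quad into four positional parameters
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# same_subnet IP1 IP2 PREFIXLEN: succeed if both addresses share the network.
same_subnet() {
  local a b mask
  a=$(ip_to_int "$1")
  b=$(ip_to_int "$2")
  mask=$(( (0xffffffff << (32 - $3)) & 0xffffffff ))
  [ $(( a & mask )) -eq $(( b & mask )) ]
}

# Usage with two node addresses from this issue and an assumed /24:
#   same_subnet 134.122.5.183 104.131.58.161 24 || echo "different subnets"
```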

The wireguard backend might be another good option for you, although you'd need to manually install the wireguard package on your nodes before using it. I've just confirmed that it works fine on my nodes, so it does appear to be something specific to vxlan.

vadorovsky self-assigned this Oct 11, 2021
manuelbuil self-assigned this Oct 15, 2021
@manuelbuil
Contributor

I could reproduce this. The problem is that the mac addresses of flannel.1 interfaces are wrong in the bridge tables. In my case:

flannel.1_mac_host1 = 4e:b7:e3:9a:29:ed
flannel.1_mac_host2 = c2:38:72:fe:f9:a7

However, in the bridge forwarding table of host1, the mac address of flannel.1_mac_host2 is 16:d2:08:33:e5:29 and in host2, the mac address of flannel.1_mac_host1 is 86:13:87:80:f8:11. Those mac addresses don't belong to any interface.

As a consequence, I can see the traffic from pod_host1 to pod_host2 encapsulated and reaching eth0 on host2. But when it decapsulates, it searches for the wrong mac and the packet gets dropped.

I need to dig more but I'd say the bug is in flannel or bridge CNI binary (note that flannel binary uses the bridge binary for almost everything)
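
The mismatch described above can be checked by listing the remote VTEP entries in the forwarding table and comparing them with the real `flannel.1` address on the other host. A minimal sketch, assuming the `MAC dst IP ...` line format that `bridge fdb show` prints for vxlan entries; the helper name `fdb_vtep_macs` is hypothetical.

```shell
# fdb_vtep_macs
# Reads `bridge fdb show dev flannel.1` output on stdin and prints one
# "MAC REMOTE_IP" pair per remote VTEP entry.
fdb_vtep_macs() {
  awk '$2 == "dst" { print $1, $3 }'
}

# Usage:
#   on host1:  bridge fdb show dev flannel.1 | fdb_vtep_macs
#   on host2:  cat /sys/class/net/flannel.1/address
# The MAC listed on host1 for host2's IP should equal host2's flannel.1 address.
```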

@Oats87
Member

Oats87 commented Oct 15, 2021

@manuelbuil can you confirm whether the MAC is correct or not on the node annotations?

I'm wondering if we have a race or something that is failing to update the annotation properly

@manuelbuil
Contributor

> @manuelbuil can you confirm whether the MAC is correct or not on the node annotations?
>
> I'm wondering if we have a race or something that is failing to update the annotation properly

I can confirm that the error is in flanneld, not in the binaries. Good point... I have just restarted the flanneld daemonset and it managed to write the correct macs, so it indeed might be a race
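
The MAC that flannel advertises is stored in the node annotation `flannel.alpha.coreos.com/backend-data` as a small JSON document containing a `VtepMAC` field. The sketch below extracts it so it can be compared with the interface's actual address; the helper name `vtep_mac` is made up for illustration.

```shell
# vtep_mac
# Reads the backend-data annotation value (JSON) on stdin and prints VtepMAC.
vtep_mac() {
  sed -n 's/.*"VtepMAC":"\([0-9A-Fa-f:]*\)".*/\1/p'
}

# Usage (node name is a placeholder):
#   kubectl get node <node> \
#     -o jsonpath='{.metadata.annotations.flannel\.alpha\.coreos\.com/backend-data}' | vtep_mac
#   cat /sys/class/net/flannel.1/address    # run on that node; should match
```

If the two values differ, something changed the interface MAC after flannel published it, which is consistent with the race suspected here.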

@brandond
Member

brandond commented Oct 15, 2021

Is this by any chance a mac address randomization issue? Sounds kind of like systemd/systemd#13642 - this is for wifi interfaces but I wonder if for some reason it's doing the same thing to the vxlan interface.

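Perhaps a more relevant link: flannel-io/flannel#1155 (flannel cross node traffic does not work with latest systemd 242 due to a race)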

@manuelbuil
Contributor

> Is this by any chance a mac address randomization issue? Sounds kind of like systemd/systemd#13642 - this is for wifi interfaces but I wonder if for some reason it's doing the same thing to the vxlan interface.
>
> Perhaps a more relevant link:
>
> * [flannel cross node traffic does not work with latest systemd 242 due to a race flannel-io/flannel#1155](https://github.com/flannel-io/flannel/issues/1155)

I can confirm that's the problem

vadorovsky added a commit to vadorovsky/flannel that referenced this issue Oct 15, 2021
systemd 242+ assigns MAC addresses for all virtual devices which don't
have the address assigned already. That resulted in systemd overriding
MAC addresses of flannel.* interfaces. The fix which prevents systemd
from setting the address is to define the concrete MAC address when
creating the link.

Fixes: flannel-io#1155
Ref: k3s-io/k3s#4188
Signed-off-by: Michal Rostecki <[email protected]>
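
Until a fixed flannel is available, a node-side workaround is to tell systemd not to assign MAC addresses to flannel interfaces by using a `.link` file with `MACAddressPolicy=none`. This is a sketch, not part of the commit above: the function name and the file name `10-flannel.link` are assumptions chosen for illustration.

```shell
# write_flannel_link [DIR]
# Writes a systemd .link file that disables MAC assignment for interfaces
# whose original name starts with "flannel". DIR defaults to the usual
# location for administrator-provided .link files.
write_flannel_link() {
  dir=${1:-/etc/systemd/network}
  mkdir -p "$dir" &&
  cat > "$dir/10-flannel.link" <<'EOF'
[Match]
OriginalName=flannel*

[Link]
MACAddressPolicy=none
EOF
}

# Usage (as root): write_flannel_link
```

The file is only consulted when an interface is created, so an existing `flannel.1` would need to be recreated (for example by rebooting the node) before the policy applies.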
@vadorovsky
Contributor

Based on what @brandond and @manuelbuil wrote above, I consider this a flannel bug. Here is an attempt to fix it:

flannel-io/flannel#1485
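
The idea behind the fix, choosing a concrete MAC before the link exists so that systemd has nothing to assign, can be illustrated from the shell. This is a sketch of the technique only, not the code in the pull request; the function name `random_mac` is made up, and the `ip link` parameters in the usage comment (VNI 1, UDP port 8472) are flannel's usual vxlan defaults, stated here as an assumption.

```shell
# random_mac
# Prints a random unicast, locally administered MAC address.
random_mac() {
  # Six random bytes as unsigned decimals, split into positional parameters.
  set -- $(od -An -N6 -tu1 /dev/urandom)
  # Clear the multicast bit (0x01) and set the locally administered bit (0x02).
  first=$(( ($1 & 0xfe) | 0x02 ))
  printf '%02x:%02x:%02x:%02x:%02x:%02x\n' "$first" "$2" "$3" "$4" "$5" "$6"
}

# Usage (as root; creates a vxlan device with an explicit address):
#   ip link add flannel.1 address "$(random_mac)" type vxlan id 1 dstport 8472 nolearning
```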

manuelbuil pushed a commit to manuelbuil/flannel that referenced this issue Oct 21, 2021
brandond removed this from the v1.22.4+k3s1 milestone Oct 21, 2021
@Brice187

Because #3863 linked to this issue:

On Debian Bullseye iptables is set to nftables so I had to change it:

$ update-alternatives --set iptables /usr/sbin/iptables-legacy
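
Before switching, it can help to see which variant the `iptables` command currently points at. Since iptables 1.8 the version string names the backend in parentheses. A minimal sketch; the helper name `iptables_mode` is hypothetical.

```shell
# iptables_mode VERSION_STRING
# Classifies the output of `iptables --version` as nft, legacy, or unknown
# (older releases print no backend at all).
iptables_mode() {
  case "$1" in
    *nf_tables*) echo nft ;;
    *legacy*)    echo legacy ;;
    *)           echo unknown ;;
  esac
}

# Usage:
#   iptables_mode "$(iptables --version)"
```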

@manuelbuil
Contributor

> Because #3863 linked to this issue:
>
> On Debian Bullseye iptables is set to nftables so I had to change it:
>
> $ update-alternatives --set iptables /usr/sbin/iptables-legacy

Strange, I thought Debian 11 would be already using nftables. Can you confirm it does not?

@Brice187

Brice187 commented Oct 29, 2021

As I said, Debian 11 uses nftables and I had to set it to legacy to work properly in my k3s clusters

@siretart

This is kind of documented here: https://rancher.com/docs/k3s/latest/en/advanced/#enabling-legacy-iptables-on-raspbian-buster

I'd argue that this recommendation should be moved from "Advanced Options and Configurations" to "FAQ" or "Known Issues", and expanded to say that it applies to all modern versions of "Raspbian", "Ubuntu" and "Debian".

@ajvn

ajvn commented Oct 31, 2021

I'm also getting hit by this when trying to deploy some workload that tries to reach public registry. In this specific case, this is the error I'm encountering when trying to deploy pihole:

failed to resolve reference "docker.io/pihole/pihole:2021.10.1": failed to do request: Head "https://registry-1.docker.io/v2/pihole/pihole/manifests/2021.10.1": dial tcp: lookup registry-1.docker.io: Try again

It also breaks DNS across the cluster, here's output from one of the nodes before/after the deployment:

64 bytes from wd-in-f138.1e100.net (172.253.120.138): icmp_seq=10 ttl=107 time=23.8 ms
64 bytes from wd-in-f138.1e100.net (172.253.120.138): icmp_seq=11 ttl=107 time=23.3 ms
64 bytes from wd-in-f138.1e100.net (172.253.120.138): icmp_seq=12 ttl=107 time=23.6 ms
64 bytes from wd-in-f138.1e100.net (172.253.120.138): icmp_seq=13 ttl=107 time=33.4 ms
64 bytes from wd-in-f138.1e100.net (172.253.120.138): icmp_seq=14 ttl=107 time=24.1 ms
64 bytes from wd-in-f138.1e100.net (172.253.120.138): icmp_seq=15 ttl=107 time=24.3 ms
64 bytes from 172.253.120.138: icmp_seq=16 ttl=107 time=31.5 ms
64 bytes from 172.253.120.138: icmp_seq=17 ttl=107 time=25.8 ms
64 bytes from 172.253.120.138: icmp_seq=18 ttl=107 time=26.9 ms
64 bytes from 172.253.120.138: icmp_seq=19 ttl=107 time=24.3 ms
64 bytes from 172.253.120.138: icmp_seq=20 ttl=107 time=26.0 ms

If I stop that and retry pinging google, name resolution fails.

As soon as I remove that deployment, DNS resolution starts working again.
This happens with latest k3s version v1.21.6-rc2+k3s1 (254d2f69) which, at least from what I managed to find, uses updated version of flannel, 0.15.1.

Please let me know if I can provide any more output that could help with solving this.

Thank you.

@manuelbuil
Contributor

> As I said, Debian 11 uses nftables and I had to set it to legacy to work properly in my k3s clusters

So, even though Debian 11 uses nftables as default, k3s does not work properly with nftables and thus you must change iptables to legacy in order for k3s to work? If that's the case, could you please open a different issue? Thanks! And sorry for not understanding the issue :(

@manuelbuil
Contributor

> This is kind of documented here: https://rancher.com/docs/k3s/latest/en/advanced/#enabling-legacy-iptables-on-raspbian-buster
>
> I'd argue that this recommendation should be moved from "Advanced Options and Configurations" to "FAQ" or "Known Issues", and expanded to say that it applies to all modern versions of "Raspbian", "Ubuntu" and "Debian".

Thanks for this

@Brice187

Brice187 commented Nov 2, 2021

> So, even though Debian 11 uses nftables as default, k3s does not work properly with nftables and thus you must change iptables to legacy in order for k3s to work? If that's the case, could you please open a different issue? Thanks! And sorry for not understanding the issue :(

The workaround with iptables runs pretty well, so this is not an issue worth mentioning (for me).

@manuelbuil
Contributor

> I'm also getting hit by this when trying to deploy some workload that tries to reach public registry. In this specific case, this is the error I'm encountering when trying to deploy pihole:
>
> failed to resolve reference "docker.io/pihole/pihole:2021.10.1": failed to do request: Head "https://registry-1.docker.io/v2/pihole/pihole/manifests/2021.10.1": dial tcp: lookup registry-1.docker.io: Try again
>
> It also breaks DNS across the cluster, here's output from one of the nodes before/after the deployment:
>
> 64 bytes from wd-in-f138.1e100.net (172.253.120.138): icmp_seq=10 ttl=107 time=23.8 ms
> 64 bytes from wd-in-f138.1e100.net (172.253.120.138): icmp_seq=11 ttl=107 time=23.3 ms
> 64 bytes from wd-in-f138.1e100.net (172.253.120.138): icmp_seq=12 ttl=107 time=23.6 ms
> 64 bytes from wd-in-f138.1e100.net (172.253.120.138): icmp_seq=13 ttl=107 time=33.4 ms
> 64 bytes from wd-in-f138.1e100.net (172.253.120.138): icmp_seq=14 ttl=107 time=24.1 ms
> 64 bytes from wd-in-f138.1e100.net (172.253.120.138): icmp_seq=15 ttl=107 time=24.3 ms
> 64 bytes from 172.253.120.138: icmp_seq=16 ttl=107 time=31.5 ms
> 64 bytes from 172.253.120.138: icmp_seq=17 ttl=107 time=25.8 ms
> 64 bytes from 172.253.120.138: icmp_seq=18 ttl=107 time=26.9 ms
> 64 bytes from 172.253.120.138: icmp_seq=19 ttl=107 time=24.3 ms
> 64 bytes from 172.253.120.138: icmp_seq=20 ttl=107 time=26.0 ms
>
> If I stop that, and retry pinging google, it will fail to resolve name resolution.
>
> As soon as I remove that deployment, DNS resolution starts working again. This happens with latest k3s version v1.21.6-rc2+k3s1 (254d2f69) which, at least from what I managed to find, uses updated version of flannel, 0.15.1.
>
> Please let me know if I can provide any more output that could help with solving this.
>
> Thank you.

Could you open a different issue for this please?

@mdrahman-suse

Validated the fix with k3s master commit: 86c6924 and performed same steps as #4259 (comment)

@clemenko
Author

clemenko commented Nov 3, 2021

Commented on #4259

This is still broken for me.
Deploying with https://github.com/clemenko/k3s/blob/master/k3s.sh#L173

root@k3s-831b:~# systemd --version
systemd 247 (247.3-3ubuntu3)
+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +ZSTD +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=hybrid

root@k3s-831b:~# cat /etc/os-release
NAME="Ubuntu"
VERSION="21.04 (Hirsute Hippo)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 21.04"
VERSION_ID="21.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=hirsute
UBUNTU_CODENAME=hirsute

root@k3s-831b:~# uname -a
Linux k3s-831b 5.11.0-18-generic #19-Ubuntu SMP Fri May 7 14:22:03 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Then deployed the noted yaml from above.

kube:

clemair:clemenko k3s ( 167.99.124.208:6443 ) $ kubectl get nodes,pods -o wide
NAME            STATUS   ROLES                       AGE     VERSION        INTERNAL-IP      EXTERNAL-IP   OS-IMAGE       KERNEL-VERSION      CONTAINER-RUNTIME
node/k3s-831b   Ready    control-plane,etcd,master   9m47s   v1.21.5+k3s2   167.99.124.208   <none>        Ubuntu 21.04   5.11.0-18-generic   containerd://1.4.11-k3s1
node/k3s-a95e   Ready    <none>                      9m21s   v1.21.5+k3s2   167.99.116.17    <none>        Ubuntu 21.04   5.11.0-18-generic   containerd://1.4.11-k3s1
node/k3s-bda2   Ready    <none>                      9m15s   v1.21.5+k3s2   142.93.69.32     <none>        Ubuntu 21.04   5.11.0-18-generic   containerd://1.4.11-k3s1

NAME                         READY   STATUS    RESTARTS   AGE    IP          NODE       NOMINATED NODE   READINESS GATES
pod/othertest-deploy-2glbg   1/1     Running   0          7m4s   10.42.1.7   k3s-a95e   <none>           <none>
pod/othertest-deploy-bdrbq   1/1     Running   0          7m4s   10.42.2.7   k3s-bda2   <none>           <none>
pod/othertest-deploy-blnnd   1/1     Running   0          7m4s   10.42.0.9   k3s-831b   <none>           <none>

and the pings

clemair:clemenko k3s ( 167.99.124.208:6443 ) $ kubectl exec -it othertest-deploy-2glbg -- bash
nginx@othertest-deploy-2glbg:/$ ping -c 1 -t 1 10.42.2.7
PING 10.42.2.7 (10.42.2.7) 56(84) bytes of data.
From 10.42.1.1 icmp_seq=1 Time to live exceeded

--- 10.42.2.7 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

nginx@othertest-deploy-2glbg:/$ ping -c 1 -t 1 10.42.0.9
PING 10.42.0.9 (10.42.0.9) 56(84) bytes of data.
From 10.42.1.1 icmp_seq=1 Time to live exceeded

--- 10.42.0.9 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
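
A note on reading this output: with the iputils `ping` found on most Linux images (which the output format here suggests), `-t` sets the IP TTL rather than a timeout, and the timeout flag is `-W`. With `-t 1` the packet expires at the first hop, the pod's gateway 10.42.1.1, which is what "Time to live exceeded" reports, so this particular result would look the same on a healthy overlay. The sketch below separates the two failure modes; the helper name `classify_ping` is made up for illustration.

```shell
# classify_ping
# Reads ping output on stdin and prints one of:
#   ttl-exceeded  a router on the path dropped the probe because TTL hit 0
#   no-reply      probes were sent but nothing came back
#   ok            at least one reply was received
#   unknown       the output did not look like a ping summary
classify_ping() {
  out=$(cat)
  case "$out" in
    *"Time to live exceeded"*) echo ttl-exceeded ;;
    *" 0 received"*)           echo no-reply ;;
    *" received"*)             echo ok ;;
    *)                         echo unknown ;;
  esac
}

# Usage inside the pod, with a one-second timeout instead of TTL 1:
#   ping -c 1 -W 1 10.42.2.7 | classify_ping
```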

One of the things I noticed is that you are NOT using the upstream kernel. You are using the aws compiled one. Wonder if that has a fix. I am on DigitalOcean. This problem still applies to 21.10 as well.

@brandond
Member

brandond commented Nov 4, 2021

QA marked this as fixed in the version that we're about to release, not the version that you're currently using. Please try again once we actually release the fixed version.

@clemenko
Author

clemenko commented Nov 8, 2021

v1.22.3+k3s1 works!

@ajvn

ajvn commented Nov 13, 2021

> > I'm also getting hit by this when trying to deploy some workload that tries to reach public registry. In this specific case, this is the error I'm encountering when trying to deploy pihole:
> >
> > failed to resolve reference "docker.io/pihole/pihole:2021.10.1": failed to do request: Head "https://registry-1.docker.io/v2/pihole/pihole/manifests/2021.10.1": dial tcp: lookup registry-1.docker.io: Try again
> >
> > It also breaks DNS across the cluster, here's output from one of the nodes before/after the deployment:
> >
> > 64 bytes from wd-in-f138.1e100.net (172.253.120.138): icmp_seq=10 ttl=107 time=23.8 ms
> > 64 bytes from wd-in-f138.1e100.net (172.253.120.138): icmp_seq=11 ttl=107 time=23.3 ms
> > 64 bytes from wd-in-f138.1e100.net (172.253.120.138): icmp_seq=12 ttl=107 time=23.6 ms
> > 64 bytes from wd-in-f138.1e100.net (172.253.120.138): icmp_seq=13 ttl=107 time=33.4 ms
> > 64 bytes from wd-in-f138.1e100.net (172.253.120.138): icmp_seq=14 ttl=107 time=24.1 ms
> > 64 bytes from wd-in-f138.1e100.net (172.253.120.138): icmp_seq=15 ttl=107 time=24.3 ms
> > 64 bytes from 172.253.120.138: icmp_seq=16 ttl=107 time=31.5 ms
> > 64 bytes from 172.253.120.138: icmp_seq=17 ttl=107 time=25.8 ms
> > 64 bytes from 172.253.120.138: icmp_seq=18 ttl=107 time=26.9 ms
> > 64 bytes from 172.253.120.138: icmp_seq=19 ttl=107 time=24.3 ms
> > 64 bytes from 172.253.120.138: icmp_seq=20 ttl=107 time=26.0 ms
> >
> > If I stop that, and retry pinging google, it will fail to resolve name resolution.
> > As soon as I remove that deployment, DNS resolution starts working again. This happens with latest k3s version v1.21.6-rc2+k3s1 (254d2f69) which, at least from what I managed to find, uses updated version of flannel, 0.15.1.
> > Please let me know if I can provide any more output that could help with solving this.
> > Thank you.
>
> Could you open a different issue for this please?

Done #4486

Thank you.

@0xD3

0xD3 commented Nov 30, 2021

For anyone who stumbles on this issue while running the latest version of Ubuntu (21.10) on a Raspberry Pi: the vxlan kernel module was moved by upstream to a separate package, linux-modules-extra-raspi. Installing it should solve the issue.
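
A quick way to tell whether the module is present is to look for it under the running kernel's module tree. A minimal sketch; the function name `has_vxlan_module` is hypothetical, and it will report a miss on kernels that have vxlan built in rather than shipped as a module.

```shell
# has_vxlan_module [MODULE_DIR]
# Succeeds if a vxlan kernel module file (vxlan.ko, possibly compressed)
# exists under MODULE_DIR, which defaults to the running kernel's modules.
has_vxlan_module() {
  moddir=${1:-/lib/modules/$(uname -r)}
  [ -n "$(find "$moddir" -name 'vxlan.ko*' 2>/dev/null | head -n 1)" ]
}

# Usage:
#   has_vxlan_module || sudo apt install linux-modules-extra-raspi
```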

knisbet pushed a commit to gravitational/flannel that referenced this issue Jan 25, 2022
systemd 242+ assigns MAC addresses for all virtual devices which don't
have the address assigned already. That resulted in systemd overriding
MAC addresses of flannel.* interfaces. The fix which prevents systemd
from setting the address is to define the concrete MAC address when
creating the link.

Fixes: flannel-io#1155
Ref: k3s-io/k3s#4188
Signed-off-by: Michal Rostecki <[email protected]>
(cherry picked from commit 0198d5d)
knisbet pushed a commit to gravitational/flannel that referenced this issue Jan 25, 2022
* vxlan: Generate MAC address before creating a link

systemd 242+ assigns MAC addresses for all virtual devices which don't
have the address assigned already. That resulted in systemd overriding
MAC addresses of flannel.* interfaces. The fix which prevents systemd
from setting the address is to define the concrete MAC address when
creating the link.

Fixes: flannel-io#1155
Ref: k3s-io/k3s#4188
Signed-off-by: Michal Rostecki <[email protected]>
(cherry picked from commit 0198d5d)

* Concern only about flannel ip addresses

Currently flannel interface ip addresses are checked on startup when
using vxlan and ipip backends. If multiple addresses are found, startup
fails fatally. If only one address is found and is not the currently
leased one, it will be assumed that it comes from a previous lease and
be removed.

This criteria seems arbitrary both in how it is done and in its timing.
It may cause failures in situations where it might not be strictly
necessary like for example if the node is running a dhcp client that is
assigning link local addresses to all interfaces. It also might fail at
flannel unexpected restarts which are completely unrelated to
the external event that caused the unexpected modification in the
flannel interface.

This patch proposes to concern and check only ip address within the
flannel network and takes the simple approach to ignore any other ip
addresses assuming these would pose no problem on flannel operation.

A discarded but more aggressive alternative would be to remove all
addresses that are not the currently leased one.

Fixes flannel-io#1060

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
(cherry picked from commit 33a2fac)

* Fix flannel hang if lease expired

(cherry picked from commit 78035d0)

* subnets: move forward the cursor to skip illegal subnet

This PR fixes an issue where, when flannel gets an illegal subnet event
while watching leases, it doesn't move the etcd cursor forward and
gets stuck on the same invalid event forever.

(cherry picked from commit 1a1b6f1)

* fix cherry-pick glitches and test failures

* disable udp backend tests since we don't actually have the udp backend in our fork

Co-authored-by: Michal Rostecki <[email protected]>
Co-authored-by: Jaime Caamaño Ruiz <[email protected]>
Co-authored-by: Chun Chen <[email protected]>
Co-authored-by: huangxuesen <[email protected]>