What happened:
We're testing a Fedora CoreOS upgrade (from 33.20210426.3.0 to 34.20210529.3.0) on a test k8s cluster (non-EKS), and some pods are stuck in CrashLoopBackOff when the machine first boots because the route from the host to the pod hasn't been set.
For instance, for the pod with IP 10.102.128.4, the host routing table looks like this:
# ip route show table main
default via 10.102.128.1 dev ens5 proto dhcp metric 100
default via 10.102.128.1 dev ens6 proto dhcp metric 102
10.102.128.0/18 dev ens5 proto kernel scope link src 10.102.168.153 metric 100
10.102.128.0/18 dev ens6 proto kernel scope link src 10.102.154.115 metric 102
10.102.134.191 dev eniec4c2d67f18 scope link
10.102.163.181 dev eni3408eb5d67f scope link
10.102.165.80 dev eni4cbf85aa67f scope link
10.102.169.89 dev enife9c45d505d scope link
10.102.187.132 dev eni50cdd2eb92b scope link
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
There's no entry for 10.102.128.4.
Note that killing the pod is sometimes enough to make the network work again.
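Deleting the pod forces a fresh CNI ADD, which sets the host route up again; for example, for the affected pod (name taken from the plugin logs below):
# kubectl delete pod -n gatekeeper-system gatekeeper-audit-84964f86f-r9bqv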
Note that the plugin doesn't report any errors when setting the route:
{"level":"info","ts":"2021-06-17T13:02:20.236Z","caller":"routed-eni-cni-plugin/cni.go:117","msg":"Received CNI add request: ContainerID(7e63d95cb137cd122a3fcebf04e1d2a25f9f65e4dcf05c9fb3bc23066f05d76c) Netns(/proc/9747/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=gatekeeper-system;K8S_POD_NAME=gatekeeper-audit-84964f86f-r9bqv;K8S_POD_INFRA_CONTAINER_ID=7e63d95cb137cd122a3fcebf04e1d2a25f9f65e4dcf05c9fb3bc23066f05d76c) Path(/opt/cni/bin) argsStdinData({\"cniVersion\":\"0.3.1\",\"mtu\":\"9001\",\"name\":\"aws-cni\",\"pluginLogFile\":\"/var/log/aws-routed-eni/plugin.log\",\"pluginLogLevel\":\"DEBUG\",\"type\":\"aws-cni\",\"vethPrefix\":\"eni\"})"}
{"level":"debug","ts":"2021-06-17T13:02:20.236Z","caller":"routed-eni-cni-plugin/cni.go:117","msg":"MTU value set is 9001:"}
{"level":"info","ts":"2021-06-17T13:02:20.245Z","caller":"routed-eni-cni-plugin/cni.go:117","msg":"Received add network response for container 7e63d95cb137cd122a3fcebf04e1d2a25f9f65e4dcf05c9fb3bc23066f05d76c interface eth0: Success:true IPv4Addr:\"10.102.128.4\" UseExternalSNAT:true VPCcidrs:\"10.102.0.0/16\" "}
{"level":"debug","ts":"2021-06-17T13:02:20.245Z","caller":"routed-eni-cni-plugin/cni.go:194","msg":"SetupNS: hostVethName=eni1abcefcdbba, contVethName=eth0, netnsPath=/proc/9747/ns/net, deviceNumber=0, mtu=9001"}
{"level":"debug","ts":"2021-06-17T13:02:20.253Z","caller":"driver/driver.go:184","msg":"setupVeth network: disabled IPv6 RA and ICMP redirects on eni1abcefcdbba"}
{"level":"debug","ts":"2021-06-17T13:02:20.254Z","caller":"driver/driver.go:178","msg":"Setup host route outgoing hostVeth, LinkIndex 17"}
{"level":"debug","ts":"2021-06-17T13:02:20.254Z","caller":"driver/driver.go:178","msg":"Successfully set host route to be 10.102.128.4/0"}
{"level":"info","ts":"2021-06-17T13:02:20.254Z","caller":"driver/driver.go:178","msg":"Added toContainer rule for 10.102.128.4/32"}
In the past, we had an issue that looked similar, where systemd was changing the MAC address of the eni interfaces behind aws-cni's back, but this doesn't look like exactly the same issue.
I'm currently at a loss on how to troubleshoot this issue; can anyone offer some help?
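One thing we still want to try is watching route changes and NetworkManager activity while a pod starts, to see whether something on the host removes the route right after the plugin adds it:
# ip monitor route &
# journalctl -fu NetworkManager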
Update: after more troubleshooting, this looks like a race condition between NetworkManager and the aws-cni plugin.
We excluded the eni* interfaces from NetworkManager, and so far the node no longer exhibits this issue. We need to run more tests to validate this workaround.
Here's the config file /etc/NetworkManager/conf.d/aws-cni.conf (a minimal keyfile snippet that just marks the eni* veth devices as unmanaged):
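[keyfile]
# Leave the aws-cni host-side veth devices alone (our CNI config uses
# vethPrefix "eni"), so NetworkManager can't reconfigure them and drop
# the per-pod routes.
unmanaged-devices=interface-name:eni*
After reloading or restarting NetworkManager (e.g. nmcli general reload), nmcli device status should list the eni* devices as unmanaged.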
Attach logs
eks_i-0fc36fa426a34ba90_2021-06-17_1558-UTC_0.6.2.tar.gz
What you expected to happen:
I expected pod networking to keep working as it did on the previous version.
How to reproduce it (as minimally and precisely as possible):
Create a k8s cluster with Fedora CoreOS 34 nodes and aws-cni, then check the host routing table for newly started pods right after the machine boots.
Anything else we need to know?:
We still need to test with an older kernel and/or systemd combination to check which one introduces this issue.
Environment:
Kubernetes version (kubectl version): 1.18.6
OS (cat /etc/os-release): Fedora CoreOS 34.20210529.3.0
Kernel (uname -a): Linux ip-10-102-168-153 5.12.7-300.fc34.x86_64 #1 SMP Wed May 26 12:58:58 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux