
Traffic does not go through routes on Fedora 33 #1614

Closed
rma945 opened this issue Sep 13, 2021 · 9 comments
@rma945

rma945 commented Sep 13, 2021

What happened:
I built a new AMI based on the community Fedora 33 1.2 AMI with the bootstrap scripts from this repository - https://github.com/awslabs/amazon-eks-ami - and everything works fine except aws-vpc-cni. I checked the nodes and found that pods are created successfully, the elastic network interfaces are allocated, and the routing tables are created, but the pods cannot ping or open a TCP connection to any external or local IP. However, when I switch the CNI plugin to Calico, pods can reach any IP. A second worker node, based on the AWS EKS AMI, works fine; the problem occurs only on the Fedora-based worker. I also tried switching the container runtime from Docker to plain containerd, but that did not help - aws-vpc-cni still does not work.

What you expected to happen:
Pods should be able to connect to local and external services.

How to reproduce it (as minimally and precisely as possible):
Take the Fedora 33 1.2 AMI, join it to the EKS cluster, and add the aws-vpc-cni addon.

Attached logs:
eks_i-0cca70aab4bdd2bc1_2021-09-13_0719-UTC_0.6.2.tar.gz

Anything else we need to know?:
Environment:

  • Kubernetes version: 1.21
  • CNI Version: v1.9.0-eksbuild.1
  • OS: Fedora 33 (Cloud Edition)
  • Kernel: 5.8.15-301.fc33.x86_64
@rma945 rma945 added the bug label Sep 13, 2021
@jayanthvn
Contributor

Hi @rma945

Can you capture a tcpdump on one of the nodes to verify whether there is an issue with the iptables rules or ip rules - say, when you ping from pod-a to pod-b?

1. Install tcpdump on the node.
2. Start capturing traffic on eth0 of the node (assuming eth0 is the ENI for the pods):
   tcpdump -i eth0 -w node_a_eth0.pcap
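The capture steps above can be sketched as a tiny helper that builds the tcpdump invocation for each node (a sketch only; `eth0` and the node labels are assumptions - confirm with `ip addr` which ENI actually hosts the pod IPs, and run the printed command as root):

```shell
#!/bin/sh
# Build (but do not run) the tcpdump command for a given interface and
# node label, so the same naming scheme is used on both capture nodes.
build_capture_cmd() {
  # $1 = interface, $2 = node label
  printf 'tcpdump -i %s -w %s_%s.pcap\n' "$1" "$2" "$1"
}

build_capture_cmd eth0 node_a   # -> tcpdump -i eth0 -w node_a_eth0.pcap
build_capture_cmd eth0 node_b   # -> tcpdump -i eth0 -w node_b_eth0.pcap
```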

You can attach the pcap to the issue.

Also, the logs attached above don't seem to download, so can you reattach them?

@rma945
Author

rma945 commented Sep 14, 2021

Here is the PCAP file, and I have also re-uploaded the debug logs:
node_a_eth0.zip

@jayanthvn
Contributor

Can you confirm the source and destination pod IPs, and also whether you used ping or curl?

@rma945
Author

rma945 commented Sep 16, 2021

Yes, no address except 127.0.0.1 can be reached. I have tried ping and curl against different IP addresses (pods, the internal kube API, external services).

@jayanthvn
Contributor

Can you please check if you are hitting this issue - #1600 (comment)
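(For reference - and this is an assumption on my part, not quoted from #1600 - a common way to stop NetworkManager from managing the CNI's veth interfaces is a keyfile drop-in; the `eni*` glob matches the `vethPrefix` shown in the CNI config later in this thread:)

```ini
# /etc/NetworkManager/conf.d/99-eni-unmanaged.conf  (illustrative filename)
[keyfile]
unmanaged-devices=interface-name:eni*
```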

@RomanCherednikovAZ

Yes, I have already tried disabling the NetworkManager routing rules as suggested in that issue, but it does not help. Also, in my case it looks like the routes are added properly, but routing is blocked for some reason:

[root@ip-172-24-67-130 ~]# ip rule list
0:      from all lookup local
512:    from all to 172.24.67.25 lookup main  <- pod cni ip
1024:   from all fwmark 0x80/0x80 lookup main
32766:  from all lookup main
32767:  from all lookup default
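One observation on the listing above: it contains only a "to <pod-ip>" rule. For pods on secondary ENIs, the CNI normally also installs a "from <pod-ip> lookup <eni-table>" rule. A quick illustrative check over `ip rule list` output (the helper is not CNI code, just a grep; the sample rules are copied from this thread):

```shell
#!/bin/sh
# Check whether `ip rule list` output (read from stdin) contains a
# source-based ("from <ip>") rule for the given pod IP.
has_from_rule() {
  grep -qF "from $1 " -
}

rules='0:      from all lookup local
512:    from all to 172.24.67.25 lookup main
1024:   from all fwmark 0x80/0x80 lookup main
32766:  from all lookup main
32767:  from all lookup default'

if printf '%s\n' "$rules" | has_from_rule 172.24.67.25; then
  echo "from-rule present"
else
  echo "from-rule missing"   # this is what the pasted rule set produces
fi
```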

@jayanthvn
Contributor

jayanthvn commented Sep 24, 2021

Thanks for checking @RomanCherednikovAZ. Sorry, I meant in the pcap file attached - can you please confirm the source and destination IPs? If you have some bandwidth, can you also capture a tcpdump on the destination side? If we have tcpdumps from both the source and destination nodes, we can correlate where the drop is happening.

Also, I see the CNI version in your logs is 1.7.10, so can you please confirm the CNI version?

Sep 13 07:13:50 ip-172-24-68-126.definiens.local kubelet[1099]: {"level":"info","ts":"2021-09-13T07:13:50.020Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"CNI Plugin version: v1.7.10 ..."}

I also see this error in the kubelet log -

Sep 13 07:12:04 ip-172-24-68-126.definiens.local kubelet[1099]: I0913 07:12:04.930400    1099 cni.go:204] "Error validating CNI config list" configList="{\n  \"cniVersion\": \"0.3.1\",\n  \"name\": \"aws-cni\",\n  \"plugins\": [\n    {\n      \"name\": \"aws-cni\",\n      \"type\": \"aws-cni\",\n      \"vethPrefix\": \"eni\",\n      \"mtu\": \"9001\",\n      \"pluginLogFile\": \"/var/log/aws-routed-eni/plugin.log\",\n      \"pluginLogLevel\": \"DEBUG\"\n    },\n    {\n      \"type\": \"portmap\",\n      \"capabilities\": {\"portMappings\": true},\n      \"snat\": true\n    }\n  ]\n}" err="[failed to find plugin \"aws-cni\" in path [/opt/cni/bin]]"

@achevuru
Contributor

@RomanCherednikovAZ Any update w.r.t. the above logs? It appears that aws-node failed to copy the CNI binary to /opt/cni/bin. Can you check whether there are any permission issues? You should be able to exec into the aws-node pod and try it manually to see if it succeeds.

https://github.com/aws/amazon-vpc-cni-k8s/blob/master/scripts/entrypoint.sh#L149
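A minimal way to check for the missing-binary/permission problem suggested above (a sketch; the path is the one from the kubelet error earlier in the thread):

```shell
#!/bin/sh
# Report whether the aws-cni plugin binary exists and is executable
# in the given CNI bin directory.
check_cni_bin() {
  # $1 = CNI bin directory (normally /opt/cni/bin)
  if [ -x "$1/aws-cni" ]; then
    echo "aws-cni present and executable"
  else
    echo "aws-cni missing or not executable in $1"
  fi
}

check_cni_bin /opt/cni/bin
```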

@rma945
Author

rma945 commented Oct 15, 2021

Sorry for the confusion - rma945 and RomanCherednikovAZ are the same person. Yes, I checked that the CNI binaries were successfully installed on the node, and the node's network state changed to Ready. So the CNI itself works fine, but it looks like there was some issue with routing on the node. In any case, at this point we have migrated our worker nodes back to the AWS EKS AMI.

@rma945 rma945 closed this as completed Oct 15, 2021

4 participants