aws-vpc-cni-init 1.7.3 Init:CrashLoopBackOff due to sysctl: cannot stat /proc/sys/net/ipv4/tcp_early_demux: No such file or directory #1241

Closed
mickael-ange opened this issue Oct 1, 2020 · 10 comments

@mickael-ange

What happened:

I'm trying to upgrade aws-vpc-cni to 1.7.3 on an AWS EKS cluster version 1.17 with CentOS 7 self-managed nodes. My goal is to use security groups for pods.

However, the aws-vpc-cni-init v1.7.3 init container goes into Init:CrashLoopBackOff due to sysctl: cannot stat /proc/sys/net/ipv4/tcp_early_demux: No such file or directory.

kubectl logs -n kube-system aws-node-zr8hs aws-vpc-cni-init
+ PLUGIN_BINS='loopback portmap bandwidth aws-cni-support.sh'
+ for b in '$PLUGIN_BINS'
+ '[' '!' -f loopback ']'
+ for b in '$PLUGIN_BINS'
+ '[' '!' -f portmap ']'
+ for b in '$PLUGIN_BINS'
+ '[' '!' -f bandwidth ']'
+ for b in '$PLUGIN_BINS'
+ '[' '!' -f aws-cni-support.sh ']'
+ HOST_CNI_BIN_PATH=/host/opt/cni/bin
+ echo 'Copying CNI plugin binaries ... '
Copying CNI plugin binaries ... 
+ for b in '$PLUGIN_BINS'
+ install loopback /host/opt/cni/bin
+ for b in '$PLUGIN_BINS'
+ install portmap /host/opt/cni/bin
+ for b in '$PLUGIN_BINS'
+ install bandwidth /host/opt/cni/bin
+ for b in '$PLUGIN_BINS'
+ install aws-cni-support.sh /host/opt/cni/bin
Configure rp_filter loose... 
+ echo 'Configure rp_filter loose... '
++ curl -X PUT http://169.254.169.254/latest/api/token -H 'X-aws-ec2-metadata-token-ttl-seconds: 60'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    56  100    56    0     0  56000      0 --:--:-- --:--:-- --:--:-- 56000
+ TOKEN=AQAEAFv0KS4wZjPJ8QABcGgRhwPHRLtDyNkZoqYeikKfRc465KwhJA==
++ curl -H 'X-aws-ec2-metadata-token: AQAEAFv0KS4wZjPJ8QABcGgRhwPHRLtDyNkZoqYeikKfRc465KwhJA==' http://169.254.169.254/latest/meta-data/local-ipv4
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    11  100    11    0     0  11000      0 --:--:-- --:--:-- --:--:-- 11000
+ HOST_IP=172.17.9.88
++ ip -4 -o a
++ grep 172.17.9.88
++ awk '{print $2}'
+ PRIMARY_IF=ens5
+ sysctl -w net.ipv4.conf.ens5.rp_filter=2
net.ipv4.conf.ens5.rp_filter = 2
+ cat /proc/sys/net/ipv4/conf/ens5/rp_filter
2
+ '[' false == true ']'
+ sysctl -w net.ipv4.tcp_early_demux=1
sysctl: cannot stat /proc/sys/net/ipv4/tcp_early_demux: No such file or directory

Indeed, if I log into my worker node I get the same error with this command:

sysctl -w "net.ipv4.tcp_early_demux=1"
sysctl: cannot stat /proc/sys/net/ipv4/tcp_early_demux: No such file or directory
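
For reference, a quick way to check whether a given kernel exposes this key at all (a minimal sketch):

# Nothing is printed for this key on the 3.10 kernel below.
sysctl -a 2>/dev/null | grep tcp_early_demux || echo "no net.ipv4.tcp_early_demux on kernel $(uname -r)"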

Has anyone hit this issue?

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.0", GitCommit:"70132b0f130acc0bed193d9ba59dd186f0e634cf", GitTreeState:"clean", BuildDate:"2019-12-07T21:20:10Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17+", GitVersion:"v1.17.9-eks-4c6976", GitCommit:"4c6976793196d70bc5cd29d56ce5440c9473648e", GitTreeState:"clean", BuildDate:"2020-07-17T18:46:04Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
  • CNI Version: 1.7.3
  • OS (e.g: cat /etc/os-release):
cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
  • Kernel (e.g. uname -a):
 uname -a
Linux ip-172-17-9-88.ap-northeast-1.compute.internal 3.10.0-1127.18.2.el7.x86_64 #1 SMP Sun Jul 26 15:27:06 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
@mogren
Contributor

mogren commented Oct 1, 2020

@mickael-ange The v1.7.2 release does not have the tcp_early_demux change, but everything else is the same. Could you try with that version?

3.10 is a very old kernel; other things like --random-fully for conntrack will not work either.

@SaranBalaji90
Contributor

SaranBalaji90 commented Oct 1, 2020

[centos@ip-192-168-9-194 ~]$ sysctl -e -w "net.ipv4.tcp_early_demux=1"
[centos@ip-192-168-9-194 ~]$ sysctl -w "net.ipv4.tcp_early_demux=1"
sysctl: cannot stat /proc/sys/net/ipv4/tcp_early_demux: No such file or directory

[centos@ip-192-168-9-194 ~]$ uname -a
Linux ip-192-168-9-194.us-west-2.compute.internal 3.10.0-1062.1.2.el7.x86_64 #1 SMP Mon Sep 30 14:19:46 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

[centos@ip-192-168-9-194 ~]$
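
The -e (--ignore) flag makes sysctl skip unknown keys instead of failing, which is why the first command above succeeds. A guarded write along these lines would avoid the crash on old kernels (a sketch only, not necessarily what the actual fix does):

# Only attempt the write when the kernel actually exposes the key.
if [ -f /proc/sys/net/ipv4/tcp_early_demux ]; then
    sysctl -w net.ipv4.tcp_early_demux=1
else
    echo "net.ipv4.tcp_early_demux not available on this kernel, skipping"
fi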

@mogren
Contributor

mogren commented Oct 1, 2020

@mickael-ange Thanks for catching this. I've put some more details in #1242, but basically, for kernels older than 4.6, TCP health checks will not work for pods that use security groups for pods. UDP or exec checks will still work, but since this is a kernel issue, the only other option is to run at least 4.6 (released May 2016).

@mickael-ange
Author

mickael-ange commented Oct 1, 2020

Thanks for your prompt answers and the quick PR.

CentOS 7 end of life is scheduled for June 30, 2024. CentOS 8 is only a year old and was not available on AWS until recently, so I have not planned to migrate all my CentOS 7 nodes to CentOS 8 yet. I will try 1.7.2 tomorrow, or maybe 1.7.4 ;). Thanks again.

@mogren
Contributor

mogren commented Oct 1, 2020

@mickael-ange Tomorrow you should be able to try v1.7.4; I'm preparing a new release right now.

Note that TCP health checks will not work on CentOS 7 because of the old kernel (early TCP demux was added in 3.6, the flag to disable it was first added in 4.6). To work around that you will have to use UDP or exec health checks. We will be sure to update the documentation around this.
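
For example, an HTTP-style readiness check can be replaced with an exec probe that runs inside the pod itself, so the probe traffic never has to leave the pod's network namespace (a sketch only; the image name, port and health endpoint are placeholders, and the image is assumed to ship a shell and wget):

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: sgp-probe-demo
spec:
  containers:
  - name: app
    image: my-app:latest   # placeholder image
    ports:
    - containerPort: 8080
    readinessProbe:
      exec:
        command: ["/bin/sh", "-c", "wget -q -O /dev/null http://127.0.0.1:8080/healthz"]
      initialDelaySeconds: 5
      periodSeconds: 10
EOF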

Thanks again for reporting the issue!

@mogren
Contributor

mogren commented Oct 1, 2020

Just released v1.7.4! 🚀

Give it a try with:

kubectl apply -f https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/v1.7.4/config/v1.7/aws-k8s-cni.yaml
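
One way to verify the rollout afterwards (assuming the default aws-node daemonset name and labels):

# The image tag should now show v1.7.4.
kubectl describe daemonset aws-node -n kube-system | grep Image
# The init container should no longer be stuck in Init:CrashLoopBackOff.
kubectl get pods -n kube-system -l k8s-app=aws-node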

@mickael-ange
Author

1.7.4 can start now.

But I'm worried about:

TCP health checks will not work for pods that use security groups for pods. UDP or exec checks will still work.

IIUC, pods configured with readinessProbe: httpGet will work only on kernel >= 4.6 and with tcp_early_demux=1 when SG for pods is enabled for them?

@mogren
Contributor

mogren commented Oct 1, 2020

@mickael-ange Yes, that is correct.

Since early TCP demux has been a "feature" of the kernel TCP stack since 3.6, there is no way around it. The reason this happens is explained at length in #1212, but in short: security groups are enforced on the interface, so we can't shortcut the traffic from kubelet (running in the host network namespace) directly to a pod that has a security group. We need to send it out through eth0 and let it come back in through the pod ENI, passing the SG check. This is somewhat unusual and trips up the kernel, so when the response from the pod comes back, the kernel drops the packet.

To use the per-pod security group feature with probes, you will either need to run a newer kernel or use UDP or exec health checks.
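
As a rough rule of thumb, a node can only use TCP/HTTP probes together with security groups for pods if its kernel is at least 4.6 (a quick check using version sort; the threshold comes from the flag's introduction mentioned above):

# The flag to disable early demux first appeared in kernel 4.6.
required=4.6
current=$(uname -r | cut -d- -f1)
if printf '%s\n%s\n' "$required" "$current" | sort -V -C; then
    echo "kernel $current: TCP/HTTP probes can work with security groups for pods"
else
    echo "kernel $current: too old, use UDP or exec probes instead"
fi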

@mogren mogren added this to the v1.7.4 milestone Oct 2, 2020
@mogren mogren closed this as completed Oct 2, 2020
@mickael-ange
Author

Thanks @mogren for summarizing #1212.

My ultimate goal is to prepare a migration from EKS/EC2 self-managed workers to EKS/Fargate once security groups for pods become available there. However, I'm wondering whether there is a plan to support it on AWS Fargate. If not, I don't really need to use security groups for pods on EC2, since we already use Calico as our network policy engine.

@SaranBalaji90
Contributor

aws/containers-roadmap#625 is probably the one you're looking for.
