
Kernel 5.1+ in EKS Amazon Linux AMI to resolve conntrack race conditions #357

Closed
jaygorrell opened this issue Oct 15, 2019 · 42 comments
Labels
bug Something isn't working

Comments

@jaygorrell

What would you like to be added:
I would like the EKS AMIs to be on at least kernel version 5.1 to address the well-documented[0][1] conntrack race conditions.

Why is this needed:
There are some kernel-level race conditions with conntrack that frequently manifest as timed out DNS lookups in Kubernetes clusters. Two of the three race conditions have been patched in the kernel and are released in v5.1.

The third race condition is mitigated by Local DNS cache requested in another issue[2].

Attachments
[0] https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts
[1] https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02
[2] aws/containers-roadmap#303
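
For anyone trying to confirm they're actually hitting this race on a node, a rough diagnostic (a sketch only, assuming conntrack-tools is available on the worker node) is to watch the conntrack insertion failures:

    # insert_failed / drop counters climbing alongside DNS timeouts is the usual signature of the race
    sudo conntrack -S | grep -E 'insert_failed|drop'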

@owlwalks

owlwalks commented Oct 19, 2019

kernel >= 4.19 will have these necessary patches:
http://patchwork.ozlabs.org/patch/937963/
http://patchwork.ozlabs.org/patch/1032812/

I just rolled my own AMI here (RHEL 8, kernel 5.0) if anyone is interested: https://github.com/localmeasure/amazon-eks-ami/tree/rhel

Updated: iptables is replaced by the nftables backend in RHEL 8 and Debian Buster, and that doesn't play well with kube-proxy; a better bet is sticking with Debian Stretch (iptables 1.6).

I made another AMI (Linux 4.19.0-0.bpo.6-amd64, using the debian-backports kernel)

ref:
kubernetes/kubernetes#71305

@midN

midN commented Oct 22, 2019

How did this go out to GA?

@uvegla

uvegla commented Oct 29, 2019

What is the state of this issue? Is there an ETA for the fix?

@rchernobelskiy

It looks like you can get the 4.19 kernel on the EKS AMI via:
sudo amazon-linux-extras install kernel-ng && sudo reboot
More info:
https://aws.amazon.com/about-aws/whats-new/2019/07/amazon-linux-2-extras-provides-aws-optimized-versions-of-new-linux-kernels/

daniel-ciaglia pushed a commit to TierMobility/amazon-eks-ami that referenced this issue Nov 7, 2019
@rchernobelskiy

rchernobelskiy commented Nov 7, 2019

From looking at this: torvalds/linux@4e35c1c
It seems like the second patch that @owlwalks mentioned isn't in 4.19.
Just FYI @daniel-ciaglia
For others looking to resolve this issue, node-local dns seems to be working for us:
https://aws.amazon.com/blogs/containers/eks-dns-at-scale-and-spikeiness/

@jaygorrell
Author

I'm not quite sure why, but NodeLocalDNS isn't working for our environment. It works -- does what it should do -- but doesn't mitigate the conntrack errors or even greatly reduce DNS latency.

I'm digging into the coredns settings on the local pods now, but my understanding is that the ndots and search settings in k8s are causing a lot of failed lookups before the successful one that gets cached more aggressively. @rchernobelskiy can you share how your services refer to each other? It looks like if you use just the service name (without .default or any other suffix) the first lookup would be the cache hit. We often use <service>.default, which means it goes upstream very often. I'm also seeing much higher latency on external requests that go through all the search domains due to the ndots: 5 setting.
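
For context, the /etc/resolv.conf that kubelet generates in a pod typically looks something like the following (illustrative only; the cluster DNS IP and VPC suffix will differ per cluster), which is why a name like <service>.default walks several search domains before it resolves:

    # example only; actual values depend on the cluster and VPC
    nameserver 10.100.0.10
    search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
    options ndots:5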

@dougwettlaufer

Hey @jaygorrell, I worked with @rchernobelskiy to set this up for us, so I can speak to that. First, if you haven't already, I'd suggest looking at this link for setting up nodelocaldns. I initially started out with master, but it turned out it's not quite stable yet, so it's best to use release-1.16.

Also, the effectiveness of ndots and search will depend on the base image you're using. Since we are primarily on Alpine, we didn't get much benefit out of tweaking those settings. We are seeing slightly higher latency on external requests, but the rest of our stuff that uses APP_NAME.NAMESPACE.svc.cluster.local is all responding pretty well. And when I say "slightly higher", it's still pretty small relative to overall volume.

@jaygorrell
Author

Hey @dougwettlaufer, thanks for the response. I actually have NodeLocalDNS set up and working fine. I can confirm the local DNS cache is working with dig and such. I'm just not seeing improved results.

Like I mentioned before, the conntrack errors that this is supposed to mitigate are still happening with the occasional DNS request taking over 1s, just as before. They're always external DNS records though, which makes me think there's still a conntrack race happening somewhere with all the search domains.

My question about how you're requesting services was more about the way you address them. Are you using the full APP_NAME.NAMESPACE.svc.cluster.local format? If so, that name has fewer dots than the ndots: 5 threshold, which means the resolver would still try every search domain before the outright request that succeeds. If you use just APP_NAME (or lower ndots to 4) it would succeed on the first request.

Anyway, I'm mostly just wondering if anyone else had to tweak the clusterlocaldns configmap to improve caching and such, since we're still getting conntrack errors and occasional DNS latency. I'm going to try manually tuning how it handles external domains, but I don't think we should need to do this.

@jhcook-ag

jhcook-ag commented Nov 25, 2019

I set the use-vc option in /etc/resolv.conf and it resolves the issue.

     dnsConfig:
       options:
         - name: use-vc

@jaygorrell
Author

I set the use-vc option in /etc/resolv.conf and it resolves the issue.

     dnsConfig:
       options:
         - name: use-vc

This is only a partial fix and doesn't work for Alpine-based containers.

@isaacegglestone

isaacegglestone commented Jan 10, 2020

@rchernobelskiy @owlwalks @daniel-ciaglia You probably know this already, but for others who wander here still wondering how to resolve this: I just downloaded the latest Amazon Linux kernel source (linux-4.19.84) onto a clean instance after updating the kernel on it, and I do see that the two patches are in the code now.

Method used to check:
sudo amazon-linux-extras install kernel-ng && sudo reboot
After the reboot:
yumdownloader --source kernel
mkdir tmp-kernel-dir
cd tmp-kernel-dir
rpm2cpio ../kernel-4.19.84-33.70.amzn2.src.rpm | cpio -idmv

cd linux-4.19.84/net/netfilter

less nf_conntrack_core.c

Validate against the two patches above.

We are going to roll out new AMIs with this patch, and I'll update here with our results, whether the problem disappears or not.
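
If it helps anyone repeating that check, a quick way to spot the backports without reading the whole file (a sketch; grepping for the clash-resolution path is my assumption about where those patches land, so adjust the pattern if it differs in your tree):

    # run from linux-4.19.84/net/netfilter after extracting the source as above
    grep -n "resolve_clash" nf_conntrack_core.c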

@vicaya

vicaya commented Jan 21, 2020

Thanks @isaacegglestone! Any ETA for upgrading the AMI to use the new kernel? We need other features that would be enabled by kernel 4.19+ as well.

@imranismail

imranismail commented Feb 10, 2020

This is badly needed. Any ETA for the patch to be GA?

@rtripat
Contributor

rtripat commented Feb 10, 2020

We are working with the Amazon Linux kernel team to backport the fixes into 4.14 kernel. I will keep this thread updated with the progress. Appreciate the patience!

@abhay2101

@rtripat are we also backporting all sockmap related patches which landed in 4.17 to ami 4.14?

@rtripat
Contributor

rtripat commented Mar 17, 2020

The worker AMI release has the conntrack race condition fixes backported into the 4.14 kernel, specifically the following patches:

Please give it a try!

@rtripat
Contributor

rtripat commented Mar 18, 2020

@rtripat are we also backporting all sockmap related patches which landed in 4.17 to ami 4.14?

@abhay2101 : Can you please point me to the specific patches?

@tomfotherby

tomfotherby commented Mar 18, 2020

We use the NodeLocal DNS Cache as recommended by AWS in their blog post titled EKS DNS at scale and spikeiness. The Corefile specifies force_tcp because, in my understanding, they want to avoid the conntrack race conditions affecting UDP. My question is: if the new AMI fixes the conntrack issue, would it be safe to go back to UDP for DNS lookups? (We are seeing periodic timeouts from the upstream AWS DNS server when using TCP for lookups.)
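
For reference, the piece of the node-local-dns Corefile in question is the forward block; a simplified sketch (the real config uses cluster-specific upstream placeholders, so treat this as illustrative only):

    .:53 {
        cache 30
        forward . /etc/resolv.conf {
            force_tcp
        }
    }

Dropping force_tcp there would send lookups back over UDP, which is exactly the trade-off being asked about.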

@betinro

betinro commented Mar 30, 2020

I've just put amazon-eks-node-1.15-v20200312 (ami-0e710550577202c55) on my us-west-2 EKS cluster and I see no difference compared to v20200228. I see the same DNS fail rate. For some pods, if I just try a ping google.com I get about a 70% failure rate.
According to kubectl get nodes I have indeed the latest image:
Amazon Linux 2 / 4.14.171-136.231.amzn2.x86_64 / docker://18.9.9

I had to manually update the Launch Template as it is not yet available in EKS to be rolled out.

I've also set the ndots option to 1, and it actually improves the failure rate, but it's still unusable.
I don't know what else to try to make my EKS cluster usable. The funny thing is that right now I am testing whether EKS is a viable option to switch to from kops.

@imranismail

I've just put amazon-eks-node-1.15-v20200312 (ami-0e710550577202c55) on my us-west-2 EKS cluster and I see no difference compared to v20200228. I see the same DNS fail rate. [...]

My test yielded the same result as yours. No improvement in reducing getaddrinfo errors.

@betinro

betinro commented Mar 31, 2020

Today I built a custom Debian-based image which is supposed to solve this issue. I got inspired by this post and managed to build a Debian-based amazon-eks-node-1.15-v20200331 AMI with Packer. This is what kubectl get nodes reports:
Debian GNU/Linux 9 (stretch) / 4.19.0-0.bpo.6-amd64 / docker://18.6.3

Unfortunately I see no improvement for DNS queries. They still fail at about the same rate.
What's strange though is that it behaves quite differently for different pods. From some pods all nslookups are fast with no timeouts, while from other pods I get more than 50% failures. And there is no considerable difference between these pods; they are just different micro-services, all based on the same Alpine image.
I'm starting to suspect this is not actually a conntrack race issue, or at least not only that. And what's even stranger is that on the kops cluster there isn't a single issue, and those nodes run Ubuntu 18.04.3 with kernel 4.15.0-1054-aws, which is actually older than the Debian one. Both the EKS and kops clusters run the very same YAMLs.
I've already spent three full days on this issue and haven't found a way to make EKS work properly.

@jqmichael

@betinro @imranismail

Any chance we can scale down the # of pods sending requests to verify if DNS throttling could be part of the reason?

Each EC2 instance limits the number of packets that can be sent to the Amazon-provided DNS server to a maximum of 1024 packets per second per network interface.

https://docs.aws.amazon.com/vpc/latest/userguide/vpc-dns.html#vpc-dns-limits
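
One way to check whether that limit is being hit (a sketch; the counter is exposed by newer ENA drivers on Nitro-based instances, and the interface name may differ):

    # a non-zero, increasing linklocal_allowance_exceeded means packets to the
    # VPC resolver and other link-local services are being dropped
    ethtool -S eth0 | grep -i allowance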

@imranismail

imranismail commented Apr 5, 2020

@betinro @imranismail

Any chance we can scale down the # of pods sending requests to verify if DNS throttling could be part of the reason?

Each EC2 instance limits the number of packets that can be sent to the Amazon-provided DNS server to a maximum of 1024 packets per second per network interface.

https://docs.aws.amazon.com/vpc/latest/userguide/vpc-dns.html#vpc-dns-limits

I have set a podAntiAffinity that ensures that it's hosted across different nodes.

podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - topologyKey: kubernetes.io/hostname
    labelSelector:
      matchExpressions:
      - key: k8s-app
        operator: In
        values:
        - kube-dns

@palmbardier

Any change in the last two weeks? Still can't upgrade my EKS cluster from 1.12 until this (these?) is fixed.

@rtripat
Contributor

rtripat commented Apr 13, 2020

We are sorry you continue to face this issue. We have been working on a reproduction and have been unsuccessful so far in the latest EKS worker AMIs.

@palmbardier @betinro : Can you provide more detailed reproduction steps on your environment where this is happening?

@azhurbilo

Issue reproduced several times today with v20200406:
new EC2 VMs come up but do not connect to EKS

Example

<13>Apr 15 06:24:43 user-data: + for param in '$eks_node_sysctl'
<13>Apr 15 06:24:43 user-data: + /sbin/sysctl -w net.netfilter.nf_conntrack_max=500000
<13>Apr 15 06:24:43 user-data: sysctl: cannot stat /proc/sys/net/netfilter/nf_conntrack_max: No such file or directory
[   12.513596] cloud-init[3924]: Apr 15 06:24:43 cloud-init[3924]: util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/part-001 [255]
[   12.532267] cloud-init[3924]: Apr 15 06:24:43 cloud-init[3924]: cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
[   12.534822] cloud-init[3924]: Apr 15 06:24:43 cloud-init[3924]: util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/site-packages/cloudinit/config/cc_scripts_user.pyc'>) failed
...
...
[   12.599714] cloud-init[3924]: Cloud-init v. 19.3-2.amzn2 finished at Wed, 15 Apr 2020 06:24:43 +0000. Datasource DataSourceEc2.  Up 12.59 seconds
[FAILED] Failed to start Execute cloud user/final scripts.
See 'systemctl status cloud-final.service' for details.
[   13.390772] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
[   13.400167] Bridge firewalling registered
[   13.478244] nf_conntrack version 0.5.0 (65536 buckets, 262144 max)
[  OK  ] Started Docker Application Container Engine.
         Starting Restore iptables...
[  OK  ] Started Restore iptables.
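
That sysctl failure is most likely because the nf_conntrack module isn't loaded yet when the user-data script runs; it only registers later (the "nf_conntrack version 0.5.0" line), once Docker and kube-proxy pull it in. A hedged workaround sketch for user-data, if you really need to set it that early:

    # load the module first so /proc/sys/net/netfilter/* exists
    sudo modprobe nf_conntrack
    sudo /sbin/sysctl -w net.netfilter.nf_conntrack_max=500000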

@hugoprudente

Have you guys tested the new Bottlerocket OS, which uses kernel 5.4+, to see whether the conntrack race conditions still occur?

I know it's still in beta, but it would be good to have it documented; my cluster running it is not big enough to hit this issue.

Bottlerocket OS 0.3.1            5.4.16
Bottlerocket OS 0.3.2            5.4.20

@palmbardier

We are sorry you continue to face this issue. We have been working on a reproduction and have been unsuccessful so far in the latest EKS worker AMIs.

@palmbardier @betinro : Can you provide more detailed reproduction steps on your environment where this is happening?

Finally! Can't reproduce using amazon-eks-node-1.15-v20200409 - ami-0d0c1c9bb079158ae !
Thanks anyhow @rtripat !!

@rtripat
Contributor

rtripat commented Apr 21, 2020

We are sorry you continue to face this issue. We have been working on a reproduction and have been unsuccessful so far in the latest EKS worker AMIs.
@palmbardier @betinro : Can you provide more detailed reproduction steps on your environment where this is happening?

Finally! Can't reproduce using amazon-eks-node-1.15-v20200409 - ami-0d0c1c9bb079158ae !
Thanks anyhow @rtripat !!

That's encouraging. Did you mean v20200406 though?

@imranismail

Issue reproduced several times today with v20200406:
new EC2 VMs come up but do not connect to EKS
[full log quoted above]

Same, I've faced the same issue. VM is up but not connected to the internet/master.

@rtripat
Contributor

rtripat commented Apr 23, 2020

Issue reproduced several times today with v20200406:
new EC2 VMs come up but do not connect to EKS
[full log quoted above]

Same, I've faced the same issue. VM is up but not connected to the internet/master.

@imranismail , @azhurbilo An EC2 instance starting successfully but failing to join the EKS cluster can happen for various reasons, such as network connectivity from the instance to the cluster, and is possibly unrelated to this issue. Have you tried looking at the VPC configuration of your cluster?

@betinro

betinro commented Apr 27, 2020

I've managed to fix my DNS-related problem. It looks like it was not related to the conntrack issue. I am not sure exactly what the problem itself was, but it was definitely related to the fact that I had created my EKS cluster in an existing VPC. That VPC already contained a kops K8s cluster plus some other workloads/instances outside the cluster. In theory they shouldn't cause problems for each other.
Once I created my EKS cluster in a dedicated VPC everything was fine, using managed node groups and the same AMI (1.15.10-20200228) which initially failed for me.

To optimize your cluster you may also consider using FQDN addresses, at least for internal ones. It also helps to set the ndots DNS parameter to 1. These two optimizations alone helped me a lot in the old cluster that had DNS problems; they lowered the failure rate considerably.

Hopefully these tips will help somebody out there.
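
For anyone wanting to try the ndots change without touching node images, it can be set per pod spec; a minimal sketch:

     dnsConfig:
       options:
         - name: ndots
           value: "1"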

@debu99

debu99 commented Oct 17, 2020

any update?

@rtripat
Contributor

rtripat commented Dec 14, 2020

We were planning to release the 5.4.x kernel with the EKS 1.19 release, followed by updating the AMIs of all supported Kubernetes versions. However, an Amazon engineer found a kernel bug in versions 4.19.86 and higher. We have submitted a patch upstream and are waiting for it to be merged.

@martinezleoml

Hi! Could you give us an update on the subject? We are facing a lot of DNS resolution failures, causing almost 50% of our CI jobs (running on EKS) to fail.

We're using amazon-eks-node-1.18-v20201211 AMI.

@mmerkes
Member

mmerkes commented Feb 16, 2021

For customers looking to use Linux kernel 5.4 as a solution to this, there are now two options.

1. Upgrade the kernel using amazon-linux-extras

sudo amazon-linux-extras install kernel-5.4
sudo reboot

2. Upgrade to Kubernetes version 1.19

EKS optimized AMIs 1.19+ will now come with the 5.4 kernel by default, so you can upgrade your clusters to 1.19 to use the latest AMI with 5.4.
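
Either way, you can confirm which kernel each node actually ended up on straight from the Kubernetes API (a quick check, nothing EKS-specific):

    kubectl get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion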

@mmerkes
Member

mmerkes commented Feb 16, 2021

Also, please let us know if upgrading the kernel doesn't resolve the issue for you.

@martinezleoml

martinezleoml commented Feb 25, 2021

Thanks a lot!
Since we upgraded the kernel to 5.4 a week ago (while running Kubernetes 1.18), we have not encountered DNS resolution issues anymore. 🎉

@mmerkes
Member

mmerkes commented Feb 25, 2021

That's great news :D I'm going to resolve this issue since it's primarily focused on upgrading the kernel, and that seems to resolve the problems for some customers. However, if you've upgraded the kernel or are using the 1.19 AMI and are still seeing these problems, please open a new issue.

@mmerkes mmerkes closed this as completed Feb 25, 2021
@mmerkes
Member

mmerkes commented Mar 12, 2021

We've had a report from a customer that they are still seeing issues with conntrack race conditions on the 5.4 kernel. It may be related to customers using the CNI's per-pod security groups feature. We've been working with the Amazon Linux team over the last couple of days and have identified patches that seem to fix the issue. Work is being done to merge these patches and get them out in the next AL2 release. Once a new kernel is released, customers can run yum update kernel and reboot their nodes to get the latest patch. Alternatively, EKS will release a new EKS optimized AMI once the patch is available, and you can replace your nodes with the latest AMIs.

We'll update here when the patch and the new AMIs are available.
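
For the in-place route mentioned above, a minimal per-node sketch once the patched kernel is published (drain the node first if your workloads are sensitive to restarts):

    sudo yum update kernel -y
    sudo reboot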

@angadisachin

With kernel 5.4.95-42.163.amzn2.x86_64, the issue is not just related to DNS lookups. At a certain load, especially on bigger instances, there were also significant packet drops with nf_conntrack errors in the host's /var/log/messages. Rolling back to the EKS AMI for 1.18 (kernel 4.14) helped resolve our issue.

kernel: nf_conntrack: nf_conntrack: table full, dropping packet
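
For anyone debugging the same symptom, the table usage versus its ceiling can be checked directly on the node; a quick diagnostic sketch:

    # current entries vs. the configured maximum; kube-proxy normally sizes the
    # maximum at startup via its conntrack flags
    sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max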

@mmerkes
Member

mmerkes commented May 28, 2021

@mmerkes mmerkes closed this as completed May 28, 2021