[EKS] [request]: Nodelocal DNS Cache #303
Comments
After way too much time spent on this I think I got this working. I'd love a review from somebody who knows what they're doing tho 😅 I promise I'll create a pretty blog post or a PR to the GitHub AWS EKS docs. So, the
Not replacing those will lead to the lovely Now the question comes up: what to replace those values with? After way too much wasted time I found out that the amazing eksctl already supports Node-Local DNS caches! They do have a very nice PR with a description showing what to replace those values with eksctl-io/eksctl#550. TL;DR:
Applying the yaml will work then! Buuut using netshoot and running Running The cluster also needs to be changed to have Unfortunately I was using the Terraform community EKS module so this was not as simple. After some research it actually is pretty simple: just add Changes got applied, all existing nodes were manually terminated, new nodes came up. Redoing the above checks shows nodelocal is indeed used! 🎉 Now, all that said, I don't know much about networking. Does the above look sane? Can this be run in production? This comment confirms it to be safe in 1.12 even (I highly recommend reading the whole discussion there). |
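For anyone repeating the verification step mentioned above: the quickest check is the `;; SERVER:` line that `dig` prints when run from a debug pod (e.g. netshoot). A tiny sketch of that check, assuming the default EKS addresses (169.254.20.10 for the NodeLocal cache, 172.20.0.10 for the kube-dns Service):

```shell
# Classify which resolver answered, based on dig's ";; SERVER:" line.
# Addresses are assumptions: 169.254.20.10 = NodeLocal cache, 172.20.0.10 = kube-dns.
check_server() {
  case "$1" in
    *169.254.20.10*) echo "nodelocal" ;;
    *172.20.0.10*)   echo "kube-dns"  ;;
    *)               echo "unknown"   ;;
  esac
}

# In practice you'd feed this the last lines of:
#   kubectl run tmp --rm -it --image=nicolaka/netshoot -- dig example.com
check_server ";; SERVER: 169.254.20.10#53(169.254.20.10)"
```

If the output says `kube-dns`, the node's kubelet is still pointing pods at the cluster DNS Service rather than the local cache.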
I would also be interested in this feature being fleshed out/supported with EKS |
@ghostsquad that's my bad as I linked to the In |
Thank you for the response! |
I just got this set up myself and it's "working" great -- meaning, DNS requests are going to |
As promised, blog post about this is up on the AWS Containers Blog: EKS DNS at scale and spikeiness! It's basically my first post here with more details and helpful debugging hints. |
Was this blog post removed? Anyone still have this? |
Yes the blog post is not viewable anymore for me too |
Hi everyone - you can find instructions for installing the node-local DNS cache here: https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/ |
A copy of the blog can be found at: https://www.vladionescu.me/posts/eks-dns.html |
@otterley would you use the instructions you pointed out, where the link to set up the Local DNS cache resources in Kubernetes leads to this file in the master branch: https://github.com/kubernetes/kubernetes/blob/master/cluster/addons/dns/nodelocaldns/nodelocaldns.yaml Or from the eks-dns post:
Which leads to a different branch to set up the LocalDNS Cache resources: We are currently running EKS version 1.14. (There is no problem to upgrade to 1.15 if needed) |
Addons are not that tied to the k8s version. IIRC for NodeLocal DNS there is a pre-1.16 version (which requires kubelet changes) and a post-1.16 version (which requires no kubelet changes). Very high chances I am wrong on this as I haven't kept up to date with the changes. |
Hi, is there a way to use NodeLocal DNS cache with EKS managed node group? In that case it is impossible to set |
I followed the instructions closely but running netshoot and dig example.com I still see 172.20.0.10. nodelocaldns pods are running without crashing and the logs are not showing any errors:
I am running EKS 1.14 and using TF to control this cluster. I am using Any advice would be appreciated. |
@aimanparvaiz hm... That's odd. Let's try to debug it. Since the pod is still using
@Vlaaaaaaad thanks for responding. I am using image: k8s.gcr.io/k8s-dns-node-cache:1.15.3 and I got yaml from master branch. (this might be the issue) This is a new node, I updated TF and manually removed the older nodes. I am using this to specify new node
On this same node localdns is bound correctly. Used the same override flag to specify host. I do see: |
I grabbed yaml from release-1.15, unless I need to refresh nodes again, I am still seeing the same behavior. |
@aimanparvaiz did you find the root cause after all? I remember this moving to Slack, but no conclusion. Maybe your solution will help other people too 🙂 |
@Vlaaaaaaad I am not sure if I can safely say that I found the root cause. I deployed the latest version of Nodelocal DNS Cache, swapped out eks nodes with newer ones and the errors stopped. Thanks for all your help along with Chance Zibolski. Here is the link to complete slack conversation if anyone is interested: https://kubernetes.slack.com/archives/C8SH2GSL9/p1596646078276000. |
I'm using EKS's kubernetes 1.17, and I don't quite understand whether I can use the nodelocaldns yaml file from the master branch, or do I have to take the one from the release-1.17 branch.

```diff
$ diff nodelocaldns-1.17.yaml nodelocaldns-master.yaml
100,102c100
< forward . __PILLAR__UPSTREAM__SERVERS__ {
< force_tcp
< }
---
> forward . __PILLAR__UPSTREAM__SERVERS__
124,125c122,126
< labels:
< k8s-app: node-local-dns
---
> labels:
> k8s-app: node-local-dns
> annotations:
> prometheus.io/port: "9253"
> prometheus.io/scrape: "true"
133a135,138
> - effect: "NoExecute"
> operator: "Exists"
> - effect: "NoSchedule"
> operator: "Exists"
136c141
< image: k8s.gcr.io/k8s-dns-node-cache:1.15.7
---
> image: k8s.gcr.io/dns/k8s-dns-node-cache:1.15.14
```
@Vlaaaaaaad any chance you know this ^^ ? |
Hey @dorongutman! Apologies, I am rather busy with some personal projects and I forgot to answer this 😞 My blog post is in desperate need of an update, and right now I lack the bandwidth for that. I hope I'll get to it before the end of the year, but we'll see. Hm... based on the updated NodeLocalDNS docs there are only a couple of variables that need changing. The other variables are replaced by NodeLocalDNS when it starts. Not at all confusing 😄
There also seems to be no need to set As I said, I've got no bandwidth to actually test the latest NodeLocalDNS --- this comment is just a bunch of assumptions from my side. If any of y'all have the time to test it and blog about it, I can help review! |
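As an untested sketch of what "only a couple of variables need changing" looks like in practice: with iptables-mode kube-proxy, only `__PILLAR__LOCAL__DNS__`, `__PILLAR__DNS__DOMAIN__`, and `__PILLAR__DNS__SERVER__` are replaced by hand; the remaining `__PILLAR__*__` variables are filled in by node-cache itself at startup. The addresses below are the usual EKS defaults and are assumptions:

```shell
# Manual variable substitution for the newer NodeLocalDNS manifests
# (iptables-mode kube-proxy assumed; addresses are EKS defaults).
localdns="169.254.20.10"   # link-local IP the node-local cache listens on
domain="cluster.local"
kubedns="172.20.0.10"      # assumption: default EKS kube-dns ClusterIP

# Demonstrated on a single manifest line; the real run targets nodelocaldns.yaml:
line="bind __PILLAR__LOCAL__DNS__ __PILLAR__DNS__SERVER__"
echo "$line" | sed "s/__PILLAR__LOCAL__DNS__/$localdns/g; s/__PILLAR__DNS__SERVER__/$kubedns/g; s/__PILLAR__DNS__DOMAIN__/$domain/g"
```

The same three `s///` expressions applied with `sed -i` over the whole downloaded `nodelocaldns.yaml` is all the manual templating the newer docs describe.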
So I got NodeLocalDNS working but I needed coredns to serve as a backup.
169.254.20.10 is NodeLocalDNS That works but when I tested failover by spinning down NodeLocalDNS pods, nothing gets resolved. I expected that it would look up 10.100.0.10 but nothing is showing up. |
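For reference, a sketch of the pod-level fallback described above, expressed as a `dnsConfig` (10.100.0.10 is assumed to be the kube-dns ClusterIP in that cluster). Note that a typical glibc resolver only tries the second nameserver after the first times out, so failover is slow rather than instant, while musl queries all nameservers in parallel:

```yaml
# Sketch: pod with an explicit resolver list (names/values are illustrative).
apiVersion: v1
kind: Pod
metadata:
  name: dns-fallback-demo     # hypothetical name
spec:
  dnsPolicy: "None"           # ignore the kubelet-provided defaults entirely
  dnsConfig:
    nameservers:
      - 169.254.20.10         # NodeLocal DNS cache (link-local)
      - 10.100.0.10           # assumed kube-dns ClusterIP, used as fallback
    searches:
      - default.svc.cluster.local
      - svc.cluster.local
      - cluster.local
    options:
      - name: ndots
        value: "5"
  containers:
    - name: netshoot
      image: nicolaka/netshoot
      command: ["sleep", "infinity"]
```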
I hope this helps someone. I struggled with this for a while. As long as you're using your own EC2 worker nodes you have access to modify the kubelet args (which is a requirement for this). I personally use terraform for this, but this pretty much just creates the following launch configuration user data (notice the --kubelet-extra-args)
So as per above you're looking to add this to your kubelet so that ALL of your nodes will use this IP for DNS queries.
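The elided user-data snippet presumably looked something like the sketch below (assumptions: self-managed nodes on the EKS-optimized AMI, NodeLocal listening on 169.254.20.10, and a placeholder cluster name). `--cluster-dns` is the kubelet flag that makes every pod's `/etc/resolv.conf` point at the node-local cache:

```shell
#!/usr/bin/env bash
# Sketch of launch-configuration user data for self-managed EKS nodes.
# CLUSTER_NAME is a hypothetical placeholder; 169.254.20.10 is the
# link-local address the NodeLocal DNS cache binds on each node.
CLUSTER_NAME="my-cluster"
NODELOCAL_IP="169.254.20.10"

# bootstrap.sh forwards everything in --kubelet-extra-args straight to kubelet:
cmd="/etc/eks/bootstrap.sh ${CLUSTER_NAME} --kubelet-extra-args '--cluster-dns=${NODELOCAL_IP}'"
echo "$cmd"
```

On a real node you would run the composed command instead of echoing it; it is shown here only to illustrate where `--kubelet-extra-args` goes.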
After that it's pretty cake, jam this configmap (below) into the yaml you can find here.
If you are using managed nodes you are SOL. As far as I know it's not on their roadmap to implement this on managed nodes (although they really should). |
My customer is interested in this feature. They need to be able to set the DNS config on the node for NodeLocal DNS. Extra args are not currently supported with the EKS-optimized AMI. |
I have been running Nodelocal DNS Cache on EKS optimized AMI since 2019. |
You no longer even need to do the Just follow the instructions here and replace Or for the lazy just run: |
the following script should install NodeLocal DNS:

```bash
#!/usr/bin/env bash
# Docs: https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/
version="master"
curl -sLO "https://raw.githubusercontent.com/kubernetes/kubernetes/${version}/cluster/addons/dns/nodelocaldns/nodelocaldns.yaml"
kubedns=$(kubectl get svc kube-dns -n kube-system -o jsonpath={.spec.clusterIP})
domain="cluster.local"
localdns="169.254.20.10"
# If kube-proxy is running in IPTABLES mode:
sed -i "s/__PILLAR__LOCAL__DNS__/$localdns/g; s/__PILLAR__DNS__DOMAIN__/$domain/g; s/__PILLAR__DNS__SERVER__/$kubedns/g" nodelocaldns.yaml
# If kube-proxy is running in IPVS mode:
# sed -i "s/__PILLAR__LOCAL__DNS__/$localdns/g; s/__PILLAR__DNS__DOMAIN__/$domain/g; s/,__PILLAR__DNS__SERVER__//g; s/__PILLAR__CLUSTER__DNS__/$kubedns/g" nodelocaldns.yaml
kubectl create -f nodelocaldns.yaml
```

but what if the cluster is already running |
From the architecture diagram here I imagine they |
Question regarding the NodeLocal cache setup. According to this blog post, when running a pod and running
However, when I open logging for the NodeLocal instance running for the node the pod is using, I do see the request going through NodeLocal:
Am I missing something? |
Is there anything different when we run this in Calico setup, rather than AWS CNI? |
@YuvalItzchakov according to this post, the newer versions of the app set an iptables output chain, you can verify via:

```bash
kubectl -n kube-system exec -it \
  $(kubectl get po -n kube-system -l k8s-app=node-local-dns -o jsonpath='{.items[].metadata.name}') \
  -- iptables -L OUTPUT
```
I've done the setup on EKS, but the NodeLocal DNS pods do not resolve DNS queries, while the cluster-wide CoreDNS does. I haven't replaced the nodes yet, just applied the config according to the official docs and hoped the iptables magic would work. That is similar to @YuvalItzchakov's problem described a few comments ago. After checking out @cilindrox's iptables output I see the correct binding. However, when I checked iptables directly on the node, it does not use
Any idea why the iptables output would differ in a K8s pod and on the node? This seems to be the reason why the magic traffic routing to the local IP for DNS resolution does not work. |
iptables rules are scoped to the network namespace in which they are created. Pods run in their own network namespace (unless you configure |
It seems that my approach to observability was wrong. I wanted to check if the DNS request was sent to the local IP or the cluster IP, but apparently it listens on both on the node:
So even DNS requests sent to the 172.20.0.10 IP are being resolved locally. That means my previous attempt at observability (looking at the IP of the resolver) was flawed and we need a different one. |
@Vlaaaaaaad , I have followed your posts and successfully deploy the node local dns, thank you for your guide. As I am new to this, can I check with you despite that node local dns cached DNS records, but as soon as CoreDNS goes down (eg. scale to zero for example); query for any services in the cluster (eg. kubernetes.default.svc.cluster.local) will fail immediately. Is this something expected? Wonder why it fails instantly since it has cached record from kube-dns... Is there any setting for node local dns to response until record TTL expired? |
@tanvp112 if you want this behavior, you can edit the Docs for I verified that this works in my own installation. I think there are tradeoffs to this approach though. It may actually be better for DNS queries to fail instead of responding with potentially incorrect information. |
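The elided option above is presumably CoreDNS's `serve_stale`, which lets the cache answer with expired entries while the upstream is unreachable. A sketch of what the `cluster.local` block of the node-local-dns Corefile could look like with it enabled (the addresses are the EKS defaults used elsewhere in this thread, and the one-hour duration is an arbitrary assumption):

```
cluster.local:53 {
    errors
    cache {
        success 9984 30
        denial 9984 5
        serve_stale 1h
    }
    reload
    loop
    bind 169.254.20.10 172.20.0.10
    forward . 172.20.0.10 {
        force_tcp
    }
    health 169.254.20.10:8080
}
```

As the comment above notes, serving stale answers trades correctness for availability: a Service whose ClusterIP changed while CoreDNS was down would resolve to the old address.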
@mrparkers, thanks for the guidance. Does this mean that by default node local dns only caches external domain names and not the cluster domain (*.cluster.local)? The Kubernetes document and GKE document suggest that CoreDNS/kube-dns will only be contacted when there's a cache miss... if true, a query to node local dns such as kubernetes.default.svc.cluster.local should not fail instantly when CoreDNS/kube-dns is down. |
@tanvp112 I suspect if you're caching the result of |
@edify42, TTL should help here, unless |
@denniswebb I installed the nodelocaldns cache in my EKS 1.27 cluster following the instructions closely, but it didn't take over for some reason. My pods were still getting the coredns service IP as nameserver. When I pass
Thank you very much. |
Hello @denniswebb |
I followed all the steps mentioned here and I'm getting a strange behavior: |
My NodelocalDNS configuration looks like most of yours; I double-checked the replaced values (generated using the docs). Corefile:
```
cluster.local:53 {
    errors
    cache {
        success 9984 30
        denial 9984 5
    }
    reload
    loop
    bind 169.254.20.10 172.20.0.10
    forward . 172.20.0.10 {
        force_tcp
    }
    health 169.254.20.10:8080
}
in-addr.arpa:53 {
    errors
    cache 30
    reload
    loop
    bind 169.254.20.10 172.20.0.10
    forward . 172.20.0.10 {
        force_tcp
    }
}
.:53 {
    errors
    cache 30
    reload
    loop
    bind 169.254.20.10 172.20.0.10
    forward . /etc/resolv.conf
}
```
Not sure if it's relevant, but I'm using EKS version 1.28 and |
Those lookups come from pods in your cluster where the DNS options have not lowered ndots below 5. |
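For anyone hitting this, a hedged sketch of lowering `ndots` for a single workload (the value is illustrative, not a recommendation):

```yaml
# Pod-spec fragment: with ndots lowered, a name like example.com (1 dot < ndots)
# is tried as an absolute name first instead of walking all the search domains,
# which is what produces those cluster.local lookups for external names.
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"   # assumption: pick a value that fits your workload
```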
Hello All, Please help me resolve this issue. Below is
Test with
Log of
There are some timeout errors that
The configuration for node local dns is probably the same for all of you, but I'm attaching it. |
Is there a way to install node-local-dns on a managed eks nodegroup? |
Tell us about your request
I would like an officially documented and supported method for installing the Kubernetes Node Local DNS Cache Addon.
Which service(s) is this request for?
EKS
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
Kubernetes clusters with a high request rate often experience high rates of failed DNS lookups. For example, this affects us when using the AWS SDKs, particularly with Alpine/musl-libc containers.
The Nodelocal DNS Cache aims to resolve this (together with kernel patches in 5.1 to fix a conntrack race condition).
Nodelocal DNS Addon
Kubeadm is aiming to support Nodelocal-dns-cache in 1.15. k/k #70707
Are you currently working around this issue?
Retrying requests at the application level which fail due to DNS errors.
Additional context
Kubernetes DNS issues include:
- the `single-request` option, and it appears this will not be changed [2][3]

Attachments
[0] https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/Config.html#retryDelayOptions-property
[1] https://lkml.org/lkml/2019/2/28/707
[2] https://blog.quentin-machu.fr/2018/06/24/5-15s-dns-lookups-on-kubernetes/
[3] https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts
[4] https://www.openwall.com/lists/musl/2015/10/22/15
[5] https://docs.aws.amazon.com/vpc/latest/userguide/vpc-dns.html#vpc-dns-limits