
docs: add DNS configuration & verification steps to the docs #80

Open
andy108369 opened this issue Mar 3, 2023 · 2 comments

andy108369 commented Mar 3, 2023

K8s DNS resolution breaks in pods when the host uses DHCP and/or a bad DNS search domain is configured.

  • all pods inherit the DNS search domains from the host's /etc/resolv.conf file (kubelet does this);
  • depending on the DNS search domains, this can break DNS resolution; lookups fail with a SERVFAIL error (host google.com, dig google.com, nslookup google.com) (see the quick check below);
  • it was rather tricky to figure out why the DNS search domains were still slipping into the /etc/resolv.conf file;
  • it turned out accept-ra is enabled in netplan by default => refs https://bugs.launchpad.net/netplan/+bug/1858503
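A quick way to see this from inside the cluster (a minimal sketch; assumes a reachable busybox image and a throwaway pod named dnstest):

# Show the resolv.conf a pod actually receives; the host's search domains show up here
kubectl run dnstest --image=busybox:1.36 --restart=Never --rm -it -- cat /etc/resolv.conf

# With a bad search domain present, this fails with SERVFAIL
kubectl run dnstest --image=busybox:1.36 --restart=Never --rm -it -- nslookup google.com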

The working netplan config:

root@node1:~# cat /etc/netplan/00-installer-config.yaml
# This is the network config written by 'subiquity'
network:
  version: 2
  renderer: networkd
  ethernets:
    ens160:
      dhcp4: true
      dhcp4-overrides:
        use-domains: false
      # disable accept-ra, otherwise it will bring search domains to your /etc/resolv.conf
      # refs https://bugs.launchpad.net/netplan/+bug/1858503
      accept-ra: false
      nameservers:
        addresses: [8.8.8.8, 8.8.4.4]
        search: []

root@node1:~# netplan try    # test the new config with automatic rollback on timeout
root@node1:~# netplan apply

root@node1:~# resolvectl domain ens160
Link 2 (ens160):

The empty output confirms no search domains remain on the link. Then bounce all kube-system pods so they pick up the fixed resolv.conf:

kubectl -n kube-system delete pods --all

Also worth checking what Domains= is set to in the /etc/systemd/resolved.conf file on the host. Restart the systemd-networkd and systemd-resolved services if necessary. Use resolvectl domain and networkctl status | grep Search to verify. Also check /run/systemd/network/10-netplan-eth0.network and grep DOMAINS /run/systemd/netif/state /run/systemd/netif/links/*.
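Collected as a host-side checklist (the same commands as above; the 10-netplan-eth0.network unit name will differ per interface):

# Host-side verification of stray search domains
grep -i domains /etc/systemd/resolved.conf
resolvectl domain
networkctl status | grep Search
grep DOMAINS /run/systemd/netif/state /run/systemd/netif/links/*

# Restart the resolvers if anything was changed
systemctl restart systemd-networkd systemd-resolved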

We should document this case and give users verification steps so they can confirm DNS works properly once they have set up their K8s cluster.

Alternative fix

The provider owner can also change dnsPolicy from Default to ClusterFirst for the coredns deployment and the nodelocaldns daemonset, which fixes this behavior even when a bad DNS search domain is present in the /etc/resolv.conf file:

kubectl patch deployment coredns -n kube-system --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/dnsPolicy", "value": "ClusterFirst"}]'
kubectl -n kube-system delete pods -l k8s-app=kube-dns

kubectl patch daemonset nodelocaldns -n kube-system --type=json -p='[{"op": "replace", "path": "/spec/template/spec/dnsPolicy", "value": "ClusterFirst"}]'
kubectl -n kube-system delete pods -l k8s-app=nodelocaldns
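To confirm the patches took effect, both of these should now print ClusterFirst:

kubectl -n kube-system get deployment coredns -o jsonpath='{.spec.template.spec.dnsPolicy}{"\n"}'
kubectl -n kube-system get daemonset nodelocaldns -o jsonpath='{.spec.template.spec.dnsPolicy}{"\n"}'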

More about dnsPolicy: https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-dns-policy
Also worth reading: https://kubernetes.io/docs/tasks/administer-cluster/dns-custom-nameservers/

@andy108369 andy108369 added the docs label Mar 3, 2023
@andy108369 andy108369 self-assigned this Mar 3, 2023
andy108369 commented
It looks like the alternative fix alone is not enough.

I just hit the issue where nodelocaldns kept crashing in CrashLoopBackOff:

# kubectl -n kube-system logs nodelocaldns-v5s45
2023/03/20 14:26:13 [INFO] Starting node-cache image: 1.21.1
2023/03/20 14:26:13 [INFO] Using Corefile /etc/coredns/Corefile
2023/03/20 14:26:13 [INFO] Using Pidfile 
2023/03/20 14:26:13 [ERROR] Failed to read node-cache coreFile /etc/coredns/Corefile.base - open /etc/coredns/Corefile.base: no such file or directory
2023/03/20 14:26:13 [INFO] Skipping kube-dns configmap sync as no directory was specified
cluster.local.:53 on 169.254.25.10
in-addr.arpa.:53 on 169.254.25.10
ip6.arpa.:53 on 169.254.25.10
.:53 on 169.254.25.10
[INFO] plugin/reload: Running configuration MD5 = adf97d6b4504ff12113ebb35f0c6413e
CoreDNS-1.7.0
linux/amd64, go1.16.8, 
[FATAL] plugin/loop: Loop (10.233.90.150:43697 -> 169.254.25.10:53) detected for zone "ip6.arpa.", see https://coredns.io/plugins/loop#troubleshooting. Query: "HINFO 4300695667419388152.5105250931155992130.ip6.arpa."

The only fix was removing the bad search domains by configuring netplan as described in the original post (disabling the domains option coming via DHCP) and restarting the kube-system pods.
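To see where nodelocaldns forwards queries (a quick check; assumes the kubespray-default ConfigMap name nodelocaldns):

kubectl -n kube-system get cm nodelocaldns -o yaml | grep -B1 -A2 forward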

andy108369 commented May 31, 2023

Final fix for the FATAL loop error / CrashLoopBackOff in the nodelocaldns pods

The issue is very well explained in kubernetes-sigs/kubespray#9948 (comment)

The ad-hoc quick fix

This is for quick verification, instead of waiting for kubespray to finish, which can take up to an hour.

1. Update the coredns config, changing:

forward . /etc/resolv.conf {

TO:

forward . 8.8.8.8 {

Replace 8.8.8.8 with your preferred DNS server(s).
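One way to make the change in place (assumes the kubespray-default ConfigMap name coredns; the forward line lives in the Corefile key):

kubectl -n kube-system edit cm coredns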

2. Bounce the coredns pods:

kubectl -n kube-system delete pods -l k8s-app=kube-dns

3. Bounce the nodelocaldns pods:

kubectl -n kube-system delete pods -l k8s-app=nodelocaldns

Permanent fix

Set upstream_dns_servers in your inventory and run kubespray over your environment again:

$ cd kubespray
kubespray$ grep -A2 upstream_dns_servers inventory/akash/group_vars/all/all.yml
upstream_dns_servers:
  - 8.8.8.8
  - 8.8.4.4
kubespray$ source venv/bin/activate
kubespray$ ansible-playbook -i inventory/akash/hosts.yaml -b -v cluster.yml

Verify

kubectl -n kube-system get cm coredns -o yaml | grep forward
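With the inventory above, the forward line should now point at the upstream servers instead of /etc/resolv.conf, i.e. something like:

forward . 8.8.8.8 8.8.4.4 {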

Bounce the coredns and nodelocaldns pods in this order:

kubectl -n kube-system delete pods -l k8s-app=kube-dns

kubectl -n kube-system delete pods -l k8s-app=nodelocaldns

Verify all pods are in Running state:

kubectl -n kube-system get pods -l k8s-app=kube-dns
kubectl -n kube-system get pods -l k8s-app=nodelocaldns
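As a final end-to-end check (a minimal sketch; assumes a reachable busybox image), confirm both cluster and external names resolve without SERVFAIL:

kubectl run dnstest --image=busybox:1.36 --restart=Never --rm -it -- nslookup kubernetes.default
kubectl run dnstest --image=busybox:1.36 --restart=Never --rm -it -- nslookup google.com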

@andy108369 andy108369 added the repo/provider Akash provider-services repo issues label Oct 17, 2023
@anilmurty anilmurty moved this to Up Next (prioritized) in Core Product and Engineering Roadmap Oct 17, 2023