
docs: add DNS configuration & verification steps to the docs #80

Open
andy108369 opened this issue Mar 3, 2023 · 2 comments

andy108369 commented Mar 3, 2023

K8s DNS resolution breaks in pods when the host uses DHCP and/or a bad DNS search domain is configured.

  • all pods inherit the DNS search domains from the host's /etc/resolv.conf file (kubelet does this);
  • depending on the DNS search domains, this can break DNS resolution; lookups fail with a SERVFAIL error (host google.com, dig google.com, nslookup google.com) (see the quick check below);
  • it was rather tricky to figure out why the DNS search domains were still slipping into the /etc/resolv.conf file;
  • it turned out accept-ra is enabled in netplan by default => refs https://bugs.launchpad.net/netplan/+bug/1858503
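A quick way to see this from inside the cluster (a minimal sketch; assumes a reachable busybox image and a throwaway pod named dnstest):

# Show the resolv.conf a pod actually receives; the host's search domains show up here
kubectl run dnstest --image=busybox:1.36 --restart=Never --rm -it -- cat /etc/resolv.conf

# With a bad search domain present, this fails with SERVFAIL
kubectl run dnstest --image=busybox:1.36 --restart=Never --rm -it -- nslookup google.com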

The working netplan config:

root@node1:~# cat /etc/netplan/00-installer-config.yaml
# This is the network config written by 'subiquity'
network:
  version: 2
  renderer: networkd
  ethernets:
    ens160:
      dhcp4: true
      dhcp4-overrides:
        use-domains: false
      # disable accept-ra, otherwise it will bring search domains to your /etc/resolv.conf
      # refs https://bugs.launchpad.net/netplan/+bug/1858503
      accept-ra: false
      nameservers:
        addresses: [8.8.8.8, 8.8.4.4]
        search: []

root@node1:~# netplan try    # test the new config with automatic rollback on timeout
root@node1:~# netplan apply

root@node1:~# resolvectl domain ens160
Link 2 (ens160):

The empty output confirms no search domains remain on the link. Then bounce all kube-system pods so they pick up the fixed resolv.conf:

kubectl -n kube-system delete pods --all

Also worth checking what Domains= is set to in the /etc/systemd/resolved.conf file on the host. Restart the systemd-networkd and systemd-resolved services if necessary. Use resolvectl domain and networkctl status | grep Search to verify. Also check /run/systemd/network/10-netplan-eth0.network and grep DOMAINS /run/systemd/netif/state /run/systemd/netif/links/*.
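Collected as a host-side checklist (the same commands as above; the 10-netplan-eth0.network unit name will differ per interface):

# Host-side verification of stray search domains
grep -i domains /etc/systemd/resolved.conf
resolvectl domain
networkctl status | grep Search
grep DOMAINS /run/systemd/netif/state /run/systemd/netif/links/*

# Restart the resolvers if anything was changed
systemctl restart systemd-networkd systemd-resolved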

We should document this case and give users verification steps so they can confirm DNS works properly once they have set up their K8s cluster.

Alternative fix

The provider owner can also change dnsPolicy from Default to ClusterFirst for the coredns deployment and the nodelocaldns daemonset, which fixes this behavior even when a bad DNS search domain is present in the /etc/resolv.conf file:

kubectl patch deployment coredns -n kube-system --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/dnsPolicy", "value": "ClusterFirst"}]'
kubectl -n kube-system delete pods -l k8s-app=kube-dns

kubectl patch daemonset nodelocaldns -n kube-system --type=json -p='[{"op": "replace", "path": "/spec/template/spec/dnsPolicy", "value": "ClusterFirst"}]'
kubectl -n kube-system delete pods -l k8s-app=nodelocaldns
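To confirm the patches took effect, both of these should now print ClusterFirst:

kubectl -n kube-system get deployment coredns -o jsonpath='{.spec.template.spec.dnsPolicy}{"\n"}'
kubectl -n kube-system get daemonset nodelocaldns -o jsonpath='{.spec.template.spec.dnsPolicy}{"\n"}'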

More about dnsPolicy: https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-dns-policy
Also worth reading: https://kubernetes.io/docs/tasks/administer-cluster/dns-custom-nameservers/

@andy108369 andy108369 added the docs label Mar 3, 2023
@andy108369 andy108369 self-assigned this Mar 3, 2023
andy108369 commented
It looks like the alternative fix alone is not enough.

I just hit the issue where nodelocaldns kept crashing in CrashLoopBackOff:

# kubectl -n kube-system logs nodelocaldns-v5s45
2023/03/20 14:26:13 [INFO] Starting node-cache image: 1.21.1
2023/03/20 14:26:13 [INFO] Using Corefile /etc/coredns/Corefile
2023/03/20 14:26:13 [INFO] Using Pidfile 
2023/03/20 14:26:13 [ERROR] Failed to read node-cache coreFile /etc/coredns/Corefile.base - open /etc/coredns/Corefile.base: no such file or directory
2023/03/20 14:26:13 [INFO] Skipping kube-dns configmap sync as no directory was specified
cluster.local.:53 on 169.254.25.10
in-addr.arpa.:53 on 169.254.25.10
ip6.arpa.:53 on 169.254.25.10
.:53 on 169.254.25.10
[INFO] plugin/reload: Running configuration MD5 = adf97d6b4504ff12113ebb35f0c6413e
CoreDNS-1.7.0
linux/amd64, go1.16.8, 
[FATAL] plugin/loop: Loop (10.233.90.150:43697 -> 169.254.25.10:53) detected for zone "ip6.arpa.", see https://coredns.io/plugins/loop#troubleshooting. Query: "HINFO 4300695667419388152.5105250931155992130.ip6.arpa."

The only fix was removing the bad search domains by configuring netplan as described in the original post (disabling the domains option coming via DHCP) and restarting the kube-system pods.
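To see where nodelocaldns forwards queries (a quick check; assumes the kubespray-default ConfigMap name nodelocaldns):

kubectl -n kube-system get cm nodelocaldns -o yaml | grep -B1 -A2 forward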

andy108369 commented May 31, 2023

Final fix for the FATAL loop error / CrashLoopBackOff in the nodelocaldns pods

The issue is very well explained in kubernetes-sigs/kubespray#9948 (comment)

The ad-hoc quick fix

This is for quick verification, instead of waiting for kubespray to finish, which can take up to an hour.

1. Update the coredns config, changing:

forward . /etc/resolv.conf {

TO:

forward . 8.8.8.8 {

Replace 8.8.8.8 with your preferred DNS server(s).
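One way to make the change in place (assumes the kubespray-default ConfigMap name coredns; the forward line lives in the Corefile key):

kubectl -n kube-system edit cm coredns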

2. Bounce the coredns pods:

kubectl -n kube-system delete pods -l k8s-app=kube-dns

3. Bounce the nodelocaldns pods:

kubectl -n kube-system delete pods -l k8s-app=nodelocaldns

Permanent fix

Set upstream_dns_servers in your inventory and run kubespray over your environment again:

$ cd kubespray
kubespray$ grep -A2 upstream_dns_servers inventory/akash/group_vars/all/all.yml
upstream_dns_servers:
  - 8.8.8.8
  - 8.8.4.4
kubespray$ source venv/bin/activate
kubespray$ ansible-playbook -i inventory/akash/hosts.yaml -b -v cluster.yml

Verify

kubectl -n kube-system get cm coredns -o yaml | grep forward
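With the inventory above, the forward line should now point at the upstream servers instead of /etc/resolv.conf, i.e. something like:

forward . 8.8.8.8 8.8.4.4 {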

Bounce the coredns and nodelocaldns pods in this order:

kubectl -n kube-system delete pods -l k8s-app=kube-dns

kubectl -n kube-system delete pods -l k8s-app=nodelocaldns

Verify all pods are in Running state:

kubectl -n kube-system get pods -l k8s-app=kube-dns
kubectl -n kube-system get pods -l k8s-app=nodelocaldns
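As a final end-to-end check (a minimal sketch; assumes a reachable busybox image), confirm both cluster and external names resolve without SERVFAIL:

kubectl run dnstest --image=busybox:1.36 --restart=Never --rm -it -- nslookup kubernetes.default
kubectl run dnstest --image=busybox:1.36 --restart=Never --rm -it -- nslookup google.com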

@andy108369 andy108369 added the repo/provider Akash provider-services repo issues label Oct 17, 2023
@anilmurty anilmurty moved this to Up Next (prioritized) in Core Product and Engineering Roadmap Oct 17, 2023