DNS not working after reboot #2383

Open
hobyte opened this issue Jul 23, 2021 · 20 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@hobyte

hobyte commented Jul 23, 2021

What happened:
I created a new kind cluster, then rebooted my computer. After the reboot, DNS cannot resolve addresses.

What you expected to happen:

DNS can resolve addresses

How to reproduce it (as minimally and precisely as possible):

  • create a new kind cluster
  • test DNS: it's working
  • reboot your machine (don't stop Docker before the reboot)
  • test DNS again:
#APISERVER=https://kubernetes.default.svc
#SERVICEACCOUNT=/var/run/secrets/kubernetes.io/serviceaccount
#NAMESPACE=$(cat ${SERVICEACCOUNT}/namespace)
#TOKEN=$(cat ${SERVICEACCOUNT}/token)
#CACERT=${SERVICEACCOUNT}/ca.crt
#curl --cacert ${CACERT} --header "Authorization: Bearer ${TOKEN}" -X GET ${APISERVER}/api
curl: (6) Could not resolve host: kubernetes.default.svc

Taken from https://kubernetes.io/docs/tasks/run-application/access-api-from-pod/#without-using-a-proxy
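
As an alternative quick check from inside the cluster (a minimal sketch; the pod name dnstest and the busybox image are just examples, and busybox's nslookup output can be a bit quirky):

kubectl run -it --rm --restart=Never dnstest --image=busybox -- nslookup kubernetes.default
kubectl run -it --rm --restart=Never dnstest --image=busybox -- nslookup github.com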

Anything else we need to know?:

  • dns pods are running
  • dns logs:
.:53
[INFO] plugin/reload: Running configuration MD5 = db32ca3650231d74073ff4cf814959a7
CoreDNS-1.8.0
linux/amd64, go1.15.3, 054c9ae

dns lookup:

#nslookup kubernetes.default
Server:         10.96.0.10
Address:        10.96.0.10#53

Name:   kubernetes.default.svc.cluster.local
Address: 10.96.0.1
#nslookup kubernetes.default.svc
;; connection timed out; no servers could be reached

resolv.conf:

#cat /etc/resolv.conf 
search default.svc.cluster.local svc.cluster.local cluster.local fritz.box
nameserver 10.96.

Environment:

  • kind version: (use kind version): kind v0.11.1 go1.16.4 linux/amd64
  • Kubernetes version: (use kubectl version): Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:18:45Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-21T23:01:33Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
  • Docker version: (use docker info): Client:
    Context: default
    Debug Mode: false

Server:
Containers: 5
Running: 2
Paused: 0
Stopped: 3
Images: 11
Server Version: 20.10.6-ce
Storage Driver: btrfs
Build Version: Btrfs v4.15
Library Version: 102
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 1
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: oci runc io.containerd.runc.v2 io.containerd.runtime.v1.linux
Default Runtime: runc
Init Binary: docker-init
containerd version: 05f951a3781f4f2c1911b05e61c160e9c30eaa8e
runc version: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
init version:
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 5.3.18-59.16-default
Operating System: openSUSE Leap 15.3
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 7.552GiB
Name: Proxima-Centauri
ID: M6J5:OLHQ:FXVM:M7WG:2OUA:SKGW:UCF5:DWJZ:4M7T:YA2W:6FBT:DOLG
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

  • OS (e.g. from /etc/os-release): NAME="openSUSE Leap"
    VERSION="15.3"
    ID="opensuse-leap"
    ID_LIKE="suse opensuse"
    VERSION_ID="15.3"
    PRETTY_NAME="openSUSE Leap 15.3"
    ANSI_COLOR="0;32"
    CPE_NAME="cpe:/o:opensuse:leap:15.3"
    BUG_REPORT_URL="https://bugs.opensuse.org"
    HOME_URL="https://www.opensuse.org/"
hobyte added the kind/bug label on Jul 23, 2021
@aojea
Contributor

aojea commented Jul 27, 2021

I assume this snippet is a copy-paste error; it is missing the last two digits of the IP address:

#cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local fritz.box
nameserver 10.96.

Are you using one node or multiple nodes in the cluster?
Clusters with multiple nodes don't handle reboots.
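
A quick way to check (a sketch; "kind" is the default cluster name, swap in yours if you used --name):

kind get clusters
kind get nodes --name kind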

BenTheElder added the triage/needs-information label on Jul 29, 2021
@faiq
Contributor

faiq commented Aug 25, 2021

Hi, I'm also running into this issue! Although I'm not sure it's necessarily caused by a restart in my case.

$  kubectl run -it --rm --restart=Never busybox1 --image=busybox sh
If you don't see a command prompt, try pressing enter.
/ # nslookup kubernetes.default
Server:		10.96.0.10
Address:	10.96.0.10:53

** server can't find kubernetes.default: NXDOMAIN

*** Can't find kubernetes.default: No answer

/ # 

Here is what I get when I inspect the kind network

$ docker network inspect kind
[
    {
        "Name": "kind",
        "Id": "7d815ef0d0c4adc297aa523aa3336ba89bc6d7212373d3098f12169618c16563",
        "Created": "2021-08-24T16:41:41.258730207-07:00",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": true,
        "IPAM": {
            "Driver": "default",
            "Options": {},
            "Config": [
                {
                    "Subnet": "172.18.0.0/16",
                    "Gateway": "172.18.0.1"
                },
                {
                    "Subnet": "fc00:f853:ccd:e793::/64"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "1c47d1b38fe7b0b75e71c21c150aba4d5110ade54d74e2f3db45c5d15d013c59": {
                "Name": "konvoy-capi-bootstrapper-control-plane",
                "EndpointID": "4b176452133a1881380cae8b3fc55963ec0427ee809bc1b678d261f3c1711931",
                "MacAddress": "02:42:ac:12:00:02",
                "IPv4Address": "172.18.0.2/16",
                "IPv6Address": "fc00:f853:ccd:e793::2/64"
            }
        },
        "Options": {
            "com.docker.network.bridge.enable_ip_masquerade": "true",
            "com.docker.network.driver.mtu": "1454"
        },
        "Labels": {}
    }
]
$ kind get nodes --name konvoy-capi-bootstrapper
konvoy-capi-bootstrapper-control-plane

output from ip addr

$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp0s31f6: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
    link/ether 48:2a:e3:0a:7a:8c brd ff:ff:ff:ff:ff:ff
3: wlp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 30:24:32:43:a0:e9 brd ff:ff:ff:ff:ff:ff
    inet 192.168.42.76/24 brd 192.168.42.255 scope global dynamic noprefixroute wlp2s0
       valid_lft 83634sec preferred_lft 83634sec
    inet6 fe80::c3e2:7427:34c8:c265/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
25: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1400 qdisc noqueue state DOWN group default 
    link/ether 02:42:0c:bc:be:aa brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
28: br-7d815ef0d0c4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1454 qdisc noqueue state UP group default 
    link/ether 02:42:08:aa:2f:bb brd ff:ff:ff:ff:ff:ff
    inet 172.18.0.1/16 brd 172.18.255.255 scope global br-7d815ef0d0c4
       valid_lft forever preferred_lft forever
    inet6 fc00:f853:ccd:e793::1/64 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::42:8ff:feaa:2fbb/64 scope link 
       valid_lft forever preferred_lft forever
    inet6 fe80::1/64 scope link 
       valid_lft forever preferred_lft forever
30: vethba7cc46@if29: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1454 qdisc noqueue master br-7d815ef0d0c4 state UP group default 
    link/ether 82:3a:43:df:a0:c1 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::803a:43ff:fedf:a0c1/64 scope link 
       valid_lft forever preferred_lft forever

Finally, logs from a CoreDNS pod:

35365->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. AAAA: read udp 10.244.0.6:36799->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. A: read udp 10.244.0.6:55841->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. AAAA: read udp 10.244.0.6:38716->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. A: read udp 10.244.0.6:51342->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. AAAA: read udp 10.244.0.6:46009->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. A: read udp 10.244.0.6:33070->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. AAAA: read udp 10.244.0.6:34194->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. A: read udp 10.244.0.6:56925->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. AAAA: read udp 10.244.0.6:35681->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. A: read udp 10.244.0.6:42683->172.18.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 rhel82-tester-faiq2-apiserver-1592573265.us-west-2.elb.amazonaws.com.gateway.sonic.net. A: read udp 10.244.0.6:40842->172.18.0.1:53: i/o timeout

@AlmogBaku
Member

AlmogBaku commented Nov 6, 2021

Hey, for us the same issue happens after stopping/rebooting Docker.
The same issue keeps reproducing on 2 different hosts @RomansWorks

Edit: we're running a single node setup, with the following config (copied from the website):

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    kubeadmConfigPatches:
      - |
        kind: InitConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            node-labels: "ingress-ready=true"
    extraPortMappings:
      - containerPort: 80
        hostPort: 80
        protocol: TCP
      - containerPort: 443
        hostPort: 443
        protocol: TCP

@BenTheElder
Member

@AlmogBaku I still can't reproduce this in any of our environments. We need to know more about yours.

@AlmogBaku
Member

That usually happens after I've closed Docker a few times.

Both @RomansWorks and I are using macOS.

@alexandresgf

alexandresgf commented Dec 8, 2021

I have the same issue here in my dev environment... the weird thing is that when I connect into the pod using bash and try nslookup, DNS works, as you can see in the screenshot below:

(screenshot)

But when I try it from my application, the name cannot be resolved and nothing works... and there is no error returned (which is weird too):

(screenshot)

However, if I use the pod IP directly, it works normally...

(screenshot)

My stack is:

  • Docker 20.10.11
  • K8s 1.21.1 (kindest/node default, but I already tested with all other supported versions)
  • Kind 0.11.1 (single cluster)


@aojea
Contributor

aojea commented Dec 8, 2021

@alexandresgf please don't use screenshots, they are hard to read.

Is this problem happening after a reboot, or did it never work?

@alexandresgf

alexandresgf commented Dec 10, 2021

@alexandresgf please don't use screenshots, they are hard to read.

Sorry for that!

Is this problem happening after a reboot, or did it never work?

At first it worked for a while, then it suddenly happened after a reboot and DNS never worked again, even after removing kind completely and doing a fresh install.

@brpaz

brpaz commented Oct 17, 2022

I have a similar problem. I created a local kind cluster and it was working fine during the entire weekend, but today, after I rebooted my PC, DNS is completely down. I tried restarting Docker, and even manually restarting the CoreDNS container, but that doesn't fix the issue.

I got errors like this all over my containers:

 dial tcp: lookup notification-controller.flux-system.svc.cluster.local. on 10.96.0.10:53: read udp 10.244.0.3:52830->10.96.0.10:53: read: connection refused"

And it's not only the internal network. Even external requests are failing with the same error:

dial tcp: lookup github.com on 10.96.0.10:53: read udp 10.244.0.15:41035->10.96.0.10:53: read: connection refused'

Any idea?

@ben-foxmoore

I observe the same issues when using KinD in a WSL2/Windows 11 environment. Example logs from the CoreDNS pod:

[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
E0202 14:14:20.711784       1 reflector.go:127] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:156: Failed to watch *v1.Namespace: failed to list *v1.Namespace: Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: connect: network is unreachable
E0202 14:14:22.917864       1 reflector.go:127] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:156: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: connect: network is unreachable

@aojea
Contributor

aojea commented Feb 2, 2023

pkg/mod/k8s.io/[email protected]

This is an old version. Also, WSL2/Windows 11 environments have some known issues. Are you using the latest version?

This bug is starting to become a placeholder. I wonder if we should close it and open more specific bugs; a cluster not working after a reboot is not the same on Windows as with Podman, or with Lima, ...

@ben-foxmoore

Hi @aojea, which component are you saying is outdated?

I'm using kind 0.17.0 and I created the cluster using the command kind create cluster --image kindest/node:v1.21.14@sha256:9d9eb5fb26b4fbc0c6d95fa8c790414f9750dd583f5d7cee45d92e8c26670aa1 which is listed as a supported image in the 0.17.0 release.

I don't believe any of the WSL2 known issues are related to this? They all seem to be related to Docker Desktop behaviour.

@cobbgcall

cobbgcall commented Nov 19, 2024

I had this issue in a local cluster (Podman + kind):
Server: Podman Engine
Version: 5.0.0-dev-8a643c243
API Version: 5.0.0-dev-8a643c243
Go Version: go1.21.8

kind v0.20.0 go1.20.4

Image: registry.k8s.io/coredns/coredns:v1.10.1
Image ID: sha256:97e04611ad43405a2e5863ae17c6f1bc9181bdefdaa78627c432ef754a4eb108

To fix it, I updated the forward directive in the coredns ConfigMap:

forward . 8.8.8.8 {
max_concurrent 1000
}
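
For reference, a rough sketch of applying that kind of change (assumes the default coredns ConfigMap and Deployment in kube-system; 8.8.8.8 is only an example upstream):

# change "forward . /etc/resolv.conf" to "forward . 8.8.8.8 { max_concurrent 1000 }" in the Corefile
kubectl -n kube-system edit configmap coredns
# restart CoreDNS so it reloads the new Corefile
kubectl -n kube-system rollout restart deployment coredns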

@maze88

maze88 commented Jan 3, 2025

Somewhat of a niche case, but here is what caused and solved the issue for me (which occurred in a KinD Kubernetes cluster)...

  • Checking the CoreDNS configuration file (the Corefile) via kubectl get configmap -n kube-system coredns -o yaml shows that all (.) DNS queries are handled via /etc/resolv.conf (forward . /etc/resolv.conf).
  • On my host I have a custom /etc/resolv.conf that directs my DNS queries to a custom local DNS server (a PiHole) via two addresses, the first being on my local network (10.0.0.0/24) and the second over a personal VPN (10.10.10.0/24). Like this:
    nameserver 10.0.0.12
    nameserver 10.10.10.31
    nameserver 1.1.1.1
    
  • On my host: the first nameserver entry drifted to an incorrect value (probably due to changes in upstream DHCP) and was no longer correct/reachable. I was not aware of this silent failure, since the host OS simply failed over to the second option (and could also have failed over to the third) and continued to successfully resolve DNS queries.
  • In the Kubernetes cluster: CoreDNS uses only one of the nameservers from the host (more about this in the K8s DNS debugging docs), which happened to be the faulty one (from my previous point). That is what led to the DNS query failures within the cluster.

Hopefully this helps someone else too!
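
If you want to compare what the host uses with what CoreDNS forwards to, a small sketch (assumes a default kind setup):

# resolvers the host is configured with
cat /etc/resolv.conf
# what CoreDNS forwards non-cluster queries to
kubectl -n kube-system get configmap coredns -o jsonpath='{.data.Corefile}' | grep forward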

@BenTheElder
Member

Thanks @maze88 I bet this is the case for others as well.

@aojea perhaps another reason to switch to our own DNS proxy service at the node level instead of the iptables hacks ...

@aojea
Contributor

aojea commented Jan 7, 2025

@maze88 the kind node does not use that resolv.conf directly.
Docker adds its own DNS server on the magic IP 127.0.0.11, then kind rewrites it to the host IP so pods don't end up with a loopback address in their resolv.conf:

chain DOCKER_OUTPUT {
        ip daddr 192.168.8.1 tcp dport 53 counter packets 0 bytes 0 dnat to 127.0.0.11:35987
        ip daddr 192.168.8.1 udp dport 53 counter packets 112 bytes 8858 dnat to 127.0.0.11:37345
}

chain DOCKER_POSTROUTING {
        ip saddr 127.0.0.11 tcp sport 35987 counter packets 0 bytes 0 snat to 192.168.8.1:53
        ip saddr 127.0.0.11 udp sport 37345 counter packets 0 bytes 0 snat to 192.168.8.1:53
}

That kind of problem is not possible to fix from within kind.
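
To see this on a running cluster, something along these lines (a sketch; assumes the node container is named kind-control-plane, and the output format depends on whether the node uses iptables or nftables):

# the node's resolv.conf points at the rewritten address, not the host's resolvers
docker exec kind-control-plane cat /etc/resolv.conf
# the DNAT/SNAT rules that send that address back to Docker's embedded resolver on 127.0.0.11
docker exec kind-control-plane iptables-save -t nat | grep DOCKER_ || docker exec kind-control-plane nft list ruleset | grep -A 4 DOCKER_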

@BenTheElder
Member

In the Kubernetes cluster: CoreDNS uses only one of the nameservers from the host (more about this in the K8s DNS debugging docs), which happened to be the faulty one (from my previous point). That is what led to the DNS query failures within the cluster.

Wait, that should not be happening; it should be using the embedded DNS resolver. Are you using a custom KIND_EXPERIMENTAL_DOCKER_NETWORK or something like that? (~unsupported and likely to break things, really tempted to remove this)

@maze88

maze88 commented Jan 7, 2025

In the Kubernetes cluster: CoreDNS uses only one of the nameservers from the host (more about this in the K8s DNS debugging docs), which happened to be the faulty one (from my previous point). That is what led to the DNS query failures within the cluster.

Wait, that should not be happening; it should be using the embedded DNS resolver. Are you using a custom KIND_EXPERIMENTAL_DOCKER_NETWORK or something like that? (~unsupported and likely to break things, really tempted to remove this)

I'm using kind v0.20.0 go1.20.4 linux/amd64, with this config file. The resulting cluster has the following CoreDNS configuration:

kubectl get cm -n kube-system coredns -o yaml | grep forward
        forward . /etc/resolv.conf {

With a brief test (fiddling with my /etc/resolv.conf) I couldn't recreate the issue, but that may also be related to it being 1 AM here..! d:
If you'd like to instruct me to attempt any other specific tests, let me know.

@BenTheElder
Member

Hmm, v0.20 is a little old but I don't think we changed much related to this since.

That CoreDNS config looks right, but /etc/resolv.conf here will be the resolv.conf inside the node, not the one on your host, and it should have been rewritten to point at the Docker embedded DNS resolver and NOT the resolvers from your host config.

https://docs.docker.com/engine/network/#dns-services (we use a custom network)

kind does some hackery to change the IP used for that resolver, but it should still be that one container-local resolver socket and not the host resolvers.

However, those host resolvers will be used indirectly via the Docker daemon, so rather than CoreDNS, it may be dockerd on the host that fails to fall back to the secondary resolver. That is not something kind can fix, but it would explain what you observed.
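
One way to test whether dockerd itself can resolve upstream names through its embedded resolver (a sketch; assumes the cluster is on the default "kind" network and that pulling busybox works):

# dockerd uses the host's DNS configuration unless overridden in /etc/docker/daemon.json
cat /etc/resolv.conf
grep -i dns /etc/docker/daemon.json 2>/dev/null
# resolve an external name via the embedded resolver from a throwaway container on the kind network
docker run --rm --network kind busybox nslookup github.com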

@BenTheElder
Member

The embedded resolver Docker provides is basically just a socket injected into containers attached to a network other than the default bridge; queries to it are then resolved in dockerd.

The intention in leveraging this in kind is:

  1. we get hostname resolution for the container names (which gives us a stable reference even if their IPs change)
  2. DNS resolution for "upstreams" / external addresses (versus Kubernetes services) should respect all host settings, because it actually happens on the host (in dockerd) when upstreams are resolved (see the sketch below)
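
A rough illustration of both points from inside a node (assumes a node container named kind-control-plane):

# 1. node container names resolve via Docker's embedded DNS (stable even if the IP changes)
docker exec kind-control-plane getent hosts kind-control-plane
# 2. external names are resolved by dockerd on the host, so host DNS settings apply
docker exec kind-control-plane getent hosts github.com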
