
DNS Issue #255

Open
MosheMoradSimgo opened this issue Feb 19, 2017 · 53 comments
@MosheMoradSimgo

MosheMoradSimgo commented Feb 19, 2017

Hi,

We are running Alpine (3.4) in a Docker container on a Kubernetes cluster (GCP).

We have been seeing anomalies where a thread gets stuck for 2.5 seconds.
After some investigation with strace, we saw that DNS resolution times out once in a while.

Here are some examples:

23:18:27 recvfrom(5, "\f\361\201\203\0\1\0\0\0\1\0\0\2db\6devone\5*****\3net\3svc\7cluster\5local\0\0\1\0\1\7cluster\5local\0\0\6\0\1\0\0\0<\0D\2ns\3dns\7cluster\5local\0\nhostmaster\7cluster\5local\0X\243\213\360\0\0p\200\0\0\34 \0\t:\200\0\0\0<", 512, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.3.240.10")}, [16]) = 148 <0.000045>
23:18:27 recvfrom(5, 0x7ffdd0e1fb90, 512, 0, 0x7ffdd0e1f640, 0x7ffdd0e1f61c) = -1 EAGAIN (Resource temporarily unavailable) <0.000014>
23:18:27 clock_gettime(CLOCK_REALTIME, {1487114307, 714908396}) = 0 <0.000015>
23:18:27 poll([{fd=5, events=POLLIN}], 1, 2499) = 0 (Timeout) <2.502024>

09:04:27 recvfrom(5<UDP:[0.0.0.0:36148]>, "\354\211\201\203\0\1\0\0\0\1\0\0\2db\6devone\5*****\3net\3svc\7cluster\5local\0\0\1\0\1\7cluster\5local\0\0\6\0\1\0\0\0<\0D\2ns\3dns\7cluster\5local\0\nhostmaster\7cluster\5local\0X\244\30\220\0\0p\200\0\0\34 \0\t:\200\0\0\0<", 512, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.3.240.10")}, [16]) = 148 <0.000041>
09:04:27 recvfrom(5<UDP:[0.0.0.0:36148]>, 0x7ffec3d9b0b0, 512, 0, 0x7ffec3d9ab60, 0x7ffec3d9ab3c) = -1 EAGAIN (Resource temporarily unavailable) <0.000011>
09:04:27 clock_gettime(CLOCK_REALTIME, {1487149467, 555317749}) = 0 <0.000008>
09:04:27 poll([{fd=5<UDP:[0.0.0.0:36148]>, events=POLLIN}], 1, 2498) = 0 (Timeout) <2.499671>


09:18:47 recvfrom(5<UDP:[0.0.0.0:47282]>, " B\201\200\0\1\0\1\0\0\0\0\2db\6devone\5*****\3net\0\0\1\0\1\300\f\0\1\0\1\0\0\0\200\0\4h\307\16N", 512, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.3.240.10")}, [16]) = 53 <0.000011>
09:18:47 recvfrom(5<UDP:[0.0.0.0:47282]>, 0x7ffdd0e1fb90, 512, 0, 0x7ffdd0e1f640, 0x7ffdd0e1f61c) = -1 EAGAIN (Resource temporarily unavailable) <0.000008>
09:18:47 clock_gettime(CLOCK_REALTIME, {1487150327, 679292144}) = 0 <0.000005>
09:18:47 poll([{fd=5<UDP:[0.0.0.0:47282]>, events=POLLIN}], 1, 2497) = 0 (Timeout) <2.498797>

And a good example:

08:22:25 recvfrom(5<UDP:[0.0.0.0:59162]>, "\20j\201\203\0\1\0\0\0\1\0\0\2db\6devone\5*****\3net\3svc\7cluster\5local\0\0\34\0\1\7cluster\5local\0\0\6\0\1\0\0\0<\0D\2ns\3dns\7cluster\5local\0\nhostmaster\7cluster\5local\0X\244\n\200\0\0p\200\0\0\34 \0\t:\200\0\0\0<", 512, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.3.240.10")}, [16]) = 148 <0.000014>
08:22:25 recvfrom(5<UDP:[0.0.0.0:59162]>, 0x7ffec3d9aeb0, 512, 0, 0x7ffec3d9ab60, 0x7ffec3d9ab3c) = -1 EAGAIN (Resource temporarily unavailable) <0.000011>
08:22:25 clock_gettime(CLOCK_REALTIME, {1487146945, 638264715}) = 0 <0.000010>
08:22:25 poll([{fd=5<UDP:[0.0.0.0:59162]>, events=POLLIN}], 1, 2498) = 1 ([{fd=5, revents=POLLIN}]) <0.000010>
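
For anyone trying to reproduce this, a loop along the following lines surfaces the intermittent stalls (a minimal sketch; the hostname is a placeholder for any name served by kube-dns, and ping is used because it resolves through musl's getaddrinfo()):

# hypothetical repro loop: any lookup that hits the retry path
# shows up as "real ~2.5s" instead of a few milliseconds
i=0
while [ "$i" -lt 100 ]; do
  time ping -c 1 db.devone.svc.cluster.local >/dev/null
  i=$((i+1))
done 2>&1 | grep real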

In the past we already had some DNS resolution issues with an older version (3.3), which were resolved when we moved to 3.4 (or so we thought).

Is this a known issue?
Does anybody have a solution, workaround, or suggestion?

Thanks a lot.

@Sartner

Sartner commented Feb 25, 2017

I have the same issue.
Alpine: 3.5
Docker: 1.13.1-cs2

/ # time ping -c 1 dev11
PING dev11 (10.1.100.11): 56 data bytes
64 bytes from 10.1.100.11: seq=0 ttl=63 time=0.211 ms

--- dev11 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.211/0.211/0.211 ms
real    0m 2.50s
user    0m 0.00s
sys     0m 0.00s

@rawat-he

Hi,

With the latest version (3.5), I am experiencing the error below.

fetch http://dl-4.alpinelinux.org/alpine/v3.5/community/x86_64/APKINDEX.tar.gz
ERROR: http://dl-4.alpinelinux.org/alpine/v3.5/community: DNS lookup error
fetch http://dl-4.alpinelinux.org/alpine/v3.5/community/x86_64/APKINDEX.tar.gz
WARNING: Ignoring http://dl-4.alpinelinux.org/alpine/v3.5/community/x86_64/APKINDEX.tar.gz: DNS lookup error
fetch http://dl-4.alpinelinux.org/alpine/v3.5/main/x86_64/APKINDEX.tar.gz
ERROR: http://dl-4.alpinelinux.org/alpine/v3.5/main: DNS lookup error
fetch http://dl-4.alpinelinux.org/alpine/v3.5/main/x86_64/APKINDEX.tar.gz
WARNING: Ignoring http://dl-4.alpinelinux.org/alpine/v3.3/main/x86_64/APKINDEX.tar.gz: DNS lookup error
ERROR: unsatisfiable constraints:
  bash (missing):
    required by: world[bash]
  ca-certificates (missing):
    required by: world[ca-certificates]
  curl (missing):
    required by: world[curl]

Can anyone please help me resolve this so I can move forward?

Thanks

@andyshinn
Contributor

The latter two comments don't sound like the same issue. This seems like a Kubernetes-specific thing. Do you know if it happens only to Alpine containers, or does it affect others as well? I've heard of intermittent DNS resolution issues in Kubernetes, but they were not specific to Alpine.

@c24w

c24w commented Jun 2, 2017

We're seeing slow DNS resolution in alpine:3.4 (not in Kubernetes):

$ time docker run --rm alpine:3.4 nslookup google.com
nslookup: can't resolve '(null)': Name does not resolve    

Name:      google.com        
Address 1: 216.58.204.78 lhr25s13-in-f78.1e100.net         
Address 2: 216.58.204.78 lhr25s13-in-f78.1e100.net         
Address 3: 216.58.204.78 lhr25s13-in-f78.1e100.net         
Address 4: 2a00:1450:4009:814::200e lhr25s13-in-x0e.1e100.net

real    0m2.996s             
user    0m0.010s             
sys     0m0.005s  

Versus Busybox:

$ time docker run --rm busybox nslookup google.com
Server:    10.108.88.10      
Address 1: 10.108.88.10      

Name:      google.com        
Address 1: 2a00:1450:4009:814::200e lhr25s13-in-x0e.1e100.net
Address 2: 216.58.204.78 lhr25s13-in-f14.1e100.net         
Address 3: 216.58.204.78 lhr25s13-in-f14.1e100.net         
Address 4: 216.58.204.78 lhr25s13-in-f14.1e100.net

real    0m0.545s             
user    0m0.011s             
sys     0m0.007s

Not sure what the null error suggests, but it might be related!

Docker version 17.05.0-ce, build 89658be

@mpashka

mpashka commented Aug 3, 2017

I have an issue with DNS resolution in Alpine.
My /etc/resolv.conf has several search suffixes (six of them). During DNS resolution I see that my DNS server answers only the first 6 or 7 requests (this is DNS DoS protection), yet according to the strace output, Alpine makes 2 requests for each search suffix.

The Ubuntu Docker image doesn't have this problem: it makes only one request per search suffix.

So is it possible to fix this behaviour and make only 1 request to the DNS server per search suffix? This matters because Kubernetes usually adds 3 search suffixes, so if we have more than one search suffix of our own and a DNS server that limits requests from a single IP, we will most likely run into DNS resolution problems.
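
For illustration, a resolv.conf along these lines (names hypothetical) makes a single getaddrinfo() call fire up to 12 UDP queries, two (A and AAAA) per suffix, which is enough to trip a limit of 6 or 7 requests:

# hypothetical /etc/resolv.conf with six search suffixes
nameserver 10.3.240.10
search ns1.example.com ns2.example.com ns3.example.com default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5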

@justlooks

Yes, the latest Alpine image has a problem with DNS resolution. All my app images built on Alpine have the same problem on Kubernetes v1.7.0.


[root@k8s-master nfstest]# kubectl exec -it testme --namespace demo  -- nslookup heapster.kube-system
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      heapster.kube-system
Address 1: 10.100.249.248 heapster.kube-system.svc.cluster.local
[root@k8s-master nfstest]# kubectl exec -it testme --namespace demo  -- nslookup http-svc.kube-system
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      http-svc.kube-system
Address 1: 10.102.217.7 http-svc.kube-system.svc.cluster.local
[root@k8s-master nfstest]# kubectl exec -it testme --namespace demo  -- nslookup ftpserver-service.demo
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

nslookup: can't resolve 'ftpserver-service.demo'

@mpashka

mpashka commented Aug 11, 2017

During my investigation I found that the problem is with my DNS server.
Some time ago Alpine didn't support the resolv.conf options 'search' and 'domain', but that is no longer the case. They also claim they resolve in parallel, so results can differ, but that is not the cause here either.
I found that Alpine makes 2 requests because one is for IPv4 (an A record) and the other is for IPv6 (an AAAA record).
My trouble is related to the DNS server itself. If there are several search domains in resolv.conf and the DNS server reports 'Server failure' (RCODE = 2) for one of them, Alpine retries that name. If the DNS server reports 'No such name' (RCODE = 3), Alpine continues with the next search domain. Ubuntu, on the other hand, doesn't treat 'Server failure' (RCODE = 2) as fatal and simply continues with the remaining search domains.
You can check the DNS server's RCODE for a specific name with
# dig @<dns_server> dns_name_to_check
and look at the 'status:' field: it will be NXDOMAIN ('No such name', RCODE = 3) or SERVFAIL.
BTW, nslookup operates in the same manner: it respects the RCODE and stops if the DNS server responds 'Server failure' (RCODE = 2).
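
For example, against a hypothetical server and names (dig output trimmed to the relevant header line):

# dig @10.3.240.10 db.devone.example.net.cluster.local
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 21712

# dig @10.3.240.10 db.devone.somebaddomain
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 40963

The first suffix is safely skipped; with musl, the second one stops the whole lookup.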

chrishiestand added a commit to zesty-io/redis-k8s-statefulset that referenced this issue Oct 4, 2017
testing this without alpine because alpine bug might be
 causing issue where redis does not resolve new ip address:
 gliderlabs/docker-alpine#255

* also pin to 4.0
@zq-david-wang

zq-david-wang commented Apr 10, 2018

I tried the Alpine Docker image 3.7, with /etc/resolv.conf as follows:

nameserver 10.254.0.100
search  localdomain  somebaddomain
options ndots:5

My DNS server "10.254.0.100" manages its own domain 'localdomain' and forwards queries for other domains to an external DNS server.
When I query google.com, the Alpine DNS client will:

  1. try google.com.localdomain, and get an "NXDomain" response
  2. try google.com.somebaddomain, and get a "Refused" response; after receiving a "Refused"/"SERVFAIL" response, the Alpine client keeps retrying "google.com.somebaddomain", resulting in the final failure.

I also tried the centos/ubuntu Docker images; their DNS clients give up on a "Refused"/"SERVFAIL" response and move on to try "google.com", which gets the expected response.

Is retrying the same name after receiving a "Refused"/"SERVFAIL" response the secure/expected behavior, or is it a bug in Alpine?
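
For anyone who wants to watch this happen, capturing DNS traffic from inside the container shows the client re-sending the same query after the Refused/SERVFAIL answer (a sketch; the interface name is an assumption):

# capture the container's DNS traffic and look for repeated qnames
apk add --no-cache tcpdump
tcpdump -ni eth0 udp port 53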

@KIVagant

KIVagant commented May 11, 2018

We are probably hitting the same issue. Two different containers running in parallel in the same cluster:

  • the image with 3.5.2 works normally; AWS DNS resolves in 0.01 s
  • the image with 3.7.0 has a big lag; DNS may take 5 seconds to resolve, or fail to resolve at all.

@zioalex

zioalex commented May 25, 2018

For the DNS delay, try adding the line:
options single-request
in the resolv.conf
See https://wiki.archlinux.org/index.php/Domain_name_resolution#Hostname_lookup_delayed_with_IPv6

@joshbenner

I don't think musl (which is used by Alpine) has the single-request resolver option.

@zq-david-wang

I tried the following change, and it seems to work (tested on my cluster and pushed to davidzqwang/alpine-dns:3.7):

diff --git a/src/network/lookup_name.c b/src/network/lookup_name.c
index 209c20f..abb7da5 100644
--- a/src/network/lookup_name.c
+++ b/src/network/lookup_name.c
@@ -202,7 +202,7 @@ static int name_from_dns_search(struct address buf[static MAXADDRS], char canon[
                        memcpy(canon+l+1, p, z-p);
                        canon[z-p+1+l] = 0;
                        int cnt = name_from_dns(buf, canon, canon, family, &conf);
-                       if (cnt) return cnt;
+                       if (cnt > 0 || cnt == EAI_AGAIN) return cnt;
                }
        }

@runephilosof

I have tested 3.6, 3.7, and edge, and all are affected by https://bugs.busybox.net/show_bug.cgi?id=675.
Alpine 3.7 and edge use BusyBox v1.27.2 (2017-12-12 10:41:50 GMT), but if I pull busybox:1.27.2 and test nslookup there, it doesn't have the error.
So I am not sure that just upgrading BusyBox will fix the issue.
The BusyBox bug report hints that the libc in use influences the problem.

@krikri90

fetch http://mirror.ps.kz/alpine/v3.8/main/x86_64/APKINDEX.tar.gz
ERROR: http://mirror.ps.kz/alpine/v3.8/main: DNS lookup error
WARNING: Ignoring APKINDEX.1b054110.tar.gz: No such file or directory
fetch http://mirror.ps.kz/alpine/v3.8/community/x86_64/APKINDEX.tar.gz
ERROR: http://mirror.ps.kz/alpine/v3.8/community: DNS lookup error
WARNING: Ignoring APKINDEX.ce38122e.tar.gz: No such file or directory

I'm getting the above error. How can I fix it?

@sadok-f

sadok-f commented Aug 22, 2018

Hi,

We're running a couple of Docker containers on AWS EC2; the images are based on Alpine 3.7.
DNS resolution is very slow. Here is an example:

time nslookup google.com
nslookup: can't resolve '(null)': Name does not resolve

Name:      google.com
Address 1: 216.58.207.174 muc11s04-in-f14.1e100.net
Address 2: 2a00:1450:4016:80a::200e muc11s12-in-x0e.1e100.net
real    0m 2.53s
user    0m 0.00s
sys     0m 0.00s

Another test, with curl:

time curl https://packagist.org/packages/list.json?vendor=composer  --output list.json
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   174    0   174    0     0     58      0 --:--:--  0:00:03 --:--:--    48
real    0m 3.61s
user    0m 0.01s
sys 0m 0.00s

Interestingly, if we pass curl the -4 option to force IPv4 resolution, the result is much faster, as it should be:

time curl -4 https://packagist.org/packages/list.json?vendor=composer  --output list.json
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   174    0   174    0     0    174      0 --:--:-- --:--:-- --:--:--  1359
real    0m 0.13s
user    0m 0.01s
sys 0m 0.00s

There's a workaround proposed here: #313 (comment)

Is there a release coming soon to fix this?
Thanks

@bboreham

FYI @brb has found some kernel race conditions that relate to this symptom. See https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts for the technical details.

@zhouqiang-cl

I found that if I install bind-tools, everything works:
RUN apk add bind-tools

@sebastianfuss

sebastianfuss commented Aug 31, 2018

@zhouqiang-cl
Unfortunately, RUN apk add bind-tools does not solve my name resolution problems. I am running a container with Alpine 3.8 on AWS Fargate and I am getting errors when resolving hostnames.

EDIT:
I moved to Debian stretch slim as well, and my DNS problems seem to be solved.

@jurgenweber

I have converted a few images to Debian Jessie/Stretch slim and my DNS issues went away. Kubernetes 1.9.7 using kops in AWS. This has been bothering us for a long while.

@based64god

I too am seeing musl DNS failures on a bare-metal Kubernetes cluster. The hosts in the cluster are all Ubuntu 18.04 machines using systemd-resolved for local DNS, and I can reproduce the issue @sadok-f is having. This is a Kubernetes 1.11.3 cluster (set up with kubeadm 1.11.3 and the Weave CNI), CoreDNS 1.1.3, systemd 237 on the host. Swapping the images out for Debian stretch slim fixes the issue.

@jstoja

jstoja commented Sep 19, 2018

@zhouqiang-cl @sebastianfuss Installing bind-tools just seems to pull in a statically built binary with its own resolver; it only fixes the nslookup command, not the underlying issue.
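
A quick way to separate the tool from the libc (a sketch; getent comes from musl-utils, which may need installing):

apk add --no-cache bind-tools musl-utils
time nslookup example.com         # bind-tools' nslookup, own resolver
time getent hosts example.com     # goes through musl's getaddrinfo()

If the first is fast while the second still stalls, the problem is in musl's resolver path, not in the lookup tool.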

captn3m0 added a commit to captn3m0/rss-bridge that referenced this issue May 26, 2019
Switches away from the unofficial Alpine+php image
to the official php-apache image. This has 2 advantages:

1. Official image is guaranteed to have regular updates etc
2. The persistent Docker Alpine DNS Issue goes away; gliderlabs/docker-alpine#255
logmanoriginal pushed a commit to RSS-Bridge/rss-bridge that referenced this issue Jun 1, 2019
* Switch Docker Image to official php base image

Switch from the unofficial Alpine+php image to the official php-apache image.
This has 2 advantages:

1. Official image is guaranteed to have regular updates, etc
2. The persistent Docker Alpine DNS Issue goes away;
gliderlabs/docker-alpine#255

* [Docker] Ignore more files from Docker Image
@XVilka

XVilka commented Oct 31, 2019

Are there any updates on this issue?


@tomwidmer

Is this resolved by the musl upgrade in https://www.alpinelinux.org/posts/Alpine-3.18.0-released.html?
