Upstream i/o timeouts with mesos slave authentication (and docker registry) #297

Closed
mcoffin opened this issue Oct 2, 2015 · 28 comments

mcoffin commented Oct 2, 2015

Overview

  1. I'm currently using v0.3.0, but this issue appeared somewhere between v0.1.2 and v0.2.0
  2. I am running this on Ubuntu 15.04 (amd64) on EC2. I have tried running mesos-dns both from the binaries on the GitHub releases page and from a Docker container. Docker version is 1.8.1.
  3. I'm running mesos-dns on 3 mesos masters. When a new slave is spun up, its resolv.conf nameservers are replaced with the IPs of the 3 mesos masters. When a slave tries to authenticate with these masters, timeouts are logged by mesos-dns and the slave fails to authenticate (as the timeouts in mesos-dns also cause the authentication to time out). As soon as I kill mesos-dns, the slave and master fall back to using each other's IPs rather than hostname resolution and connect successfully. This issue was not observed with mesos 0.23.0 and mesos-dns 0.1.2 under an identical configuration.
  4. I expected to see the hostnames and other DNS queries successfully execute without timeouts.
  5. Instead, we see that some DNS queries, but not others, cause upstream i/o timeouts. This appears to only happen with recursive queries, and only a consistent subset of them.

Original Post

I'm not sure what's unique about the DNS resolution that mesos slave authentication and the docker registry perform, but somewhere between 0.1.2 and 0.2.0 (and still present in 0.3.0), something broke that creates upstream timeouts from these requests and produces logs like the following:

Oct 02 20:43:58 ip-10-0-0-254 docker[5812]: ERROR: 2015/10/02 20:43:58 resolver.go:281: failed external DNS lookup of "registry-1.docker.io.us-west-2.compute.internal.": read udp 10.0.0.254:34856->172.16.0.23:5
Oct 02 20:43:58 ip-10-0-0-254 docker[5812]: ERROR: 2015/10/02 20:43:58 resolver.go:280: registry-1.docker.io.us-west-2.compute.internal.
Oct 02 20:43:58 ip-10-0-0-254 docker[5812]: ERROR: 2015/10/02 20:43:58 exchanger.go:64: error exchanging with "10.0.0.2:53": read udp 10.0.0.254:34856->172.16.0.23:53: i/o timeout
Oct 02 20:43:51 ip-10-0-0-254 docker[5812]: ERROR: 2015/10/02 20:43:51 resolver.go:281: failed external DNS lookup of "registry-1.docker.io.us-west-2.compute.internal.": read udp 10.0.0.254:44142->172.16.0.23:5
Oct 02 20:43:51 ip-10-0-0-254 docker[5812]: ERROR: 2015/10/02 20:43:51 resolver.go:280: registry-1.docker.io.us-west-2.compute.internal.
Oct 02 20:43:51 ip-10-0-0-254 docker[5812]: ERROR: 2015/10/02 20:43:51 exchanger.go:64: error exchanging with "10.0.0.2:53": read udp 10.0.0.254:44142->172.16.0.23:53: i/o timeout
Oct 02 20:43:45 ip-10-0-0-254 docker[5812]: ERROR: 2015/10/02 20:43:45 resolver.go:281: failed external DNS lookup of "registry-1.docker.io.us-west-2.compute.internal.": read udp 10.0.0.254:52317->172.16.0.23:5
Oct 02 20:43:45 ip-10-0-0-254 docker[5812]: ERROR: 2015/10/02 20:43:45 resolver.go:280: registry-1.docker.io.us-west-2.compute.internal.
Oct 02 20:43:45 ip-10-0-0-254 docker[5812]: ERROR: 2015/10/02 20:43:45 exchanger.go:64: error exchanging with "10.0.0.2:53": read udp 10.0.0.254:52317->172.16.0.23:53: i/o timeout
Oct 02 20:43:38 ip-10-0-0-254 docker[5812]: ERROR: 2015/10/02 20:43:38 resolver.go:281: failed external DNS lookup of "dseasb33srnrn.cloudfront.net.us-west-2.compute.internal.": read udp 10.0.0.254:57088->172.1
Oct 02 20:43:38 ip-10-0-0-254 docker[5812]: ERROR: 2015/10/02 20:43:38 resolver.go:280: dseasb33srnrn.cloudfront.net.us-west-2.compute.internal.
Oct 02 20:43:38 ip-10-0-0-254 docker[5812]: ERROR: 2015/10/02 20:43:38 exchanger.go:64: error exchanging with "10.0.0.2:53": read udp 10.0.0.254:57088->172.16.0.23:53: i/o timeout
Oct 02 20:43:32 ip-10-0-0-254 docker[5812]: ERROR: 2015/10/02 20:43:32 resolver.go:281: failed external DNS lookup of "dseasb33srnrn.cloudfront.net.us-west-2.compute.internal.": read udp 10.0.0.254:55689->172.1
Oct 02 20:43:32 ip-10-0-0-254 docker[5812]: ERROR: 2015/10/02 20:43:32 resolver.go:280: dseasb33srnrn.cloudfront.net.us-west-2.compute.internal.
Oct 02 20:43:32 ip-10-0-0-254 docker[5812]: ERROR: 2015/10/02 20:43:32 exchanger.go:64: error exchanging with "10.0.0.2:53": read udp 10.0.0.254:55689->172.16.0.23:53: i/o timeout
Oct 02 20:43:25 ip-10-0-0-254 docker[5812]: ERROR: 2015/10/02 20:43:25 resolver.go:281: failed external DNS lookup of "registry-1.docker.io.us-west-2.compute.internal.": read udp 10.0.0.254:38077->172.16.0.23:5
Oct 02 20:43:25 ip-10-0-0-254 docker[5812]: ERROR: 2015/10/02 20:43:25 resolver.go:280: registry-1.docker.io.us-west-2.compute.internal.

I no longer have the logs of the problem occurring during slave authentication, but with mesos-dns disabled, the slave is able to join properly. The logs look similar to those above, except the hostname is ip-10-0-x-xxx.us-west-2.compute.internal instead of the docker-related one. These i/o timeouts are blocking me from upgrading to mesos 0.24.1, because I would need to upgrade past 0.1.2 of mesos-dns to do so.

I'd have a go at fixing it myself, but I can't find what's unique about the DNS queries that are made by mesos and docker here. If I try to resolve registry-1.docker.io over mesos-dns via dig, everything works as expected. Same goes for the ip-10-0-x-xxx.us-west-2.compute.internal hostnames during slave authentication.

If you're here via Google and only care about the slave authentication, a good workaround is to turn down the timeout in mesos-dns to ~2 seconds (something less than 5 seconds), because slave authentication times out after 5 seconds. If the mesos-dns resolution fails after 2 seconds, the slave still has 3 seconds to fall back to a slave IP and connect that way, rather than just waiting on the hopeless DNS result until the authentication itself times out. This is honestly probably a "bug" in slave authentication as well, since it should separate the timeout for hostname resolution from the timeout for the actual authentication process, but this is a minimal workaround for the time being.
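
For reference, the relevant field in the mesos-dns config.json would look something like this (everything else in your config stays as-is):

{
  "timeout": 2
}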


mcoffin commented Oct 2, 2015

Adding -v=10 didn't provide much insight, but here is a complete capture of the logs when one tries to pull a Docker image from any machine that is using mesos-dns as its primary DNS server.

VERY VERBOSE: 2015/10/02 21:41:50 logging.go:75: {MesosRequests:66 MesosSuccess:66 MesosNXDomain:0 MesosFailed:0 NonMesosRequests:87 NonMesosSuccess:21 NonMesosNXDomain:30 NonMesosFailed:36 NonMesosRecursed:87}
< ... > truncated VERY_VERBOSE listing of .mesos records for privacy < ... >
ERROR: 2015/10/02 21:41:49 resolver.go:281: failed external DNS lookup of "registry-1.docker.io.us-west-2.compute.internal.": read udp 10.0.0.254:56027->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:41:49 resolver.go:280: registry-1.docker.io.us-west-2.compute.internal.
ERROR: 2015/10/02 21:41:49 exchanger.go:64: error exchanging with "10.0.0.2:53": read udp 10.0.0.254:56027->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:41:43 resolver.go:281: failed external DNS lookup of "registry-1.docker.io.us-west-2.compute.internal.": read udp 10.0.0.254:33661->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:41:43 resolver.go:280: registry-1.docker.io.us-west-2.compute.internal.
ERROR: 2015/10/02 21:41:43 exchanger.go:64: error exchanging with "10.0.0.2:53": read udp 10.0.0.254:33661->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:41:36 resolver.go:281: failed external DNS lookup of "registry-1.docker.io.us-west-2.compute.internal.": read udp 10.0.0.254:60410->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:41:36 resolver.go:280: registry-1.docker.io.us-west-2.compute.internal.
ERROR: 2015/10/02 21:41:36 exchanger.go:64: error exchanging with "10.0.0.2:53": read udp 10.0.0.254:60410->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:41:30 resolver.go:281: failed external DNS lookup of "registry-1.docker.io.us-west-2.compute.internal.": read udp 10.0.0.254:46773->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:41:30 resolver.go:280: registry-1.docker.io.us-west-2.compute.internal.
ERROR: 2015/10/02 21:41:30 exchanger.go:64: error exchanging with "10.0.0.2:53": read udp 10.0.0.254:46773->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:41:18 resolver.go:281: failed external DNS lookup of "dseasb33srnrn.cloudfront.net.us-west-2.compute.internal.": read udp 10.0.0.254:48839->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:41:18 resolver.go:280: dseasb33srnrn.cloudfront.net.us-west-2.compute.internal.
ERROR: 2015/10/02 21:41:18 exchanger.go:64: error exchanging with "10.0.0.2:53": read udp 10.0.0.254:48839->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:41:12 resolver.go:281: failed external DNS lookup of "dseasb33srnrn.cloudfront.net.us-west-2.compute.internal.": read udp 10.0.0.254:37968->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:41:12 resolver.go:280: dseasb33srnrn.cloudfront.net.us-west-2.compute.internal.
ERROR: 2015/10/02 21:41:12 exchanger.go:64: error exchanging with "10.0.0.2:53": read udp 10.0.0.254:37968->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:41:05 resolver.go:281: failed external DNS lookup of "registry-1.docker.io.us-west-2.compute.internal.": read udp 10.0.0.254:53052->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:41:05 resolver.go:280: registry-1.docker.io.us-west-2.compute.internal.
ERROR: 2015/10/02 21:41:05 exchanger.go:64: error exchanging with "10.0.0.2:53": read udp 10.0.0.254:53052->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:40:59 resolver.go:281: failed external DNS lookup of "registry-1.docker.io.us-west-2.compute.internal.": read udp 10.0.0.254:51908->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:40:59 resolver.go:280: registry-1.docker.io.us-west-2.compute.internal.
ERROR: 2015/10/02 21:40:59 exchanger.go:64: error exchanging with "10.0.0.2:53": read udp 10.0.0.254:51908->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:40:53 resolver.go:281: failed external DNS lookup of "registry-1.docker.io.us-west-2.compute.internal.": read udp 10.0.0.254:39560->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:40:53 resolver.go:280: registry-1.docker.io.us-west-2.compute.internal.
ERROR: 2015/10/02 21:40:53 exchanger.go:64: error exchanging with "10.0.0.2:53": read udp 10.0.0.254:39560->172.16.0.23:53: i/o timeout


mcoffin commented Oct 2, 2015

And here's a log during a slave authentication failure.

ERROR: 2015/10/02 21:38:37 exchanger.go:64: error exchanging with "10.0.0.2:53": read udp 10.0.0.254:39633->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:38:35 resolver.go:281: failed external DNS lookup of "ip-10-0-1-110.us-west-2.compute.internal.": read udp 10.0.0.254:41226->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:38:35 resolver.go:280: ip-10-0-1-110.us-west-2.compute.internal.
ERROR: 2015/10/02 21:38:35 exchanger.go:64: error exchanging with "10.0.0.2:53": read udp 10.0.0.254:41226->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:38:20 resolver.go:281: failed external DNS lookup of "ip-10-0-1-110.us-west-2.compute.internal.": read udp 10.0.0.254:43413->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:38:20 resolver.go:280: ip-10-0-1-110.us-west-2.compute.internal.
ERROR: 2015/10/02 21:38:20 exchanger.go:64: error exchanging with "10.0.0.2:53": read udp 10.0.0.254:43413->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:38:14 resolver.go:281: failed external DNS lookup of "ip-10-0-1-110.us-west-2.compute.internal.": read udp 10.0.0.254:46040->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:38:14 resolver.go:280: ip-10-0-1-110.us-west-2.compute.internal.
ERROR: 2015/10/02 21:38:14 exchanger.go:64: error exchanging with "10.0.0.2:53": read udp 10.0.0.254:46040->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:38:02 resolver.go:281: failed external DNS lookup of "ip-10-0-1-110.us-west-2.compute.internal.": read udp 10.0.0.254:48911->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:38:02 resolver.go:280: ip-10-0-1-110.us-west-2.compute.internal.
ERROR: 2015/10/02 21:38:02 exchanger.go:64: error exchanging with "10.0.0.2:53": read udp 10.0.0.254:48911->172.16.0.23:53: i/o timeout

Keep in mind that the following is the output when I dig for the same hostname against mesos-dns:

$ dig @localhost ip-10-0-1-110.us-west-2.compute.internal

; <<>> DiG 9.9.5-9ubuntu0.3-Ubuntu <<>> @localhost ip-10-0-1-110.us-west-2.compute.internal
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 26275
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;ip-10-0-1-110.us-west-2.compute.internal. IN A

;; ANSWER SECTION:
ip-10-0-1-110.us-west-2.compute.internal. 70 IN A 10.0.1.110

;; Query time: 1 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Fri Oct 02 21:47:30 UTC 2015
;; MSG SIZE  rcvd: 85

And here's the output when I hit the upstream DNS server directly:

$ dig @10.0.0.2 ip-10-0-1-110.us-west-2.compute.internal

; <<>> DiG 9.9.5-9ubuntu0.3-Ubuntu <<>> @10.0.0.2 ip-10-0-1-110.us-west-2.compute.internal
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 53915
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;ip-10-0-1-110.us-west-2.compute.internal. IN A

;; ANSWER SECTION:
ip-10-0-1-110.us-west-2.compute.internal. 573 IN A 10.0.1.110

;; Query time: 0 msec
;; SERVER: 10.0.0.2#53(10.0.0.2)
;; WHEN: Fri Oct 02 21:54:09 UTC 2015
;; MSG SIZE  rcvd: 85


tsenart commented Oct 2, 2015

Hi @mcoffin,

At first sight all I can see is that recursive queries are timing out. Have you tried raising the configured timeout?

Also, to make it easier to debug, would you please adjust this issue according to our contribution guidelines?

Thanks!


tsenart commented Oct 5, 2015

Something else I noticed: it seems all your names have a search domain appended to them, e.g. registry-1.docker.io.us-west-2.compute.internal.

What does your /etc/resolv.conf look like?


mcoffin commented Oct 5, 2015

I'll reformat the issue in a second. /etc/resolv.conf has the line search us-west-2.compute.internal, which is necessary for ip-10-0-x-xxx to be resolved as ip-10-0-x-xxx.us-west-2.compute.internal. I assume that isn't a problem, since it's pretty much straight out of the /etc/resolv.conf that comes on a newly created EC2 instance in that region.

Thanks in advance for taking a look at this.

EDIT: It's also interesting to me that only some of the recursive calls fail, but it is very consistent about which ones.

EDIT2: Added an overview section to the top level post to give a quick overview in the contrib guidelines format.


mcoffin commented Oct 5, 2015

@tsenart Any idea what's special about these specific queries, so I at least have an entry point into making a fix for this? Sorry for the persistence, but I'm blocked on an upgrade to mesos 0.24.1, which will require upgrading mesos-dns to a version that has this issue with my config, so I'm kind of bottlenecked here. I really appreciate your help.


tsenart commented Oct 5, 2015

Can you share your Mesos-DNS config? I probably won't be able to look at this tonight, but it'll be my priority tomorrow.


mcoffin commented Oct 5, 2015

@tsenart Thanks!

My config is below (in YAML; my Chef cookbook converts the mesos-dns/config key to a .json config file when it runs).

mesos_dns:
  config:
    zk: ${master_zk_string}/mesos
    masters: []
    refreshSeconds: 60
    ttl: 60
    domain: mesos
    ns: ns1
    port: 53
    resolvers:
      - 10.0.0.2
    timeout: 2
    listener: 0.0.0.0
    SOAMname: root.ns1.mesos
    SOARname: ns1.mesos
    SOARefresh: 60
    SOARetry: 600
    SOAExpire: 86400
    SOAMinttl: 60
    dnson: true
    httpon: true
    httpport: 8123
    externalon: true
    recurseon: false
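
For reference, the rendered config.json that mesos-dns actually reads should come out roughly like this (the zk value is templated by Chef and left as the placeholder here):

{
  "zk": "${master_zk_string}/mesos",
  "masters": [],
  "refreshSeconds": 60,
  "ttl": 60,
  "domain": "mesos",
  "ns": "ns1",
  "port": 53,
  "resolvers": ["10.0.0.2"],
  "timeout": 2,
  "listener": "0.0.0.0",
  "SOAMname": "root.ns1.mesos",
  "SOARname": "ns1.mesos",
  "SOARefresh": 60,
  "SOARetry": 600,
  "SOAExpire": 86400,
  "SOAMinttl": 60,
  "dnson": true,
  "httpon": true,
  "httpport": 8123,
  "externalon": true,
  "recurseon": false
}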


mcoffin commented Oct 5, 2015

New Info

This problem only occurs when using the upstream DNS server that comes with an EC2 VPC. It goes away when using 8.8.8.8 or 8.8.4.4 as resolvers. Unfortunately, the VPC DNS server is also what resolves the internal EC2 hostnames, so I can't switch over. Just more information.


tsenart commented Oct 5, 2015

@mcoffin: I'm having trouble debugging this without more information. I'm confused as to why the logs have mismatching IPs. What's the IP of the machine you're running on? 10.0.0.254? If so, I don't understand why the packet is going to 172.16.0.23.

ERROR: 2015/10/02 21:38:37 exchanger.go:64: error exchanging with "10.0.0.2:53": read udp 10.0.0.254:39633->172.16.0.23:53: i/o timeout


tsenart commented Oct 5, 2015

@mcoffin: Wanna hop on #mesos-dns on Freenode?


mcoffin commented Oct 5, 2015

@tsenart Hopping in.

For people following along in the GH issue, mesos-dns is running on 10.0.0.254. I have no clue where the 172.16.xxx.xxx IP is coming from.


tsenart commented Oct 5, 2015

@mcoffin: I'm there.


mcoffin commented Oct 5, 2015

@tsenart It's not showing you in #mesos-dns, and I can't query your nick either.


tsenart commented Oct 5, 2015

After a long IRC session, I tracked the bug to here: https://github.com/mesosphere/mesos-dns/blob/master/exchanger/exchanger.go#L91

This was introduced in the refactoring of the resolver. I'll fix this first thing tomorrow.


tsenart commented Oct 6, 2015

@mcoffin: It turns out this issue isn't explained by the bug I discovered yesterday. Thorough analysis of the traffic dump you provided reveals a problematic flow (sketched in code after the list):

  1. 10.0.2.253 asks 10.0.0.254 (Mesos-DNS) for ip-10-0-2-253.us-west-2.compute.internal IN AAAA
  2. 10.0.0.254 forwards the query to its configured resolver 10.0.0.2
  3. 10.0.0.2 replies with no answer, no error and an SOA record in the authoritative zone.
  4. 10.0.0.254 recurses the original query to the nameserver found in the SOA record from the previous reply
  5. In order to do so, 10.0.0.254 asks 10.0.0.2 for ns0.us-west-2.compute.internal IN A
  6. 10.0.0.2 replies with 172.16.0.23
  7. 10.0.0.254 asks 172.16.0.23 for ip-10-0-2-253.us-west-2.compute.internal IN AAAA
  8. The previous query times out because 172.16.0.23 isn't reachable.
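
To make that concrete, here is a rough sketch of steps 1-5 in Go, using the github.com/miekg/dns library that Mesos-DNS is built on. This is not the actual Mesos-DNS code, just an illustration using the addresses from this thread:

package main

import (
	"fmt"

	"github.com/miekg/dns"
)

func main() {
	c := new(dns.Client)

	// Steps 1-2: the original AAAA query is sent to the configured
	// resolver, the VPC DNS server at 10.0.0.2.
	q := new(dns.Msg)
	q.SetQuestion("ip-10-0-2-253.us-west-2.compute.internal.", dns.TypeAAAA)
	reply, _, err := c.Exchange(q, "10.0.0.2:53")
	if err != nil {
		fmt.Println("exchange error:", err)
		return
	}

	// Step 3: the reply has no answer, no error, and an SOA record in the
	// authority section.
	if len(reply.Answer) == 0 {
		for _, rr := range reply.Ns {
			if soa, ok := rr.(*dns.SOA); ok {
				// Steps 4-8: the recursing logic then resolves the SOA's
				// primary nameserver (ns0.us-west-2.compute.internal, which
				// comes back as 172.16.0.23) and retries the original query
				// against it; that retried query is the one that times out.
				fmt.Println("recursion would continue at:", soa.Ns)
			}
		}
	}
}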

I opened #301 to find out why Mesos-DNS is doing recursion instead of just forwarding. However, this doesn't explain your perceived change of behaviour since v0.1.1.

Here are two questions that can help us make progress:

  1. Why are AAAA queries being issued and from where?
  2. Why do you need to configure the Mesos-DNS external resolver with the Amazon DNS server IP when it is already in /etc/resolv.conf for your processes to use?

tsenart pushed a commit that referenced this issue Oct 6, 2015
While debugging #297 and #298, it became apparent that this extra
information would be very useful in aiding understanding of the issue.

mcoffin commented Oct 6, 2015

@tsenart

Here are two questions that can help us make progress here

  1. I'm really not sure why these are being sent. I'll look more into it, but I'm not familiar with the docker code base at all, my familiarity with the mesos codebase is minimal, and my knowledge of the slave authentication system is limited to what I've learned while debugging this issue.
  2. This was being included from a legacy configuration. When I removed the "resolvers": ["10.0.0.2"] key from the JSON configuration, I could no longer resolve internal hostnames via dig, despite my /etc/resolv.conf being:
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
#     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 10.0.0.2
search us-west-2.compute.internal


mcoffin commented Oct 6, 2015

@tsenart Let me see if I'm understanding all this correctly. As of now, these are the unexpected behaviors we're looking for explanations for. Tell me if I'm missing any, or if some of these are already explained and I just don't understand them.

Problems

  1. resolvers has to be specified manually in the mesos-dns JSON configuration despite being in /etc/resolv.conf.
  2. Mesos-DNS is doing recursion instead of forwarding despite the recurseon flag in the config being false.
  3. Amazon's VPC DNS server is giving us an SOA record for ip-10-0-0-250.us-west-2.compute.internal instead of just resolving it.
  4. Amazon's VPC DNS server redirects us to 172.16.0.23, which is unreachable from the VPC, whereas we'd expect to only be referred to accessible servers. (This might be fixable by enabling EC2-ClassicLink for the VPC, since 172.16.0.23 is the default DNS server for EC2-Classic from before the days of VPCs, but there is no reason we should have to do that).
  5. AAAA records are being requested via queries from docker / mesos slave authentication whereas we don't expect that to happen.
  6. This problem began occurring between v0.1.2 and v0.2.0, whereas we would expect this to be a problem for all versions (due to the unreachability of 172.16.0.23).

Other notes

I'm working on a patch to mesos' slave authentication to improve the timeout code to differentiate between timeouts in hostname resolution and timeouts of the actual slave authentication process. While this is only a workaround for this problem with mesos-dns, it is the "correct" way to handle the situation upstream in mesos, as currently slave authentication doesn't fall back on the slave IP if the hostname resolution timeout is longer than the slave authentication process timeout.


tsenart commented Oct 6, 2015

  1. resolvers has to be specified manually in the mesos-dns JSON configuration despite being in /etc/resolv.conf.

That is correct.

  2. Mesos-DNS is doing recursion instead of forwarding despite the recurseon flag in the config being false.

The recurseon flag is poorly named. It actually has nothing to do with enabling recursion for external queries. https://mesosphere.github.io/mesos-dns/docs/configuration-parameters.html

  3. Amazon's VPC DNS server is giving us an SOA record for ip-10-0-0-250.us-west-2.compute.internal instead of just resolving it.

Yes, but only for AAAA type queries, which it doesn't seem to support.
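
If you want to see it for yourself, a query along these lines against the VPC resolver should come back with an empty answer section and the SOA in the authority section:

$ dig @10.0.0.2 ip-10-0-1-110.us-west-2.compute.internal AAAA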

  4. Amazon's VPC DNS server redirects us to 172.16.0.23, which is unreachable from the VPC, whereas we'd expect to only be referred to accessible servers. (This might be fixable by enabling EC2-ClassicLink for the VPC, since 172.16.0.23 is the default DNS server for EC2-Classic from before the days of VPCs, but there is no reason we should have to do that).

It doesn't redirect us, but due to the current recursing logic in Mesos-DNS, we retry the same query against that DNS server.

  5. AAAA records are being requested via queries from docker / mesos slave authentication whereas we don't expect that to happen.

I don't know where they're coming from. You should track that down.

  6. This problem began occurring between v0.1.2 and v0.2.0, whereas we would expect this to be a problem for all versions (due to the unreachability of 172.16.0.23).

I'm assuming the problem already happened before, but there weren't nearly as many error logs. You see, A type queries are still being resolved correctly, as far as I understand it.

With #307, I'm fairly confident this whole issue will go away.


tsenart commented Oct 7, 2015

Can you try running https://github.com/mesosphere/mesos-dns/releases/tag/v0.4.0-pre and see how it behaves?


lnxmad commented Oct 8, 2015

I'm also getting this same behavior.

Oct  7 19:55:50 mesos-slave01 mesos-dns[16618]: ERROR: 2015/10/07 19:55:50 resolver.go:281: failed external DNS lookup of "g-ns-717.awsdns-11.org.": dial udp: i/o timeout
Oct  7 19:55:50 mesos-slave01 mesos-dns[16618]: ERROR: 2015/10/07 19:55:50 exchanger.go:64: error exchanging with "8.8.8.8:53": dial udp: i/o timeout
Oct  7 19:55:50 mesos-slave01 mesos-dns[16618]: ERROR: 2015/10/07 19:55:50 resolver.go:280: g-ns-717.awsdns-11.org.
Oct  7 19:55:50 mesos-slave01 mesos-dns[16618]: ERROR: 2015/10/07 19:55:50 resolver.go:281: failed external DNS lookup of "g-ns-717.awsdns-11.org.": dial udp: i/o timeout
Oct  7 19:55:50 mesos-slave01 mesos-dns[16618]: ERROR: 2015/10/07 19:55:50 exchanger.go:64: error exchanging with "8.8.8.8:53": read udp *internalip*:53253->8.8.8.8:53: i/o timeout
Oct  7 19:55:50 mesos-slave01 mesos-dns[16618]: ERROR: 2015/10/07 19:55:50 resolver.go:280: g-ns-717.awsdns-11.org.

This is filling up gigabytes of logs.


tsenart commented Oct 8, 2015

@lnxmad, @mcoffin: Can't you set up log rotation?
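
If it helps, here's a minimal logrotate sketch, assuming mesos-dns output is redirected to /var/log/mesos-dns.log (adjust the path to wherever your logs actually end up):

/var/log/mesos-dns.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    copytruncate
}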


tsenart commented Oct 9, 2015

Anything to report?


lnxmad commented Oct 10, 2015

The 0.4.0-pre release no longer logs these errors. Was logging suppressed, or has the issue been addressed?


tsenart commented Oct 10, 2015

@lnxmad: Network errors will still be logged, but since external queries are now transparently forwarded and not recursed, the above condition of trying to reach an unreachable server isn't happening anymore. Please note that we've also released v0.4.0 since, so you should use that instead of the pre-release.
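
Roughly speaking (this is a sketch, not the actual Mesos-DNS code), forwarding just relays the query to each configured resolver in turn and returns the first successful reply unchanged, instead of chasing SOA records itself:

package sketch

import "github.com/miekg/dns"

// Forward is a sketch of "transparent forwarding": try each configured
// resolver in order and return the first successful reply as-is.
func Forward(c *dns.Client, req *dns.Msg, resolvers []string) (*dns.Msg, error) {
	var lastErr error
	for _, addr := range resolvers {
		reply, _, err := c.Exchange(req, addr)
		if err == nil {
			return reply, nil
		}
		lastErr = err
	}
	return nil, lastErr
}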


lnxmad commented Oct 10, 2015

Thanks, I'll make sure to upgrade. I may use mesos-dns as a forward lookup zone with my existing DNS server. This approach seems to make more sense in my development configuration.


mcoffin commented Oct 12, 2015

Sorry for the absence; we had an incredibly busy end of the week. 0.4.0 has fixed the issue on my end, and I'm no longer seeing the strange timeouts. Thanks a bunch for the hard work, @tsenart.

I did a tcpdump as well and took a look at it just to make sure. Everything seemingly looks good now.
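
If anyone wants to do the same check, something along these lines captures the DNS traffic for inspection (adjust the interface name for your instance):

$ sudo tcpdump -ni eth0 -w dns.pcap udp port 53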


tsenart commented Oct 12, 2015

Great! Closing this (long) issue :-)
