Upstream i/o timeouts with mesos slave authentication (and docker registry) #297

Closed
mcoffin opened this issue Oct 2, 2015 · 28 comments

mcoffin commented Oct 2, 2015

Overview

  1. I'm currently using v0.3.0, but this issue appeared somewhere between v0.1.2 and v0.2.0
  2. I am running this on Ubuntu 15.04 (amd64) on EC2. I have tried running mesos-dns both from the binaries on the GitHub releases page and from a Docker container. Docker version is 1.8.1.
  3. I'm running mesos-dns on 3 mesos masters. When a new slave is spun up, its resolv.conf nameservers are replaced with the IPs of the 3 mesos masters. When a slave tries to authenticate with these masters, timeouts are logged by mesos-dns and the slave fails to authenticate (as the timeouts in mesos-dns also cause the authentication to time out). As soon as I kill mesos-dns, the slave and master fall back to using each other's IPs rather than hostname resolution and connect successfully. This issue was not observed with mesos 0.23.0 and mesos-dns 0.1.2 under an identical configuration.
  4. I expected to see the hostnames and other DNS queries successfully execute without timeouts.
  5. Instead, we see that some DNS queries, but not others, cause upstream i/o timeouts. This appears to only happen with recursive queries, and only a consistent subset of them.

Original Post

I'm not sure what's unique about the DNS resolution that mesos slave authentication and the docker registry perform, but somewhere between 0.1.2 and 0.2.0 (and still present in 0.3.0), something broke that creates upstream timeouts from these requests and produces logs like the following:

Oct 02 20:43:58 ip-10-0-0-254 docker[5812]: ERROR: 2015/10/02 20:43:58 resolver.go:281: failed external DNS lookup of "registry-1.docker.io.us-west-2.compute.internal.": read udp 10.0.0.254:34856->172.16.0.23:5
Oct 02 20:43:58 ip-10-0-0-254 docker[5812]: ERROR: 2015/10/02 20:43:58 resolver.go:280: registry-1.docker.io.us-west-2.compute.internal.
Oct 02 20:43:58 ip-10-0-0-254 docker[5812]: ERROR: 2015/10/02 20:43:58 exchanger.go:64: error exchanging with "10.0.0.2:53": read udp 10.0.0.254:34856->172.16.0.23:53: i/o timeout
Oct 02 20:43:51 ip-10-0-0-254 docker[5812]: ERROR: 2015/10/02 20:43:51 resolver.go:281: failed external DNS lookup of "registry-1.docker.io.us-west-2.compute.internal.": read udp 10.0.0.254:44142->172.16.0.23:5
Oct 02 20:43:51 ip-10-0-0-254 docker[5812]: ERROR: 2015/10/02 20:43:51 resolver.go:280: registry-1.docker.io.us-west-2.compute.internal.
Oct 02 20:43:51 ip-10-0-0-254 docker[5812]: ERROR: 2015/10/02 20:43:51 exchanger.go:64: error exchanging with "10.0.0.2:53": read udp 10.0.0.254:44142->172.16.0.23:53: i/o timeout
Oct 02 20:43:45 ip-10-0-0-254 docker[5812]: ERROR: 2015/10/02 20:43:45 resolver.go:281: failed external DNS lookup of "registry-1.docker.io.us-west-2.compute.internal.": read udp 10.0.0.254:52317->172.16.0.23:5
Oct 02 20:43:45 ip-10-0-0-254 docker[5812]: ERROR: 2015/10/02 20:43:45 resolver.go:280: registry-1.docker.io.us-west-2.compute.internal.
Oct 02 20:43:45 ip-10-0-0-254 docker[5812]: ERROR: 2015/10/02 20:43:45 exchanger.go:64: error exchanging with "10.0.0.2:53": read udp 10.0.0.254:52317->172.16.0.23:53: i/o timeout
Oct 02 20:43:38 ip-10-0-0-254 docker[5812]: ERROR: 2015/10/02 20:43:38 resolver.go:281: failed external DNS lookup of "dseasb33srnrn.cloudfront.net.us-west-2.compute.internal.": read udp 10.0.0.254:57088->172.1
Oct 02 20:43:38 ip-10-0-0-254 docker[5812]: ERROR: 2015/10/02 20:43:38 resolver.go:280: dseasb33srnrn.cloudfront.net.us-west-2.compute.internal.
Oct 02 20:43:38 ip-10-0-0-254 docker[5812]: ERROR: 2015/10/02 20:43:38 exchanger.go:64: error exchanging with "10.0.0.2:53": read udp 10.0.0.254:57088->172.16.0.23:53: i/o timeout
Oct 02 20:43:32 ip-10-0-0-254 docker[5812]: ERROR: 2015/10/02 20:43:32 resolver.go:281: failed external DNS lookup of "dseasb33srnrn.cloudfront.net.us-west-2.compute.internal.": read udp 10.0.0.254:55689->172.1
Oct 02 20:43:32 ip-10-0-0-254 docker[5812]: ERROR: 2015/10/02 20:43:32 resolver.go:280: dseasb33srnrn.cloudfront.net.us-west-2.compute.internal.
Oct 02 20:43:32 ip-10-0-0-254 docker[5812]: ERROR: 2015/10/02 20:43:32 exchanger.go:64: error exchanging with "10.0.0.2:53": read udp 10.0.0.254:55689->172.16.0.23:53: i/o timeout
Oct 02 20:43:25 ip-10-0-0-254 docker[5812]: ERROR: 2015/10/02 20:43:25 resolver.go:281: failed external DNS lookup of "registry-1.docker.io.us-west-2.compute.internal.": read udp 10.0.0.254:38077->172.16.0.23:5
Oct 02 20:43:25 ip-10-0-0-254 docker[5812]: ERROR: 2015/10/02 20:43:25 resolver.go:280: registry-1.docker.io.us-west-2.compute.internal.

I no longer have the logs of the problem occurring during slave authentication, but with mesos-dns disabled, the slave is able to join properly. The logs look similar to those above, except the hostname is ip-10-0-x-xxx.us-west-2.compute.internal instead of the docker-related one. These i/o timeouts are blocking me from upgrading to mesos 0.24.1, because I would need to upgrade past 0.1.2 of mesos-dns to do so.

I'd have a go at fixing it myself, but I can't find what's unique about the DNS queries that are made by mesos and docker here. If I try to resolve registry-1.docker.io over mesos-dns via dig, everything works as expected. Same goes for the ip-10-0-x-xxx.us-west-2.compute.internal hostnames during slave authentication.

If you're here via Google and only care about the slave authentication, a good workaround is to turn down the timeout in mesos-dns to ~2 seconds (something less than 5 seconds), because slave authentication times out after 5 seconds. If the mesos-dns resolution fails after 2 seconds, the slave still has 3 seconds to fall back to a slave IP and connect that way, rather than just waiting on the hopeless DNS result until the authentication itself times out. This is honestly probably a "bug" in slave authentication as well, since it should separate the timeout for hostname resolution from the timeout for the actual authentication process, but this is a minimal workaround for the time being.
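
For reference, the relevant field in the mesos-dns config.json would look something like this (everything else in your config stays as-is):

{
  "timeout": 2
}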


mcoffin commented Oct 2, 2015

Adding -v=10 didn't provide much insight, but here is a complete capture of the logs when one tries to pull a Docker image from any machine that is using mesos-dns as its primary DNS server.

VERY VERBOSE: 2015/10/02 21:41:50 logging.go:75: {MesosRequests:66 MesosSuccess:66 MesosNXDomain:0 MesosFailed:0 NonMesosRequests:87 NonMesosSuccess:21 NonMesosNXDomain:30 NonMesosFailed:36 NonMesosRecursed:87}
< ... > truncated VERY_VERBOSE listing of .mesos records for privacy < ... >
ERROR: 2015/10/02 21:41:49 resolver.go:281: failed external DNS lookup of "registry-1.docker.io.us-west-2.compute.internal.": read udp 10.0.0.254:56027->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:41:49 resolver.go:280: registry-1.docker.io.us-west-2.compute.internal.
ERROR: 2015/10/02 21:41:49 exchanger.go:64: error exchanging with "10.0.0.2:53": read udp 10.0.0.254:56027->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:41:43 resolver.go:281: failed external DNS lookup of "registry-1.docker.io.us-west-2.compute.internal.": read udp 10.0.0.254:33661->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:41:43 resolver.go:280: registry-1.docker.io.us-west-2.compute.internal.
ERROR: 2015/10/02 21:41:43 exchanger.go:64: error exchanging with "10.0.0.2:53": read udp 10.0.0.254:33661->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:41:36 resolver.go:281: failed external DNS lookup of "registry-1.docker.io.us-west-2.compute.internal.": read udp 10.0.0.254:60410->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:41:36 resolver.go:280: registry-1.docker.io.us-west-2.compute.internal.
ERROR: 2015/10/02 21:41:36 exchanger.go:64: error exchanging with "10.0.0.2:53": read udp 10.0.0.254:60410->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:41:30 resolver.go:281: failed external DNS lookup of "registry-1.docker.io.us-west-2.compute.internal.": read udp 10.0.0.254:46773->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:41:30 resolver.go:280: registry-1.docker.io.us-west-2.compute.internal.
ERROR: 2015/10/02 21:41:30 exchanger.go:64: error exchanging with "10.0.0.2:53": read udp 10.0.0.254:46773->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:41:18 resolver.go:281: failed external DNS lookup of "dseasb33srnrn.cloudfront.net.us-west-2.compute.internal.": read udp 10.0.0.254:48839->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:41:18 resolver.go:280: dseasb33srnrn.cloudfront.net.us-west-2.compute.internal.
ERROR: 2015/10/02 21:41:18 exchanger.go:64: error exchanging with "10.0.0.2:53": read udp 10.0.0.254:48839->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:41:12 resolver.go:281: failed external DNS lookup of "dseasb33srnrn.cloudfront.net.us-west-2.compute.internal.": read udp 10.0.0.254:37968->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:41:12 resolver.go:280: dseasb33srnrn.cloudfront.net.us-west-2.compute.internal.
ERROR: 2015/10/02 21:41:12 exchanger.go:64: error exchanging with "10.0.0.2:53": read udp 10.0.0.254:37968->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:41:05 resolver.go:281: failed external DNS lookup of "registry-1.docker.io.us-west-2.compute.internal.": read udp 10.0.0.254:53052->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:41:05 resolver.go:280: registry-1.docker.io.us-west-2.compute.internal.
ERROR: 2015/10/02 21:41:05 exchanger.go:64: error exchanging with "10.0.0.2:53": read udp 10.0.0.254:53052->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:40:59 resolver.go:281: failed external DNS lookup of "registry-1.docker.io.us-west-2.compute.internal.": read udp 10.0.0.254:51908->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:40:59 resolver.go:280: registry-1.docker.io.us-west-2.compute.internal.
ERROR: 2015/10/02 21:40:59 exchanger.go:64: error exchanging with "10.0.0.2:53": read udp 10.0.0.254:51908->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:40:53 resolver.go:281: failed external DNS lookup of "registry-1.docker.io.us-west-2.compute.internal.": read udp 10.0.0.254:39560->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:40:53 resolver.go:280: registry-1.docker.io.us-west-2.compute.internal.
ERROR: 2015/10/02 21:40:53 exchanger.go:64: error exchanging with "10.0.0.2:53": read udp 10.0.0.254:39560->172.16.0.23:53: i/o timeout


mcoffin commented Oct 2, 2015

And here's a log during a slave authentication failure.

ERROR: 2015/10/02 21:38:37 exchanger.go:64: error exchanging with "10.0.0.2:53": read udp 10.0.0.254:39633->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:38:35 resolver.go:281: failed external DNS lookup of "ip-10-0-1-110.us-west-2.compute.internal.": read udp 10.0.0.254:41226->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:38:35 resolver.go:280: ip-10-0-1-110.us-west-2.compute.internal.
ERROR: 2015/10/02 21:38:35 exchanger.go:64: error exchanging with "10.0.0.2:53": read udp 10.0.0.254:41226->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:38:20 resolver.go:281: failed external DNS lookup of "ip-10-0-1-110.us-west-2.compute.internal.": read udp 10.0.0.254:43413->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:38:20 resolver.go:280: ip-10-0-1-110.us-west-2.compute.internal.
ERROR: 2015/10/02 21:38:20 exchanger.go:64: error exchanging with "10.0.0.2:53": read udp 10.0.0.254:43413->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:38:14 resolver.go:281: failed external DNS lookup of "ip-10-0-1-110.us-west-2.compute.internal.": read udp 10.0.0.254:46040->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:38:14 resolver.go:280: ip-10-0-1-110.us-west-2.compute.internal.
ERROR: 2015/10/02 21:38:14 exchanger.go:64: error exchanging with "10.0.0.2:53": read udp 10.0.0.254:46040->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:38:02 resolver.go:281: failed external DNS lookup of "ip-10-0-1-110.us-west-2.compute.internal.": read udp 10.0.0.254:48911->172.16.0.23:53: i/o timeout
ERROR: 2015/10/02 21:38:02 resolver.go:280: ip-10-0-1-110.us-west-2.compute.internal.
ERROR: 2015/10/02 21:38:02 exchanger.go:64: error exchanging with "10.0.0.2:53": read udp 10.0.0.254:48911->172.16.0.23:53: i/o timeout

Keep in mind that the following is the output when I dig for the same hostname against mesos-dns:

$ dig @localhost ip-10-0-1-110.us-west-2.compute.internal

; <<>> DiG 9.9.5-9ubuntu0.3-Ubuntu <<>> @localhost ip-10-0-1-110.us-west-2.compute.internal
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 26275
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;ip-10-0-1-110.us-west-2.compute.internal. IN A

;; ANSWER SECTION:
ip-10-0-1-110.us-west-2.compute.internal. 70 IN A 10.0.1.110

;; Query time: 1 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Fri Oct 02 21:47:30 UTC 2015
;; MSG SIZE  rcvd: 85

And here's the output when I hit the upstream DNS server directly:

$ dig @10.0.0.2 ip-10-0-1-110.us-west-2.compute.internal

; <<>> DiG 9.9.5-9ubuntu0.3-Ubuntu <<>> @10.0.0.2 ip-10-0-1-110.us-west-2.compute.internal
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 53915
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;ip-10-0-1-110.us-west-2.compute.internal. IN A

;; ANSWER SECTION:
ip-10-0-1-110.us-west-2.compute.internal. 573 IN A 10.0.1.110

;; Query time: 0 msec
;; SERVER: 10.0.0.2#53(10.0.0.2)
;; WHEN: Fri Oct 02 21:54:09 UTC 2015
;; MSG SIZE  rcvd: 85


tsenart commented Oct 2, 2015

Hi @mcoffin,

At first sight all I can see is that recursive queries are timing out. Have you tried raising the configured timeout?

Also, to make it easier to debug, would you please adjust this issue according to our contribution guidelines?

Thanks!


tsenart commented Oct 5, 2015

Something else I noticed: it seems all your names have a search domain appended to them, e.g. registry-1.docker.io.us-west-2.compute.internal.

What does your /etc/resolv.conf look like?


mcoffin commented Oct 5, 2015

I'll reformat the issue in a second. /etc/resolv.conf has the line search us-west-2.compute.internal, which is necessary for ip-10-0-x-xxx to be resolved as ip-10-0-x-xxx.us-west-2.compute.internal. I assume that isn't a problem, since it's pretty much straight out of the /etc/resolv.conf that comes on a newly created EC2 instance in that region.

Thanks in advance for taking a look at this.

EDIT: It's also interesting to me that only some of the recursive calls fail, but it is very consistent about which ones.

EDIT2: Added an overview section to the top level post to give a quick overview in the contrib guidelines format.


mcoffin commented Oct 5, 2015

@tsenart Any idea what's special about these specific queries, so I at least have an entry point into making a fix for this? Sorry for the persistence, but I'm blocked on an upgrade to mesos 0.24.1, which will require upgrading mesos-dns to a version that has this issue with my config, so I'm kind of bottlenecked here. I really appreciate your help.


tsenart commented Oct 5, 2015

Can you share your Mesos-DNS config? I probably won't be able to look at this tonight, but it'll be my priority tomorrow.


mcoffin commented Oct 5, 2015

@tsenart Thanks!

My config is below (in YAML; my Chef cookbook converts the mesos-dns/config key to a .json config file when it runs).

mesos_dns:
  config:
    zk: ${master_zk_string}/mesos
    masters: []
    refreshSeconds: 60
    ttl: 60
    domain: mesos
    ns: ns1
    port: 53
    resolvers:
      - 10.0.0.2
    timeout: 2
    listener: 0.0.0.0
    SOAMname: root.ns1.mesos
    SOARname: ns1.mesos
    SOARefresh: 60
    SOARetry: 600
    SOAExpire: 86400
    SOAMinttl: 60
    dnson: true
    httpon: true
    httpport: 8123
    externalon: true
    recurseon: false
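
For reference, the rendered config.json that mesos-dns actually reads should come out roughly like this (the zk value is templated by Chef and left as the placeholder here):

{
  "zk": "${master_zk_string}/mesos",
  "masters": [],
  "refreshSeconds": 60,
  "ttl": 60,
  "domain": "mesos",
  "ns": "ns1",
  "port": 53,
  "resolvers": ["10.0.0.2"],
  "timeout": 2,
  "listener": "0.0.0.0",
  "SOAMname": "root.ns1.mesos",
  "SOARname": "ns1.mesos",
  "SOARefresh": 60,
  "SOARetry": 600,
  "SOAExpire": 86400,
  "SOAMinttl": 60,
  "dnson": true,
  "httpon": true,
  "httpport": 8123,
  "externalon": true,
  "recurseon": false
}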


mcoffin commented Oct 5, 2015

New Info

This problem only occurs when using the upstream DNS server that comes with an EC2 VPC. It goes away when using 8.8.8.8 or 8.8.4.4 as resolvers. Unfortunately, the VPC DNS server is also what resolves the internal EC2 hostnames, so I can't switch over. Just more information.


tsenart commented Oct 5, 2015

@mcoffin: I'm having trouble debugging this without more information. I'm confused as to why the logs have mismatching IPs. What's the IP of the machine you're running on? 10.0.0.254? If so, I don't understand why the packet is going to 172.16.0.23.

ERROR: 2015/10/02 21:38:37 exchanger.go:64: error exchanging with "10.0.0.2:53": read udp 10.0.0.254:39633->172.16.0.23:53: i/o timeout


tsenart commented Oct 5, 2015

@mcoffin: Wanna hop on #mesos-dns on Freenode?


mcoffin commented Oct 5, 2015

@tsenart Hopping in.

For people following along in the GH issue, mesos-dns is running on 10.0.0.254. I have no clue where the 172.16.xxx.xxx IP is coming from.


tsenart commented Oct 5, 2015

@mcoffin: I'm there.


mcoffin commented Oct 5, 2015

@tsenart It's not showing you in #mesos-dns, and I can't query your nick either.


tsenart commented Oct 5, 2015

After a long IRC session, I tracked the bug to here: https://github.com/mesosphere/mesos-dns/blob/master/exchanger/exchanger.go#L91

This was introduced in the refactoring of the resolver. I'll fix this first thing tomorrow.


tsenart commented Oct 6, 2015

@mcoffin: It turns out this issue isn't explained by the bug I discovered yesterday. Thorough analysis of the traffic dump you provided reveals a problematic flow (sketched in code after the list):

  1. 10.0.2.253 asks 10.0.0.254 (Mesos-DNS) for ip-10-0-2-253.us-west-2.compute.internal IN AAAA
  2. 10.0.0.254 forwards the query to its configured resolver 10.0.0.2
  3. 10.0.0.2 replies with no answer, no error and an SOA record in the authoritative zone.
  4. 10.0.0.254 recurses the original query to the nameserver found in the SOA record from the previous reply
  5. In order to do so, 10.0.0.254 asks 10.0.0.2 for ns0.us-west-2.compute.internal IN A
  6. 10.0.0.2 replies with 172.16.0.23
  7. 10.0.0.254 asks 172.16.0.23 for ip-10-0-2-253.us-west-2.compute.internal IN AAAA
  8. The previous query times out because 172.16.0.23 isn't reachable.
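
To make that concrete, here is a rough sketch of steps 1-5 in Go, using the github.com/miekg/dns library that Mesos-DNS is built on. This is not the actual Mesos-DNS code, just an illustration using the addresses from this thread:

package main

import (
	"fmt"

	"github.com/miekg/dns"
)

func main() {
	c := new(dns.Client)

	// Steps 1-2: the original AAAA query is sent to the configured
	// resolver, the VPC DNS server at 10.0.0.2.
	q := new(dns.Msg)
	q.SetQuestion("ip-10-0-2-253.us-west-2.compute.internal.", dns.TypeAAAA)
	reply, _, err := c.Exchange(q, "10.0.0.2:53")
	if err != nil {
		fmt.Println("exchange error:", err)
		return
	}

	// Step 3: the reply has no answer, no error, and an SOA record in the
	// authority section.
	if len(reply.Answer) == 0 {
		for _, rr := range reply.Ns {
			if soa, ok := rr.(*dns.SOA); ok {
				// Steps 4-8: the recursing logic then resolves the SOA's
				// primary nameserver (ns0.us-west-2.compute.internal, which
				// comes back as 172.16.0.23) and retries the original query
				// against it; that retried query is the one that times out.
				fmt.Println("recursion would continue at:", soa.Ns)
			}
		}
	}
}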

I opened #301 to find out why Mesos-DNS is doing recursion instead of just forwarding. However, this doesn't explain your perceived change of behaviour since v0.1.1.

Here are two questions that can help us make progress:

  1. Why are AAAA queries being issued and from where?
  2. Why do you need to configure the Mesos-DNS external resolver with the Amazon DNS server IP when it is already in /etc/resolv.conf for your processes to use?

tsenart pushed a commit that referenced this issue Oct 6, 2015
While debugging #297 and #298, it became apparent that this extra
information would be very useful in aiding understanding of the issue.

mcoffin commented Oct 6, 2015

@tsenart

Here are two questions that can help us make progress here

  1. I'm really not sure why these are being sent. I'll look more into it, but I'm not familiar with the docker code base at all, my familiarity with the mesos codebase is minimal, and my knowledge of the slave authentication system is limited to what I've learned while debugging this issue.
  2. This was being included from a legacy configuration. When I removed the "resolvers": ["10.0.0.2"] key from the JSON configuration, I could no longer resolve internal hostnames via dig, despite my /etc/resolv.conf being:
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
#     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 10.0.0.2
search us-west-2.compute.internal


mcoffin commented Oct 6, 2015

@tsenart Let me see if I'm understanding all this correctly. As of now, these are the unexpected behaviors we're looking for explanations for. Tell me if I'm missing any, or if some of these are already explained and I just don't understand them.

Problems

  1. resolvers has to be specified manually in the mesos-dns JSON configuration despite being in /etc/resolv.conf.
  2. Mesos-DNS is doing recursion instead of forwarding despite the recurseon flag in the config being false.
  3. Amazon's VPC DNS server is giving us an SOA record for ip-10-0-0-250.us-west-2.compute.internal instead of just resolving it.
  4. Amazon's VPC DNS server redirects us to 172.16.0.23, which is unreachable from the VPC, whereas we'd expect to only be referred to accessible servers. (This might be fixable by enabling EC2-ClassicLink for the VPC, since 172.16.0.23 is the default DNS server for EC2-Classic from before the days of VPCs, but there is no reason we should have to do that).
  5. AAAA records are being requested via queries from docker / mesos slave authentication whereas we don't expect that to happen.
  6. This problem began occurring between v0.1.2 and v0.2.0, whereas we would expect this to be a problem for all versions (due to the unreachability of 172.16.0.23).

Other notes

I'm working on a patch to mesos' slave authentication to improve the timeout code to differentiate between timeouts in hostname resolution and timeouts of the actual slave authentication process. While this is only a workaround for this problem with mesos-dns, it is the "correct" way to handle the situation upstream in mesos, as currently slave authentication doesn't fall back on the slave IP if the hostname resolution timeout is longer than the slave authentication process timeout.


tsenart commented Oct 6, 2015

  1. resolvers has to be specified manually in the mesos-dns JSON configuration despite being in /etc/resolv.conf.

That is correct.

  2. Mesos-DNS is doing recursion instead of forwarding despite the recurseon flag in the config being false.

The recurseon flag is poorly named. It actually has nothing to do with enabling recursion for external queries. https://mesosphere.github.io/mesos-dns/docs/configuration-parameters.html

  3. Amazon's VPC DNS server is giving us an SOA record for ip-10-0-0-250.us-west-2.compute.internal instead of just resolving it.

Yes, but only for AAAA type queries, which it doesn't seem to support.
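
If you want to see it for yourself, a query along these lines against the VPC resolver should come back with an empty answer section and the SOA in the authority section:

$ dig @10.0.0.2 ip-10-0-1-110.us-west-2.compute.internal AAAA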

  4. Amazon's VPC DNS server redirects us to 172.16.0.23, which is unreachable from the VPC, whereas we'd expect to only be referred to accessible servers. (This might be fixable by enabling EC2-ClassicLink for the VPC, since 172.16.0.23 is the default DNS server for EC2-Classic from before the days of VPCs, but there is no reason we should have to do that).

It doesn't redirect us, but due to the current recursing logic in Mesos-DNS, we retry the same query against that DNS server.

  5. AAAA records are being requested via queries from docker / mesos slave authentication whereas we don't expect that to happen.

I don't know where they're coming from. You should track that down.

  6. This problem began occurring between v0.1.2 and v0.2.0, whereas we would expect this to be a problem for all versions (due to the unreachability of 172.16.0.23).

I'm assuming the problem already happened before, but there weren't nearly as many error logs. You see, A type queries are still being resolved correctly, as far as I understand it.

With #307, I'm fairly confident this whole issue will go away.


tsenart commented Oct 7, 2015

Can you try running https://github.com/mesosphere/mesos-dns/releases/tag/v0.4.0-pre and see how it behaves?


lnxmad commented Oct 8, 2015

I'm also getting this same behavior.

Oct  7 19:55:50 mesos-slave01 mesos-dns[16618]: ERROR: 2015/10/07 19:55:50 resolver.go:281: failed external DNS lookup of "g-ns-717.awsdns-11.org.": dial udp: i/o timeout
Oct  7 19:55:50 mesos-slave01 mesos-dns[16618]: ERROR: 2015/10/07 19:55:50 exchanger.go:64: error exchanging with "8.8.8.8:53": dial udp: i/o timeout
Oct  7 19:55:50 mesos-slave01 mesos-dns[16618]: ERROR: 2015/10/07 19:55:50 resolver.go:280: g-ns-717.awsdns-11.org.
Oct  7 19:55:50 mesos-slave01 mesos-dns[16618]: ERROR: 2015/10/07 19:55:50 resolver.go:281: failed external DNS lookup of "g-ns-717.awsdns-11.org.": dial udp: i/o timeout
Oct  7 19:55:50 mesos-slave01 mesos-dns[16618]: ERROR: 2015/10/07 19:55:50 exchanger.go:64: error exchanging with "8.8.8.8:53": read udp *internalip*:53253->8.8.8.8:53: i/o timeout
Oct  7 19:55:50 mesos-slave01 mesos-dns[16618]: ERROR: 2015/10/07 19:55:50 resolver.go:280: g-ns-717.awsdns-11.org.

This is filling up gigabytes of logs.


tsenart commented Oct 8, 2015

@lnxmad, @mcoffin: Can't you set up log rotation?
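
If it helps, here's a minimal logrotate sketch, assuming mesos-dns output is redirected to /var/log/mesos-dns.log (adjust the path to wherever your logs actually end up):

/var/log/mesos-dns.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    copytruncate
}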


tsenart commented Oct 9, 2015

Anything to report?


lnxmad commented Oct 10, 2015

The 0.4.0-pre release no longer logs these errors. Was logging suppressed, or has the issue been addressed?


tsenart commented Oct 10, 2015

@lnxmad: Network errors will still be logged, but since external queries are now transparently forwarded and not recursed, the above condition of trying to reach an unreachable server isn't happening anymore. Please note that we've also released v0.4.0 since, so you should use that instead of the pre-release.
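
Roughly speaking (this is a sketch, not the actual Mesos-DNS code), forwarding just relays the query to each configured resolver in turn and returns the first successful reply unchanged, instead of chasing SOA records itself:

package sketch

import "github.com/miekg/dns"

// Forward is a sketch of "transparent forwarding": try each configured
// resolver in order and return the first successful reply as-is.
func Forward(c *dns.Client, req *dns.Msg, resolvers []string) (*dns.Msg, error) {
	var lastErr error
	for _, addr := range resolvers {
		reply, _, err := c.Exchange(req, addr)
		if err == nil {
			return reply, nil
		}
		lastErr = err
	}
	return nil, lastErr
}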


lnxmad commented Oct 10, 2015

Thanks, I'll make sure to upgrade. I may use mesos-dns as a forward lookup zone with my existing DNS server. This approach seems to make more sense in my development configuration.


mcoffin commented Oct 12, 2015

Sorry for the absence; we had an incredibly busy end of the week. 0.4.0 has fixed the issue on my end, and I'm no longer seeing the strange timeouts. Thanks a bunch for the hard work, @tsenart.

I did a tcpdump as well and took a look at it just to make sure. Everything seemingly looks good now.
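
If anyone wants to do the same check, something along these lines captures the DNS traffic for inspection (adjust the interface name for your instance):

$ sudo tcpdump -ni eth0 -w dns.pcap udp port 53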


tsenart commented Oct 12, 2015

Great! Closing this (long) issue :-)
