Upstream i/o timeouts with mesos slave authentication (and docker registry) #297
Comments
Adding
|
And here's a log during a slave authentication failure.
Keep in mind that the following is the output when I do a
and when I hit the upstream DNS server directly
|
Hi @mcoffin, At first sight all I can see is that recursive queries are timing out. Have you tried raising the configured timeout? Also, to make it easier to debug, would you please adjust this issue according to our contribution guidelines? Thanks! |
Something else I noticed: It seems all your names have a search domain appended to them: i.e. What does your |
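For anyone following along, here is a rough sketch of how a stub resolver typically expands a name using a search list, which is one way a search domain ends up appended to every query. This is not Mesos-DNS code. The function name `candidates`, the search domain `us-west-2.compute.internal`, and the `ndots` value of 1 are assumptions made for illustration; check the real values in your own resolver configuration.

```go
package main

import (
	"fmt"
	"strings"
)

// candidates returns, in order, the fully qualified names a typical stub
// resolver would try for name, given a search list and an ndots threshold.
// Hypothetical helper for illustration only.
func candidates(name string, search []string, ndots int) []string {
	// A trailing dot marks the name as absolute: no search list is applied.
	if strings.HasSuffix(name, ".") {
		return []string{name}
	}
	asIs := name + "."
	// Names with at least ndots dots are tried as-is before the search list.
	asIsFirst := strings.Count(name, ".") >= ndots
	var out []string
	if asIsFirst {
		out = append(out, asIs)
	}
	for _, domain := range search {
		out = append(out, name+"."+strings.TrimSuffix(domain, ".")+".")
	}
	if !asIsFirst {
		out = append(out, asIs)
	}
	return out
}

func main() {
	// Assumed search domain, modelled on the EC2 internal hostnames in this issue.
	search := []string{"us-west-2.compute.internal"}
	for _, n := range candidates("registry-1.docker.io", search, 1) {
		fmt.Println(n)
	}
}
```

If the as-is query fails or times out, the resolver moves on to the search-domain variants, so the same lookup can show up in the logs under several names.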
I'll reformat the issue in a second. Thanks in advance for taking a look at this. EDIT: It's also interesting to me that only some of the recursive calls fail, but it is very consistent about which ones. EDIT2: Added an overview section to the top level post to give a quick overview in the contrib guidelines format. |
@tsenart Any idea what's special about these specific queries, so I can have at least an entry point into making a fix for this? Sorry for the persistence, but I'm blocked on an upgrade to mesos 0.24.1, which will require upgrading mesos-dns to a version that has this issue with my config, so I'm kind of bottlenecked here. I really appreciate your help. |
Can you share your Mesos-DNS config? Perhaps tonight I won't be able to pay attention to this but it'll be my priority tomorrow. |
@tsenart Thanks! My config is below (in YAML, my chef cookbook converts the
mesos_dns:
  config:
    zk: ${master_zk_string}/mesos
    masters: []
    refreshSeconds: 60
    ttl: 60
    domain: mesos
    ns: ns1
    port: 53
    resolvers:
      - 10.0.0.2
    timeout: 2
    listener: 0.0.0.0
    SOAMname: root.ns1.mesos
    SOARname: ns1.mesos
    SOARefresh: 60
    SOARetry: 600
    SOAExpire: 86400
    SOAMinttl: 60
    dnson: true
    httpon: true
    httpport: 8123
    externalon: true
    recurseon: false |
New Info
This problem also only occurs when using the upstream DNS server that comes with an EC2 VPC. It goes away when using |
@mcoffin: I'm having trouble debugging this without more information. I'm confused as to why the logs have mismatching IPs. What's the IP of the machine you're running on?
|
@mcoffin: Wanna hop on #mesos-dns on Freenode? |
@tsenart Hopping in. For people following in the GH issue, mesos-dns is running on |
@mcoffin: I'm there. |
@tsenart It's not showing you in |
After a long IRC session, I tracked the bug to here: https://github.com/mesosphere/mesos-dns/blob/master/exchanger/exchanger.go#L91 This was introduced in the refactoring of the resolver. I'll fix this first thing tomorrow. |
@mcoffin: It turns out this issue isn't explained by the bug I discovered yesterday. Thorough analysis of the traffic dump you provided me reveals a problematic flow:
I opened #301 to find out why Mesos-DNS is doing recursion instead of just forwarding. However, this doesn't explain your perceived change of behaviour since Here are two questions that can help us make progress:
|
|
@tsenart Let me see if I'm understanding all this correctly. These are the unexpected behaviors we're still looking for explanations for. Tell me if I'm missing any, or if some of these are already explained and I just don't understand them.
Problems
Other notes
I'm working on a patch to mesos' slave authentication that improves the timeout code to differentiate between timeouts in hostname resolution and timeouts of the actual slave authentication process. While this is only a workaround for this problem with mesos-dns, it is the "correct" way to handle the situation upstream in mesos, since slave authentication currently doesn't fall back on the slave IP if the hostname resolution timeout is longer than the slave authentication process timeout. |
That is correct.
The
Yes, but only for AAAA type queries, which it doesn't seem to support.
It doesn't redirect us, but due to the current recursing logic in Mesos-DNS, we retry the same query against that DNS server.
I don't know where they're coming from. You should track that down.
I'm assuming the problem already happened before, but there weren't nearly as many error logs. You see, A type queries are still being resolved correctly, as far as I understand it. With #307, I'm fairly confident this whole issue will go away. |
Can you try running https://github.com/mesosphere/mesos-dns/releases/tag/v0.4.0-pre and see how it behaves? |
I'm also getting this same behavior.
This is filling up gigabytes of logging. |
Anything to report? |
The 0.4.0-pre no longer logs these errors. Was logging suppressed, or has the issue been addressed? |
@lnxmad: Network errors will still be logged but since external queries are now transparently forwarded and not recursed, the above condition of trying to reach an unreachable server isn't happening anymore. Please note that we've also released v0.4.0 since, so you should use that instead of the pre release. |
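To illustrate the difference described above, here is a minimal sketch of plain forwarding: each configured resolver is asked once, with a per-exchange timeout, and referrals are never followed. This is not the actual Mesos-DNS implementation. `exchangeFunc`, `forward`, and `demo` are hypothetical names, the fake exchange stands in for a real DNS round trip, and `192.0.2.53` is a documentation-only address used as a second resolver.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// exchangeFunc performs one DNS round trip against a single server.
// Hypothetical abstraction so the sketch runs without network access.
type exchangeFunc func(ctx context.Context, server, name string) (string, error)

// forward asks each resolver in order and returns the first answer.
// It never follows referrals, so it only ever talks to configured servers.
func forward(ctx context.Context, resolvers []string, timeout time.Duration, name string, ex exchangeFunc) (string, error) {
	lastErr := errors.New("no resolvers configured")
	for _, server := range resolvers {
		// Bound every exchange so one slow server cannot stall the query.
		cctx, cancel := context.WithTimeout(ctx, timeout)
		answer, err := ex(cctx, server, name)
		cancel()
		if err == nil {
			return answer, nil
		}
		lastErr = fmt.Errorf("%s: %w", server, err)
	}
	return "", lastErr
}

// demo runs forward against a fake exchange in which the first resolver
// always fails, mimicking the i/o timeouts reported in this issue.
func demo() (string, error) {
	fake := func(ctx context.Context, server, name string) (string, error) {
		if server == "10.0.0.2" {
			return "", errors.New("i/o timeout")
		}
		return "answer for " + name + " via " + server, nil
	}
	resolvers := []string{"10.0.0.2", "192.0.2.53"}
	return forward(context.Background(), resolvers, 2*time.Second, "registry-1.docker.io.", fake)
}

func main() {
	fmt.Println(demo())
}
```

A recursing resolver would instead chase whatever name servers come back in a referral, which is how it can end up querying a server that is not reachable from inside the network.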
Thanks, I'll make sure to upgrade. I may use mesos-dns as a forward lookup zone with my existing dns server. This approach seems to make more sense in my development configuration. |
Sorry for the absence; we had an incredibly busy end of the week. 0.4.0 has fixed the issue on my end. No longer seeing the strange timeouts. Thanks a bunch for the hard work @tsenart. I did a tcpdump as well and took a look at it just to make sure. Everything seemingly looks good now. |
Great! Closing this (long) issue :-) |
Overview
v0.3.0, but this issue appeared somewhere between v0.1.2 and v0.2.0
Original Post
I'm not sure what's unique about the resolving that mesos slave authentication and the docker registry do, but somewhere between 0.1.2 and 0.2.0 (and still present in 0.3.0), something broke that creates upstream timeouts from these requests, producing logs like the following:
I no longer have the logs of the problem occurring during slave authentication, but with mesos-dns disabled, the slave is able to join properly. The logs look similar to those above, except the hostname is ip-10-0-x-xxx.us-west-2.compute.internal instead of the docker-related one. These i/o timeouts are blocking me from upgrading to mesos 0.24.1 because I would need to upgrade past 0.1.2 of mesos-dns to do so.
I'd have a go at fixing it myself, but I can't find what's unique about the DNS queries that are made by mesos and docker here. If I try to resolve registry-1.docker.io over mesos-dns via dig, everything works as expected. Same goes for the ip-10-0-x-xxx.us-west-2.compute.internal hostnames during slave authentication.
If you're here via Google and only care about the slave authentication, a good workaround is to turn down the timeout in mesos-dns to ~2 (something less than 5 seconds), because slave authentication times out after 5 seconds. If the mesos-dns resolve fails after 2 seconds, it still has 3 seconds to fall back on a slave IP and connect that way, rather than just waiting on the hopeless DNS result until the authentication itself times out. This is honestly probably a "bug" in slave authentication as well, as it should separate the timeouts for hostname resolution and the actual authentication process, but this is a minimal workaround for the time being.
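The arithmetic behind the workaround can be sketched as below. The 5-second authentication timeout and the 2-second mesos-dns timeout come from the post above. The helper name `fallbackBudget` is made up for illustration and does not exist in mesos or mesos-dns.

```go
package main

import "fmt"

// fallbackBudget returns how many seconds of the authentication window
// remain once the DNS lookup has given up, and whether any time is left
// to fall back on the slave IP. Hypothetical helper for illustration.
func fallbackBudget(authTimeoutSec, dnsTimeoutSec int) (int, bool) {
	left := authTimeoutSec - dnsTimeoutSec
	return left, left > 0
}

func main() {
	// Values from the post: authentication gives up after 5s, DNS after 2s.
	left, ok := fallbackBudget(5, 2)
	fmt.Printf("seconds left for IP fallback: %d (usable: %v)\n", left, ok)

	// A DNS timeout equal to the authentication timeout leaves no room.
	left, ok = fallbackBudget(5, 5)
	fmt.Printf("seconds left for IP fallback: %d (usable: %v)\n", left, ok)
}
```

Any mesos-dns timeout at or above the authentication timeout leaves nothing for the fallback, which is why the value has to stay below 5.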