Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Async DNS resolution #13619

Open
z64 opened this issue Jul 2, 2023 · 17 comments
Open

Async DNS resolution #13619

z64 opened this issue Jul 2, 2023 · 17 comments

Comments

@z64
Copy link
Contributor

z64 commented Jul 2, 2023

Currently, the socket runtime will call LibC.getaddrinfo, which will block the event loop until it completes.

For latency-sensitive applications, this can cause slowdowns that stack, or other issues. It is easy to write a python script that makes a bunch of HTTP requests to some hosts, and it will be much faster than Crystal, because it doesn't get stuck on DNS.

Since all other discussions I could find are quite old & closed, opening this new one for visibility & broader view of the problem's history.

Related material

  • Fix/Implement own DNS resolver #2660
    • In a past era, Crystal used libevents getaddrinfo, but it was discarded for plain LibC getaddrinfo due to other issues. I don't know if anything has changed on libevent's end to consider trying it again.
  • dns.cr shard
    • Currently we use this for non-blocking DNS, and it works "ok". I do not recommend it though - unless you really need it - as it is far too complicated.
  • Theaded DNS resolver (from Configurable DNS resolvers #4236)
    • To me, it seems like this could be a next step: Keep the robustness of libc getaddrinfo, but do it in another thread for the event loop is not blocked. This implementation could be brought up to date, or a new one made.
stakach added a commit to PlaceOS/drivers that referenced this issue Aug 18, 2023
@straight-shoota
Copy link
Member

A great benefit of getaddrinfo is that it's battle tested and covers a lot of niches. It's used almost ubiquitously and its behaviour is considered as the system default (which it actually is on many systems).

Its main issue is that its blocking the current thread. We can alleviate that a bit by running it in a dedicated thread. This could happen implicitly and would be a nice feature in the context of the ongoing multi-threading refactor (ref crystal-lang/rfcs#2).
This is a well-known technique and should work relatively fine. However, it's quite inefficient.

So I think we'll eventually need a native implementation for DNS resolution that uses Crystal's concurrency.
https://github.com/636f7374/dns.cr could be a good source for inspiration, but it's far too complex for this. We only need a relatively simple implementation, covering a fraction of its features.

We might take further inspiration from Go: https://pkg.go.dev/net#hdr-Name_Resolution
It actually still uses getaddrinfo because the native implementation is incomplete. There's a rather complex set of rules to decide when to use the native implementation and when getaddrinfo.
Considering that even Go hasn't managed to reach feature parity with getaddrinfo (I'm not even sure if that's a goal for them), maybe we should really keep our focus on making getaddrinfo usable as easily and efficiently as possible, and worry about a native implementation later.

That said, looking at the Go implementation it really shouldn't be that difficult to implement our own algorithm.

@ysbaddaden
Copy link
Contributor

ysbaddaden commented Apr 25, 2024

It's impossible to reach feature parity with getaddrinfo (or equivalent). For example glibc/nsswitch allows extensible resolvers, so anybody can write, add and configure their own special resolver to handle special cases (e.g. custom domain, mDns, whatever you can think of).

Let us pause to contemplate the idea of using plugins at the libc level, instead of running a local resolver, that allows such plugins 😮‍💨

Using a custom DNS resolver, be it libevent dns or a pure Crystal one, means bypassing all that customization. Even going through a custom DNS resolver then fallback to getaddrinfo is prone to errors, because the former may resolve while the latter is customized to return something else 😭

Let's not even talk about security extensions (DNSSEC, DoT, DoH, DNSCrypt).

The dedicated thread is still likely a good idea. Even with an io/event aware DNS resolver, we'd still need the ability to call getaddrinfo. The advantage of Go is that it doesn't need a dedicated thread (it 'merely' detaches the scheduler from the thread) and there can be multiple concurrent calls to getaddrinfo.

@straight-shoota
Copy link
Member

On Linux there's also getaddrinfo_a for asynchronous queries. It's a GNU extension and doesn't seem to be supported outside of glibc.
I haven't looked too deeply into how it's implemented, but it could be a bit of a challenge to integrate it with the event loop.
Also, since it only works with glibc, we'll need a generic solution for other targets anyway. So I don't think it's much worth persuing it.

@HertzDevil
Copy link
Contributor

HertzDevil commented Sep 1, 2024

  • getaddrinfo_a notifies the caller via a POSIX signal or a callback in a new thread.
  • Darwin has something called getaddrinfo_async_start; it is used by Bun, other than that I couldn't find any documentation about it. Another system API is DNSServiceGetAddrInfo for macOS 10.12+.
  • On Windows 8 or above, the asynchronous Win32 API is DnsQueryEx or GetAddrInfoExW. Both support asycnhronous callbacks, the latter overlapped I/O also.

I don't think we really have to stick to using getaddrinfo exactly, as long as other system APIs also provide feature parity with respect to those customizations.

@stakach
Copy link
Contributor

stakach commented Sep 2, 2024

building a crystal native implementation sounds like the easiest way to provide cross platform non-blocking support.
I like the idea of being able to switch out the implementation too - making it easier to add support for mDNS or similar to 3rd party libraries that one might be using in an application (or maybe a way to switch implementation on matching regex .local for instance)

chatgpt pumped out a robust (albeit basic) crystal implementation supports IPv4, IPv6 and ALPN defaulting to using /etc/resolv.conf servers, pipelining queries and multiple DNS servers with timeouts

straight-shoota pushed a commit that referenced this issue Sep 5, 2024
Moves the platform-specific code into a separate module, so that implementations other than `LibC.getaddrinfo` can be added without cluttering the same file (e.g. Win32's `GetAddrInfoExW` from #13619).
@straight-shoota
Copy link
Member

straight-shoota commented Oct 15, 2024

I found this great resource explaining details of the different implementations for DNS resolves on Linux:

https://biriukov.dev/docs/resolver-dual-stack-application/0-sre-should-know-about-gnu-linux-resolvers-and-dual-stack-applications/

Haven't read all of it yet. But I think an easy takeaway is that getaddrinfo_a is just running getattrinfo within a thread pool. I don't think it worth it using this.
In order to support async getaddrinfo on Unix systems that don't have the GNU extension getattrinfo_a (which includes musl), we'll need our own threadpool-based implementation anyway. Then we can just use that on all Unix platforms, including glibc.

@straight-shoota
Copy link
Member

Another observation is that the implementation of getaddrinfo in musl is very basic compared to glibc (ref #13619 (comment)). I figure it should be relatively easy to implement the equivalent in Crystal natively. The great benefit would be using Crystal's concurrent IO and not occupying an OS thread.
I presume the implementations of getaddrinfo on other Unix systems will be more on the simple side and not as bloated as glibc's.

@straight-shoota
Copy link
Member

straight-shoota commented Oct 16, 2024

There's also an overview of which resolvers are used in popular programming languages' standard libraries:

  • Python: getaddrinfo + optional async version running in a thread pool
  • Go: Go implementation or getaddrinfo via cgo (threaded via scheduler semantics). Go impl is usually preferred, but under some conditions it prefers getaddrinfo (e.g. on OS X, or with certain configurations that the native implementation cannot handle). So Go impl looks like a plug-in optimization for simple standard configurations.
  • Rust: getaddrinfo (not sure if threaded, but seems like no)
  • Java: getaddrinfo (not sure if threaded)
  • Node.js: getaddrinfo via libuv (in a thread pool)

@crysbot
Copy link

crysbot commented Oct 18, 2024

This issue has been mentioned on Crystal Forum. There might be relevant details there:

https://forum.crystal-lang.org/t/charting-the-route-to-multi-threading-support/7320/1

@yxhuvud
Copy link
Contributor

yxhuvud commented Oct 18, 2024

One way to get started fast could be to start with one of the implementations for ruby, translate that and then refactor until happiness ensues. Available options there are https://github.com/ruby/resolv and very perhaps https://github.com/socketry/rubydns even if that one mainly is a server and therefore has more functionality than we want. If nothing else they could be good places to look for inspiration.

@straight-shoota
Copy link
Member

The implementations in Go and musl libc should also be good inspiration.

@ysbaddaden
Copy link
Contributor

ysbaddaden commented Oct 19, 2024

I'll state it again: languages don't implement custom DNS because it's much harder and complex than it first looks like.

We already removed support for libevent DNS because it failed in practice in simple setups, because glibc, nsswitch, bonjour, docker DNS, etc. See #2660 #2426 #2745 and the issues they link to.

Stdlib should merely provide a configurable backend and either keep the current blocking behavior and/or delegate to a pool of threads. Then shards can implement custom DNS resolvers and applications replace the default adapter.

@straight-shoota
Copy link
Member

straight-shoota commented Oct 19, 2024

Not sure I agree without reservations. I think it depends a lot on use case.

The complications only apply if you expect feature parity with getattrinfo, particularly the GNU variant. But that complexity may not always be necessary or desirable. For example, it makes it hard to reason about what's going on.
System-specific resolvers can cause an application to behave differently on different platforms, including statically (musl) vs. dynamically (glibc) linked executables on Linux.

That being said, this would be very low priorty. I don't think we will see a resolver implementation in stdlib anytime soon. If ever.

An immediate action should be to ensure getaddrinfo can run in a thread without blocking other fibers. This shouldn't be hard with the upcoming execution contexts.
And we should see what's necessary to integrate other resolver implementations from user code.

@stakach
Copy link
Contributor

stakach commented Oct 22, 2024

I created a shard with a possible implementation: https://github.com/spider-gazelle/dns

Currently features:

  • customisable caching (cache interface)
  • customisable resolvers and servers
    • supports mDNS on .local domains
    • allows the default resolver to be replaced (defaults to UDP, could include a getaddrinfo one, https, etc)
  • extendable resource processing (if you need to extract an ISDN record you can implement the parser etc)
  • customisable timeouts and failure limits (leading to server demotion)

Happy to work on getting this into crystal std lib with community collaboration if there is interest / people think it's going in the right direction - or at least hooks in the std lib so an alternative resolver can be used without monkey patching

@stakach
Copy link
Contributor

stakach commented Oct 30, 2024

I have https://github.com/spider-gazelle/dns probably resolving most of the issues seen in previous getaddrinfo replacements

  • mDNS support
  • hosts file parsing
  • system specific DNS configuration parsing (windows, mac and unix variants)
  • resolvers for UDP (default), HTTPS, TLS, mDNS and getaddrinfo fallback

I would like to progress this into something that might be considered for inclusion in the std lib
AND/OR
implementing an official way to replace the default resolver as @ysbaddaden mentioned.

My opinion is:

  1. keep the existing behavior as the standard - use an async implementation like that implemented on Windows where possible and include "dns" automatically with socket for platforms where an async implementation isn't available
  2. provide a supported and standard interface for replacing the default resolver (I added a monkey patch for Socket::Addrinfo that doesn't seem ideal)
  3. inclusion of DNS in the std lib - making it simpler to add support for custom resolvers in applications and as a way to provide consistent cross platform DNS resolution behaviour

I imagine something like this in the standard lib (of course no one would be forced to use this implementation either, assuming point 2 is implemented)

# configures the standard lib DNS classes to handle DNS resolutions (as per point 2)
# by default passing through to getaddrinfo (i.e. no perceivable change in behaviour when included in an application)
# on platforms without async support, runs getaddrinfo in a thread pool so it doesn't block the reactors
require "dns"

# configure a custom resolver to be used by default (break through any corporate filters for example)
DNS.default_resolver = DNS::Resolver::HTTPS.new(["https://1.1.1.1/dns-query"])

# configures mDNS support, platforms like windows before win10 don't have it and it's often disabled on enterprise devices
DNS.resolvers[/.+\.local$/i] = DNS::Resolver::MDNS.new

Does the above sound like a way forward?

@atlantis
Copy link

atlantis commented Nov 1, 2024

This sounds amazing to me! Even if getaddrinfo remains the default implementation, I would argue that it's a kindness to future Crystal developers to call in its own thread, if at all possible - it was a bit of a shock to see my entire near-realtime embedded application hang for 20s the first time it lost internet, in spite of all my careful spawn planning.

Granted my use-case is non-standard, but I believe that even a vanilla Crystal use-case like a high-performance Kemal app could easily be affected by this design decision (e.g. an app that routinely calls customer-specified webhook urls might periodically hang for a few seconds until getaddrinfo returns)?

So if it's feasible to have the default getaddrinfo call in its own thread, seems like that's clearing out a potential trap that would otherwise be lurking once you start using Crystal at scale. 🤷‍♂️

@stakach
Copy link
Contributor

stakach commented Nov 5, 2024

@atlantis we could make the DNS lib native resolver perform the resolutions in a thread pool on platforms that don't support async - which would solve the issue

Using native async system calls would be lighter weight so I think doing that by preference would work best.
On platforms where async isn't supported we would require "dns" anytime socket is required to provide async DNS transparently

That way we don't complicate things beyond the 3 points I outlined above (I'll edit the message with these changes)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants