dns.lookup blocks filesystem I/O #8436
Yes, unfortunately this has been a known issue for a while, but there's not a clear solution just yet outside of a massive (and non-backwards-compatible) refactoring of the `dns.lookup` implementation.
Thanks for the prompt response. Just to clarify, I'm referring to the fact that the underlying implementation of `dns.lookup` runs blocking `getaddrinfo` calls on the shared libuv threadpool. Is that considered a bug in the implementation, or a deliberate design decision?
Little of both, really. There may be ways of tweaking the current implementation to improve this particular issue without breaking things, but doing so would largely just end up being a band-aid. There are several issues with the current implementation that make it less than desirable.
Upvoting this issue. We had this recently hit one of our applications in a very non-intuitive way. The node process we are running acts as a monitor for hundreds of host names. We are pinging the servers using plain `http` requests.

After roughly 10 hours searching the internet on how outgoing requests could cause the fs to block, I finally found this thread. Based on it, and the implementation-considerations section of the `dns` docs, it appears the `dns.lookup` calls behind each request were saturating the threadpool.

The other complicating factor is that in the `http` module, `dns.lookup` is the default resolution mechanism, with no obvious option to swap in `dns.resolve`.

So while this behavior might be known, the results of it are very, very unintuitive and difficult to debug. The idea that a slow DNS server can block file system I/O is not something most developers would guess.

Is there a list of specific configuration files that differ between `dns.lookup` and `dns.resolve` (the hosts file, presumably, but what else)?

Any guidance is appreciated.
Can you report that as a bug, please? That should be fixable.
Yes, that will keep node out of the thread pool and on the main thread, solving your contention problem. Local DNS setup can help with the problem of just getting faster DNS responses. Increasing the threadpool size from the default of 4 (via the `UV_THREADPOOL_SIZE` environment variable) can also relieve the pressure.
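A minimal sketch of the pre-resolve approach described above (the hostname and structure are illustrative, not from the thread): resolve with `dns.resolve4`, which uses c-ares on the event loop rather than the threadpool, then connect by IP so `http.get` never calls `dns.lookup`.

```js
// Resolve via c-ares (event loop) instead of getaddrinfo (threadpool),
// then hand the raw IP to http.get so no dns.lookup ever runs.
const dns = require('dns');
const http = require('http');

dns.resolve4('example.com', (err, addresses) => {
  if (err) throw err;
  http.get(
    { host: addresses[0], headers: { Host: 'example.com' } }, // keep the vhost intact
    (res) => console.log('status:', res.statusCode)
  );
});
```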
Btw, there's no need to increase the threadpool size if you can avoid using the threadpool, but fs is using the pool, so it's still worth looking at.
Is there an objection to exposing a setting on `http.request` (and friends) to resolve host names with `dns.resolve` instead of `dns.lookup`? Yes, the behavior I seek can be accomplished by doing things like resolving the address myself and passing the IP along, but that has to be repeated in every module that makes requests.
dns calls are no more synchronous than fs calls; neither is synchronous, but they do contend for the same thread pool. What is happening with DNS provider timeouts? Is the thread in the (small) pool taking so long that new lookups are blocked, waiting for a free thread? Has pre-resolving the hostnames not worked for you? There has been some interest in having a pluggable resolver, but no one has stepped up to do the work yet. Note that user-land modules like request can already implement options to resolve with dns.resolve instead of dns.lookup.
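For illustration, a sketch of the composition this enables — the `lookup` option on `net.connect` (the hook added via nodejs/node-v0.x-archive#8475) accepts any function with `dns.lookup`'s signature:

```js
const net = require('net');
const dns = require('dns');

// A dns.lookup-compatible function that resolves via c-ares instead of
// the getaddrinfo threadpool. Signature: (hostname, options, callback).
function resolveLookup(hostname, options, callback) {
  dns.resolve4(hostname, (err, addresses) => {
    callback(err, addresses && addresses[0], 4); // (err, address, family)
  });
}

const socket = net.connect({ host: 'example.com', port: 80, lookup: resolveLookup });
socket.on('connect', () => socket.end());
```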
@sam-github The problem arose for us when our DNS server was taking 5s+ to respond, due to issues with the server itself. The node process was responsible for heartbeating thousands of different host names via http requests every 5 minutes. When the dns server slowed, the `dns.lookup` calls occupied every thread in the pool, and the file system I/O in the same process stalled behind them.

I have a reproduction of the fs lock-up below: https://gist.github.com/owenallenaz/fc3d8b24cdaa4ec04cc9dd0bf8ab485f

In that case, by utilizing `dns.resolve` instead of `dns.lookup`, the file system calls proceed unhindered.

The main reason I bring this up here is that if someone does not care about host file interactions, it would seem to me that the default behavior really should be `dns.resolve`.
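The gist above is the author's actual reproduction; as a hedged sketch of the same shape (the `.invalid` names and counts are illustrative), the contention looks like this on a pre-fix Node with the default 4-thread pool:

```js
const dns = require('dns');
const fs = require('fs');

// Occupy the (default 4) threadpool workers with lookups against a slow
// or unreachable resolver...
for (let i = 0; i < 8; i++) {
  dns.lookup(`host-${i}.invalid`, () => {});
}

// ...then watch an unrelated, normally-instant fs call queue behind them.
const start = Date.now();
fs.stat(__filename, () => {
  console.log('fs.stat took', Date.now() - start, 'ms');
});
```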
Thanks for the war-story, it's good for them to be heard. I'm a bit surprised you don't also have a mem leak in the form of work queuing up behind the wedged pool.

Btw, arguably, async itself is unintuitive (at least, lots of people have had difficulty wrapping their head around it), so general statements about how some node API was found unintuitive to some person aren't really compelling reasons to make the API more complex.

I find the docs are deficient, they have no description of what is async-on-epoll and what is async-in-the-thread-pool, and I intend to fix that. I think that clearly documenting that, as well as how `dns.resolve` composes with the rest of the networking APIs, would address much of the surprise.

All PRs are considered, of course. But you want a predictor of how positive people would feel about it before wasting your time, understandable. Personally, I'd be not inclined towards it, here is why.

```js
function get(host, ...) {
  dns.resolve(host, function(ip) {
    http.get(ip, ...)
  })
}
```

vs

```js
function get(host, ...) {
  http.get(host, {
    hostLookup: 'resolve'
  }, ...)
}
```

The first looks like functional composition to me, nice APIs that work well together; the second just looks like an ugly hack. IMHO, of course. Also, the options knob to change the underlying function in dns.* doesn't allow more interesting usage, like making the dns implementation completely user-selectable. This particular suggestion is probably awful, but as a straw-man, I'd be more inclined to something like

```js
function get(host, ...) {
  http.get(host, {dns: myDns}, ....);
}
```

Where myDns (and the built-in default) would implement some common resolver interface. Basically, I'd prefer to see an approach to a more pluggable DNS, that does more than what can be already achieved by functional composition, and otherwise, just document that node APIs can be used together to get specific results.
Sounds good. I'll move my nagging up the chain to the `request` module. To your question, I end up with the outstanding requests eventually timing out, so the backlog drains rather than leaks. With `dns.resolve` in place, the fs calls are unaffected.
Hi guys, good to see there is still work going on surrounding this issue. Would just like to re-affirm that this issue is still relevant. I submitted an issue to karma not too long ago, which they investigated and traced to this bug. They mentioned this issue: nodejs/node-v0.x-archive#25489 as a reference. Unfortunately it is in the archives (and I don't think anyone is tracking it). So I'm partially commenting on here so I have something to track concerning my Karma issue: karma-runner/karma#2050

EDIT: I'll read the entire thread later today (it's morning here and I have to go to work soon). I read something about using `dns.resolve` as a work-around; I'll look into that.
@RobertDiebels nodejs/node-v0.x-archive#25489 isn't at all related to this issue.
@sam-github my bad. This issue: nodejs/node-v0.x-archive#25338 gave me the impression that it was. It is linked to the one from before: nodejs/node-v0.x-archive#25489
Why not have multiple thread pools? File system I/O has nothing to do with networking/DNS anyway. This has been an ongoing issue for years. In our application, we had to patch `dns.lookup` to resolve via `dns.resolve` so that slow lookups would not tie up the pool. I should also mention that `dns.resolve` does not consult the hosts file, so it is not a drop-in replacement.

Edit: As an improvement to our workaround, one could always fall back to the original `dns.lookup` when `dns.resolve` fails.
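A hedged sketch of the kind of patch described here (not the poster's actual code): route `dns.lookup` through `dns.resolve4`, falling back to the original lookup — and therefore the hosts file — only when resolution fails.

```js
const dns = require('dns');
const originalLookup = dns.lookup;

dns.lookup = function (hostname, options, callback) {
  // Normalize the optional-options call form, like the real API.
  if (typeof options === 'function') { callback = options; options = {}; }
  dns.resolve4(hostname, (err, addresses) => {
    // c-ares failed (e.g. name only in /etc/hosts): fall back to getaddrinfo.
    if (err) return originalLookup.call(dns, hostname, options, callback);
    callback(null, addresses[0], 4); // (err, address, family)
  });
};
```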
@bminer why not always use `dns.resolve`, then?

The problem with multiple pools is it raises the question of "how many": all C++ code using libuv to queue work off the main thread is using the same pool and potentially contending for those threads (including fs, dns.lookup, crypto.randomBytes, SQL native DB drivers, etc.). Should each one have its own pool? That makes it hard to control the overall amount of concurrency; the threads in use could explode as the number of pools does, without a decent way to manage them overall. And what about fs.read() and fs.write(), should they each have their own pools? Or is it just dns.lookup() that is a special case?
Interesting point. After thinking about this for a bit, I think `dns.lookup` really is the special case: it is the one call that routinely blocks on an external service for seconds at a time. But, fixing it alone still leaves the pool vulnerable to any other slow task. My suggestion is to give `dns.lookup` its own dedicated pool.

Another idea is to allow the creation of multiple thread pools and assigning different tasks to different pools. Perhaps the default configuration is a single thread pool with 4 threads -- all blocking activity uses that pool. But, the end-user could customize things to allow, say, `dns.lookup` to run in a separate pool.
Fair enough, it could be done, at some cost in time, by someone sufficiently motivated. War stories may provide that motivation. You didn't answer my question, though: why not always use `dns.resolve`?
This is how it could work: a CPU threadpool with one thread per core for CPU-intensive tasks, and a much larger IO threadpool (e.g. 128 threads) for blocking calls, with each task classified as one or the other.

This design should be amenable to CPU pinning, although I don't have experience with that. CPU-intensive calls would be scheduled on the CPU pool, and NAN could expose an option to have an addon's async work queued to either pool. By default, unless indicated, everything should run in the IO threadpool.

See this and this regarding 128 threads for the IO threadpool. @bnoordhuis's comment in the second link could be handled by migrating some hot paths off the threadpool entirely.
@sam-github Sorry, I didn't answer your question. We needed to use the OS-level resolution behavior (the hosts file in particular) for some names, hence the fallback to the original `dns.lookup`.

@jorangreef - How does your suggestion solve the problem at hand? The idea is to separate DNS into its own thread pool to avoid I/O thread starvation when DNS/networking problems crop up. That raises the question: will 128 threads be enough? One can easily build up 128 pending DNS requests when handling dozens of requests per second.
(war story follows)... I should also disclose that we bumped into this problem because we have PCs connected to the Internet via cellular modems running a Node process. Cell modem connections are iffy at best, and connections drop a lot. Whenever the connection was lost, the entire Node application became unresponsive for a few minutes. This problem was very difficult to debug. Eventually, we just patched `dns.lookup` as described above. When using Node on a server, 99.99% of the time you probably don't care about DNS sharing the thread pool with other I/O, but when you're using Node on PCs where DNS can regularly fail, this issue matters quite a lot.
I'm not suggesting that Node attempt to solve every application's cellular modem issue perfectly. Your example could equally be reframed along many dimensions leading to an explosion in the number of threadpools required. For example, say you're using 32 hard drives in a server, and one drive goes bad and blocks requests, blocking all 128 threads (or all 1024 threads if that were possible etc.). Your logic of having DNS in its own threadpool would similarly suggest having a separate threadpool per failure domain, i.e. one threadpool per hard drive?
It's better than the default of 4 threads. With 128 threads, you've given it enough concurrency, and at that point there's probably something wrong with your network, which would likely overflow any threadpool you throw at it. That kind of thing should be monitored and handled using a queue in your application, before requests hit the IO threadpool. When you see the queue is overflowing, your application should do something about it, rather than encouraging the queue to keep growing (by throwing more threads at it). Perhaps Node could help solve this with a cap on concurrent `dns.lookup` requests, failing fast beyond that, rather than quietly tying up the pool.

So I disagree that Node should have a DNS-only threadpool as you suggest. But there is a problem with Node's threadpool design. The basic issue in all the recent threadpool discussions has been that the current threadpool design conflates CPU-intensive tasks with IO-intensive tasks, and uses a default of 4 threads, which would be good for CPU-intensive tasks but bad for IO-intensive tasks. What you want for CPU-intensive tasks such as `crypto.randomBytes` and `crypto.randomFill` is roughly one thread per core; what you want for IO-intensive tasks is far more concurrency than that.

Two threadpools, one for CPU-intensive tasks and one for IO-intensive tasks, would solve the majority of problems people have with Node's threadpool, allowing classification of tasks and pinning, as well as reasonable concurrency for IO tasks (the 4-thread default is not reasonable), without conflating CPU-intensive tasks with IO-intensive tasks.
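As a sketch of the application-level queue idea above (the names and the limit are made up for illustration): cap in-flight `dns.lookup` calls so a slow resolver can't claim every pool thread, and keep the overflow in the application where it can be observed and acted on.

```js
const dns = require('dns');

const MAX_IN_FLIGHT = 2; // leave threadpool capacity for fs and friends
let inFlight = 0;
const waiting = [];

function boundedLookup(hostname, callback) {
  if (inFlight >= MAX_IN_FLIGHT) {
    waiting.push([hostname, callback]); // queue in the app, not in the pool
    return;
  }
  inFlight++;
  dns.lookup(hostname, (err, address, family) => {
    inFlight--;
    if (waiting.length > 0) boundedLookup(...waiting.shift());
    callback(err, address, family);
  });
}
```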
@jorangreef it would be even better to make the IO pool dynamically sized.
@jorangreef - Fair enough. I agree with what you're saying. 128 threads for I/O seems like a lot, but... I agree with your proposal of a separate CPU threadpool with one thread per CPU core. I wonder... how much RAM is consumed by a sleeping thread? Anyway, the size of both threadpools should be configurable by the user via an environment variable or something. I also like your suggestion of having Node limit the number of concurrent `dns.lookup` requests and fail fast once the limit is reached.
@bminer I thought that might be your answer. @jorangreef I like your idea of separating pools by purpose (blocking or cpu-busy), but I also think for most use-cases it's the same as "increase the threadpool size to 128", since other than the two crypto RNG generation APIs you listed, I'm hard pressed to think of a time I've seen an addon delegate CPU-intensive work to the UV threadpool. The pool is almost entirely used to make (potentially) blocking calls, and the threadpool size can already be increased by anybody who wants to with UV_THREADPOOL_SIZE.

Btw, I'm not for or against increasing the pool size, but it might cause poorer performance for some workloads and help others. It's hard to get good feedback on this, which I think is why it was just left user-configurable. I'm not sure how much evidence would be needed to show this change is usually beneficial. It might be worth bringing up as a LEP: https://github.com/libuv/leps
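For reference, a sketch of the `UV_THREADPOOL_SIZE` knob discussed above; the pool is created lazily on first use, so the variable must be set before any threadpool work is queued (the value 64 is illustrative):

```js
// Usually set in the environment: UV_THREADPOOL_SIZE=64 node app.js
// Setting it at the very top of the entry script, before any fs/dns/crypto
// call touches the pool, has historically worked too:
process.env.UV_THREADPOOL_SIZE = '64';

const fs = require('fs');
fs.stat(__filename, () => {
  // By now the pool exists, sized from the env var read at creation time.
});
```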
Reopening until libuv/libuv#1845 lands in node |
this (libuv/libuv#1845) has landed in node through #22997, closing
@gireeshpunathil Do you have any idea about how one could write a regression test in Node.js for this?
@addaleax - sure, let me think about it, and if possible write one; let's keep this open till then so that I don't forget.
Validate that massive dns lookups do not block filesystem I/O (or any fast I/O for that matter). Prior to libuv/libuv#1845, a few back-to-back dns lookups were sufficient to engage the libuv threadpool workers in a blocking manner, throttling other work items that need the pool. This test acts as a regression test for the same: start slow and fast I/O together, and make sure the fast I/O completes in at most 1/100th of the time taken by the slow I/O.

Refs: libuv/libuv#1845
Refs: nodejs#8436
PR-URL: nodejs#23099
Reviewed-By: Sakthipriyan Vairamani <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
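A hedged sketch of the test shape that commit message describes (not the actual test file; names and counts are illustrative): start the slow DNS work and the fast fs work together and compare elapsed times.

```js
const dns = require('dns');
const fs = require('fs');

const start = Date.now();

// Slow I/O: a batch of lookups that used to monopolize the pool.
let pending = 32;
for (let i = 0; i < 32; i++) {
  dns.lookup(`name-${i}.example`, () => {
    if (--pending === 0) {
      console.log('slow (dns) finished after', Date.now() - start, 'ms');
    }
  });
}

// Fast I/O: with libuv/libuv#1845 in place this completes almost
// immediately; the real test asserts roughly a 1/100 time ratio.
fs.readFile(__filename, () => {
  console.log('fast (fs) finished after', Date.now() - start, 'ms');
});
```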
a regression test is in place (54fa59c), closing.
Using `dns.lookup` rather than `dns.resolve4` also helps prevent DNS poisoning attacks if the platform supports DNSSEC, since `dns.resolve4` uses c-ares, which doesn't support DNSSEC.
Hi all,

I'm posting this issue as a continuation of an issue originally opened against node-serialport: serialport/node-serialport#797

On networks with slow DNS response, or where DNS requests time out and fail, we observe that the blocking calls to `getaddrinfo` issued by `dns.lookup` saturate Node's `libuv` threadpool and delay our serialport or filesystem IO.

It looks like this issue has come up many times before. The most pertinent example I could find is nodejs/node-v0.x-archive#2868, which outlines exactly this issue. However, it was closed in favor of nodejs/node-v0.x-archive#8475, in which the API of `net.connect` was changed to allow a user-specified DNS resolution handler.

The original proposal of issue 2868, which included a change to `dns.lookup`, seems to have been lost in the consolidation to issue 8475. In our case, we'd like to be able to use the OS-level DNS facilities like a local DNS cache, so using `dns.resolve` does not seem to be an equivalent substitute for using `dns.lookup`. Furthermore, our application uses multiple high-level modules that wrap REST APIs with their own request modules, so changing every call to `net.connect` to use a custom DNS function is not feasible.

There seems to be a relevant open issue against `libuv` on the matter at libuv/libuv#203. Through various issue reports, I've seen proposals in `libuv` to use `getaddrinfo_a`, to segregate the threadpools, or to cap the number of threads DNS requests could use, to avoid DNS requests starving all other IO.

Unfortunately, although the synchronicity of `dns.lookup` is documented at https://nodejs.org/api/dns.html#dns_implementation_considerations, its behavior is not intuitive given that other network requests are truly asynchronous, and it has resulted in many downstream issues, including our seeming issue with serialport.