High CPU usage with TLS #1673
Comments
Flat profile: each sample counts as 0.01 seconds. [gprof flat profile table omitted]
Here is another profile output, but this time under what for us is a normal load: 16 users, each streaming approx. 220 kBps, at on the order of 100-200 messages per second each. Flat profile: each sample counts as 0.01 seconds. [gprof flat profile table omitted]
Ah, that's definitely possible, I'll take a look. And thanks for letting me know about using the latest. By the way, I really appreciate this library and especially the support from the main devs like you. Thanks very much for all of your help with the (very few) issues I've hit.
I rebuilt mbedtls with pthread and updated NNG to the latest, but I'm still seeing high CPU usage. This time, there are no active connections at all. I have a REP socket open and listening, with no incoming REQ connections and no other sockets, and I'm seeing ~30% CPU usage on a t2.medium EC2 node. I set a breakpoint on the hotpath from the earlier gprof output, and it does seem like NNG is basically tight-looping on this stack in one of the background threads. The name of the thread is "nng:aio:expire".
#0 nni_list_next (list=list@entry=0x555555c48ce0, item=item@entry=0x7fffee2bf7f0) at /home/ubuntu/fsp-net/thirdparty/nng/src/core/list.c:113
Just thinking about this - are you sure NNG is linking to your updated mbedtls? If that is not it, then is it pair1 that is causing the trouble as per the issue title, or is it REQ/REP as you mention in your replies? I don't have any experience of pair1 myself, so this would have to be something for gdamore.
The only reason I can think of for the nni_aio_expire loop to be spinning like this is either a serious defect that I don't understand, or the timekeeping is busted, or you have a lot of unexpired timers (aios). The code does a somewhat painful linear walk, but if you're not spamming aios it shouldn't be too bad. Are you creating aios on demand?
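For context, a minimal sketch of the reuse pattern that question implies, assuming the public nng_aio API: allocate one aio per socket up front and re-arm it from its callback, rather than allocating a fresh aio for every receive. The struct and function names here are illustrative, not from the reporter's code.

```c
#include <nng/nng.h>

/* Illustrative: one long-lived aio per socket, re-armed from its callback,
 * instead of allocating a new aio for every receive. */
struct rx_state {
	nng_socket sock;
	nng_aio   *aio;
};

static void
rx_cb(void *arg)
{
	struct rx_state *st = arg;

	if (nng_aio_result(st->aio) == 0) {
		nng_msg *msg = nng_aio_get_msg(st->aio);
		/* ... handle the message ... */
		nng_msg_free(msg);
	}
	/* Re-arm the same aio rather than creating another one. */
	nng_recv_aio(st->sock, st->aio);
}

static int
rx_start(struct rx_state *st, nng_socket sock)
{
	int rv;

	st->sock = sock;
	if ((rv = nng_aio_alloc(&st->aio, rx_cb, st)) != 0) {
		return (rv);
	}
	nng_recv_aio(sock, st->aio);
	return (0);
}
```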
I'm probably going to close this unless we can get some confirmation about the above questions.
Or alternatively, a simple reproduction case. |
Sorry for the wait! Will it suffice for me to give you a prebuilt Linux binary with symbols, statically linking to NNG, along with the NNG source tree used to build it? Alternatively I could package NNG as a solib. I also haven't updated NNG since I originally reported this. Perhaps I should do that first to check it hasn't been fixed.
Please do try updating first.
- Garrett
Sorry, was mixing this ticket up with a different one.
I'm still wondering about the inordinately high usage in the aio_expire loop... this area is one that has had historic problems... I'd still like a reproduction case if possible. I don't really want a binary -- I need to review the source to see if there is something I'm missing. If you can strip it down to the minimal parts (I don't want to see your IP), that would help.
For sure, and sorry for the delay - I'll work to get you something tonight, if I can reproduce it in an isolated test with tip NNG.
Very sorry Garrett, I'm not sure I gave correct information in my comments earlier this year; the server actually creates many PAIR1 listeners. I've attached a simple test that uses the tip as of tonight (5750ca1). There are some settings at the top of the file for the cert path and how many ports to open. Below is what I see on an AWS EC2 t2.medium instance running Ubuntu 22.04 LTS. It creates 50 PAIR1 TLS sockets, then sleeps indefinitely with no connections, which hits ~20% CPU. If I switch to plain TCP, usage drops to less than half that. CPU% seems to scale pretty linearly with the socket count. Is this what you'd expect?
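A minimal sketch of a reproduction along those lines (not the attached file itself; the cert path, base port, and socket count below are placeholders), assuming the standard NNG PAIR1 and TLS listener APIs:

```c
#include <nng/nng.h>
#include <nng/protocol/pair1/pair.h>
#include <nng/supplemental/tls/tls.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NUM_SOCKETS 50                  /* placeholder: socket count */
#define CERT_FILE   "/path/to/cert.pem" /* placeholder: PEM with cert and key */
#define BASE_PORT   20000               /* placeholder: first listen port */

int
main(void)
{
	for (int i = 0; i < NUM_SOCKETS; i++) {
		nng_socket      sock;
		nng_listener    l;
		nng_tls_config *cfg;
		char            url[64];
		int             rv;

		if ((rv = nng_pair1_open(&sock)) != 0) {
			fprintf(stderr, "pair1_open: %s\n", nng_strerror(rv));
			exit(1);
		}
		snprintf(url, sizeof(url), "tls+tcp://0.0.0.0:%d", BASE_PORT + i);
		if ((rv = nng_listener_create(&l, sock, url)) != 0 ||
		    (rv = nng_tls_config_alloc(&cfg, NNG_TLS_MODE_SERVER)) != 0 ||
		    (rv = nng_tls_config_cert_key_file(cfg, CERT_FILE, NULL)) != 0 ||
		    (rv = nng_listener_set_ptr(l, NNG_OPT_TLS_CONFIG, cfg)) != 0 ||
		    (rv = nng_listener_start(l, 0)) != 0) {
			fprintf(stderr, "listener setup: %s\n", nng_strerror(rv));
			exit(1);
		}
	}
	pause(); /* idle with no connections; observe CPU with top */
	return 0;
}
```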
I think I might know what is going on: the aio expiration queue might spin when it goes from having items with a timeout to being empty. And if there is nothing timing out, then it will indeed go empty. I'm going to test some things to find out.
Well, nni_time is unsigned... so maybe not. Hm...
The aio for connections was meant to have an infinite sleep (no timeout), but was getting an initial value of zero, so we were spinning on accept.
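Illustrative only, not NNG's actual internals: the general shape of the defect described above, where an "infinite" expiry should be a sentinel value but is instead initialized to zero, so an expire loop sees the entry as permanently expired and spins.

```c
#include <stdbool.h>
#include <stdint.h>

#define TIME_NEVER UINT64_MAX /* sentinel meaning "never expires" */

typedef struct {
	uint64_t expire; /* absolute expiry time, or TIME_NEVER */
} fake_aio;

static bool
is_expired(const fake_aio *aio, uint64_t now)
{
	/* If expire was left at 0 instead of TIME_NEVER, this is true on
	 * every pass, so the expire thread keeps firing and spins. */
	return (aio->expire <= now);
}
```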
You can try this: https://github.com/nanomsg/nng/pull/1725/files -- I strongly suspect that this was the problem. We were almost certainly spinning in accept().
If you confirm that this fixes the problem, I will cut a release with it, because that's clearly a pretty serious defect.
I merged the change, and while this "closed" the ticket, I'd like to know if you can verify that it actually resolves it. I think it will. If you find otherwise, please let me know and we'll open another ticket.
It actually slightly increased CPU usage by a few percentage points - from ~23% to ~25.5%. Just to be sure the change was built properly, I added a log to tls_alloc(), and it's printed 50 times, once per socket.
Ooof. This is kind of crazy. (Previous comment edited because that was something else.)
Does the server start in that state, or does it only happen if you're connected to it? What if all the clients disconnect? Does it return to idle?
Your comments mention starting sockets, but not actually connecting anything...
So I'm running the test now... You have code that is timing out every 50 milliseconds -- for each socket. You have 50 sockets. That is potentially 1000 wakeups per second. That's where the time is going.

Now I have a theory as to why this is worse with TLS. Essentially, if you're using TCP, then the setup is very cheap and fast, and all these timeouts will basically land in the same tick. In a perfect world they all set up within a msec, and then you're only doing 20 wakeups per second. Probably you're not in the pathological case for TLS, but some of the activity is going to take longer... reading the file from disk is going to affect things, and we don't really have a clear picture of how long mbedTLS is going to take to process the certificate and key. A trivial way to test for this is to bump your read timeouts up quite a bit. Try 5 seconds instead.

Having lots of short timeouts is a scalability problem for NNG, and I have some ideas on how to address them in the future, but ultimately it's always going to be the case that the fast code path is reserved for operations that do not time out. In a healthy system, that's what we expect to see. Conversely, if you're just posting a read here because you want to have a blocking read -- then make it a blocking read. :-)

One thing that you should not do is use short timeouts so you can go do other work. If you need to do that, it's better to have a single nng_sleep_aio() -- which can be used to run a periodic callback -- but have just one of those, and let all other socket I/O either sleep or use longer, sensible timeouts.
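A sketch of those two suggestions, assuming the public NNG API (the durations and the wrapper function names are illustrative): give receives a long timeout, or make them fully blocking, and drive any periodic work from a single nng_sleep_aio() callback instead of short timeouts on every socket.

```c
#include <nng/nng.h>

/* Suggestion 1: a long receive timeout instead of polling every 50 ms.
 * Use NNG_DURATION_INFINITE for a fully blocking read. */
static int
configure_recv_timeout(nng_socket sock)
{
	return nng_socket_set_ms(sock, NNG_OPT_RECVTIMEO, 5000);
}

/* Suggestion 2: one nng_sleep_aio-driven callback for the whole process,
 * re-armed from its own callback, for periodic housekeeping. */
static nng_aio *tick_aio;

static void
tick_cb(void *arg)
{
	(void) arg;
	/* ... periodic work, e.g. check a shutdown flag ... */
	nng_sleep_aio(1000, tick_aio); /* fires roughly once per second */
}

static int
start_ticker(void)
{
	int rv;

	if ((rv = nng_aio_alloc(&tick_aio, tick_cb, NULL)) != 0) {
		return (rv);
	}
	nng_sleep_aio(1000, tick_aio);
	return (0);
}
```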
Actually, the printf probably exacerbated the problem by spreading the timeouts out, too! (By slightly extending the setup time, making it less likely that any two sockets would wind up getting initialized with the same final timeout.)
It starts in that state, with no connections - I'm just building and running that single source file, then running 'top'. I'll share a profile first thing in the morning. I can also write a new makefile for it within a Docker container so it's easy to reproduce, and to rule out (or not) some sort of toolchain or dependency issue being the root cause.
Will share a profile with TLS disabled too, in case it's useful. Should I try any socket opts (nonblocking, etc.)?
Please see above ^^ You should just change the timeout for the recvmsg to something a lot larger than 50 ms.
Yep, got it now (didn't see your new comment until after I submitted mine). Makes sense, and thanks for all the detail. Will report back.
Messages are queued to the main thread, so the timeout was essentially just setting the polling period for shutting down the server - there's no need at all for a small value. Is there a way to unblock nng_recvmsg, short of spoofing a message? Thanks very much again for your help.
Closing the socket will wake the receivers.
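A sketch of that pattern, assuming the public NNG API: a receive loop that exits cleanly when another thread calls nng_close() on the socket, since the blocked nng_recvmsg() then returns NNG_ECLOSED.

```c
#include <nng/nng.h>
#include <stdio.h>

/* Illustrative receive loop: unblocked by closing the socket elsewhere. */
static void
recv_loop(nng_socket sock)
{
	for (;;) {
		nng_msg *msg;
		int      rv = nng_recvmsg(sock, &msg, 0);

		if (rv == NNG_ECLOSED) {
			break; /* socket was closed by another thread: shut down */
		}
		if (rv != 0) {
			fprintf(stderr, "recvmsg: %s\n", nng_strerror(rv));
			continue;
		}
		/* ... queue msg to the main thread ... */
		nng_msg_free(msg);
	}
}
```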
We're building and running a server application on both Linux and Windows which accepts a single PAIR1 client with TLS enabled via mbedtls.
Minimal data is being sent while profiling - just periodic keep-alive messages.
We are running this commit from earlier this year, at the recommendation of @shikokuchuo from another thread (#1661):
commit 8e1836f
Date: Thu Feb 9 00:48:17 2023 +0100
On Linux, we see unexpectedly high CPU usage (20-30% single core) in NNG, given the small workload. Windows CPU usage is near 0 for the same server code and use case. The gprof output for Linux is below.
Is there a known performance problem as of this older commit? Or do you have any suggestions for how to investigate further? Thank you for your time.
NNG version: 8e1836f
Operating system and version: 5.15.0-1028-aws
Compiler and language used: clang / C++
Shared or static library: static