High CPU usage with TLS #1673

Closed
jritts opened this issue Jul 27, 2023 · 32 comments
@jritts

jritts commented Jul 27, 2023

We're building and running a server application on both Linux and Windows which accepts a single PAIR1 client with TLS enabled via mbedtls.

Minimal data is being sent while profiling - just periodic keep-alive messages.

We are running this commit from earlier this year, at the recommendation of @shikokuchuo from another thread (#1661):

commit 8e1836f
Date: Thu Feb 9 00:48:17 2023 +0100

On Linux, we see unexpectedly high CPU usage (20-30% single core) in NNG, given the small workload. Windows CPU usage is near 0 for the same server code and use case. The gprof output for Linux is below.

Is there a known performance problem as of this older commit? Or do you have any suggestions for how to investigate further? Thank you for your time.

NNG version: 8e1836f
Operating system and version: 5.15.0-1028-aws
Compiler and language used: clang / C++
Shared or static library: static

@jritts
Author

jritts commented Jul 27, 2023

Flat profile:

Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
85.87 0.79 0.79 nni_aio_expire_loop
7.61 0.86 0.07 nni_list_next
2.17 0.88 0.02 nni_clock
1.09 0.89 0.01 mbedtls_mpi_core_mla
1.09 0.90 0.01 nni_cv_wake
1.09 0.91 0.01 nni_http_server_init
1.09 0.92 0.01 nni_list_first
0.00 0.92 0.00 838 0.00 0.00 statusAddString(char const*, int, char const*, ...)
0.00 0.92 0.00 817 0.00 0.00 getUptimeSecs()
0.00 0.92 0.00 817 0.00 0.00 getWallTimeMillis()
0.00 0.92 0.00 806 0.00 0.00 DataServer::recvFspMsgTimed(DataClient**, long)
0.00 0.92 0.00 806 0.00 0.00 DataServer::tick()
0.00 0.92 0.00 802 0.00 0.00 moodycamel::spsc_sema::LightweightSemaphore::waitWithPartialSpinning(long)

@jritts jritts changed the title from "Slow performance of PAIR1" to "High CPU usage of PAIR1 server on Linux" Jul 27, 2023
@jritts
Author

jritts commented Jul 27, 2023

Here is another profile output, this time under what is a normal load for us: 16 users, each streaming approximately 220 kB/s, on the order of 100-200 messages per second per user.

Flat profile:

Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls us/call us/call name
84.71 24.65 24.65 nni_aio_expire_loop
9.14 27.31 2.66 nni_list_next
1.10 27.63 0.32 nni_clock
0.79 27.86 0.23 nni_list_first
0.76 28.08 0.22 chacha20_block
0.55 28.24 0.16 _init
0.38 28.35 0.11 nni_cv_wake
0.38 28.46 0.11 nni_plat_cv_wake
0.34 28.56 0.10 poly1305_process
0.31 28.65 0.09 mbedtls_mpi_core_mla
0.14 28.69 0.04 mbedtls_mpi_core_shift_r
0.07 28.71 0.02 mbedtls_base64_decode
0.07 28.73 0.02 mbedtls_internal_sha256_process
0.07 28.75 0.02 mbedtls_mpi_core_sub
0.07 28.77 0.02 nni_list_append
0.03 28.78 0.01 27774 0.36 0.54 DataServer::recvFspMsgTimed(DataClient**, long)
0.03 28.79 0.01 18645 0.54 0.54 FspMsg2::parse(unsigned int)
0.03 28.80 0.01 9263 1.08 1.08 TakeRecorder::append(FspMsg2 const&)
0.03 28.81 0.01 ecp_modp
0.03 28.82 0.01 ecp_mul_restartable_internal
0.03 28.83 0.01 mbedtls_chacha20_starts
0.03 28.84 0.01 mbedtls_chacha20_update
0.03 28.85 0.01 mbedtls_chachapoly_finish
0.03 28.86 0.01 mbedtls_chachapoly_update
0.03 28.87 0.01 mbedtls_cipher_auth_encrypt_ext
0.03 28.88 0.01 mbedtls_debug_print_msg
0.03 28.89 0.01 mbedtls_mpi_add_abs
0.03 28.90 0.01 mbedtls_mpi_cmp_abs
0.03 28.91 0.01 mbedtls_mpi_cmp_mpi
0.03 28.92 0.01 mbedtls_mpi_core_montmul
0.03 28.93 0.01 mbedtls_mpi_mul_mpi
0.03 28.94 0.01 mbedtls_mpi_sub_abs
0.03 28.95 0.01 mbedtls_platform_zeroize
0.03 28.96 0.01 mbedtls_poly1305_update
0.03 28.97 0.01 mbedtls_ssl_fetch_input
0.03 28.98 0.01 mbedtls_ssl_get_mode_from_transform
0.03 28.99 0.01 nng_tls_engine_recv
0.03 29.00 0.01 nni_aio_begin
0.03 29.01 0.01 nni_aio_set_iov
0.03 29.02 0.01 nni_list_node_remove
0.03 29.03 0.01 nni_mtx_unlock
0.03 29.04 0.01 nni_plat_cv_wake1
0.03 29.05 0.01 nni_plat_init
0.03 29.06 0.01 nni_plat_mtx_lock
0.03 29.07 0.01 nni_plat_mtx_unlock
0.03 29.08 0.01 tls_tcp_recv_cb
0.03 29.09 0.01 tlstran_pipe_send
0.03 29.10 0.01 tlstran_pipe_send_cb
0.00 29.10 0.00 47947 0.00 0.00 getWallTimeMillis()
0.00 29.10 0.00 37143 0.00 0.00 getUptimeSecs()

@shikokuchuo
Contributor

@jritts apologies about the commit information - it should be fine to use the latest. See my comments at #1662

As you are using TLS, perhaps you made the same mistake of building mbedtls without pthread support? That would explain why the load is high on Linux vs. Windows.
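
(For reference, a quick thing to check in the mbedTLS build - a sketch of the mbedTLS 3.x configuration, not NNG code: threading support is governed by these two macros in mbedtls_config.h, which can also be enabled via scripts/config.py.)

/* In mbedTLS's include/mbedtls/mbedtls_config.h -- both must be defined
 * so that mbedTLS protects its shared state with pthread mutexes: */
#define MBEDTLS_THREADING_C
#define MBEDTLS_THREADING_PTHREAD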

@jritts
Author

jritts commented Jul 27, 2023

Ah, that's definitely possible, I'll take a look. And thanks for letting me know about using the latest.

By the way, I really appreciate this library and especially the support from the main devs like you. Thanks very much for all of your help with the (very few) issues I've hit.

@jritts
Author

jritts commented Jul 27, 2023

I rebuilt mbedtls with pthread and updated NNG to the latest, but I'm still seeing high CPU usage.

This time, there are no active connections at all. I have a REP socket open and listening, with no incoming REQ connections and no other sockets, and I'm seeing ~30% CPU usage on a t2.medium EC2 node.

I set a breakpoint on the hot path from the earlier gprof output, and it does seem like NNG is basically tight-looping on this stack in one of the background threads. The name of the thread is "nng:aio:expire".

#0 nni_list_next (list=list@entry=0x555555c48ce0, item=item@entry=0x7fffee2bf7f0) at /home/ubuntu/fsp-net/thirdparty/nng/src/core/list.c:113
#1 0x0000555555644c13 in nni_aio_expire_loop (arg=0x555555c48c80) at /home/ubuntu/fsp-net/thirdparty/nng/src/core/aio.c:583
#2 0x000055555564ebfa in nni_thr_wrap (arg=0x555555c48cf8) at /home/ubuntu/fsp-net/thirdparty/nng/src/core/thread.c:94
#3 0x0000555555650803 in nni_plat_thr_main (arg=0x555555c48cf8) at /home/ubuntu/fsp-net/thirdparty/nng/src/platform/posix/posix_thread.c:266
#4 0x00007ffff7a57b43 in start_thread (arg=) at ./nptl/pthread_create.c:442
#5 0x00007ffff7ae9a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb)

@shikokuchuo
Contributor

Just thinking about this - are you sure NNG is linking against your updated mbedtls? You can specify the path directly through cmake (see cmake/FindmbedTLS.cmake), but you need to check that it picks it up correctly. Best is just to eliminate any other versions on the system.

If that is not it, then is it PAIR1 that is causing the trouble, as per the issue title, or is it REQ/REP as you mention in your replies? I don't have any experience with PAIR1 myself, so this would have to be something for gdamore.

@jritts jritts changed the title from "High CPU usage of PAIR1 server on Linux" to "High CPU usage on Linux" Jul 28, 2023
@gdamore
Contributor

gdamore commented Sep 14, 2023

The only reasons I can think of for the nni_aio_expire_loop to be spinning like this are a serious defect that I don't understand, busted timekeeping, or a lot of unexpired timers (aios). The code does a somewhat painful linear walk, but if you're not spamming aios it shouldn't be too bad.

Are you creating aios on demand?
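
(For context, a minimal sketch of the pattern being asked about: allocate one aio up front and reuse it, rather than creating aios on demand for each receive. The callback and helper names here are illustrative, not taken from the reporter's code.)

#include <nng/nng.h>

static nng_aio *recv_aio;

/* Completion callback: handle the message and re-arm the same aio.
 * One long-lived aio with an infinite timeout keeps the expiration
 * list short, so nni_aio_expire_loop has little to walk. */
static void
recv_cb(void *arg)
{
	nng_socket *sock = arg;
	nng_msg    *msg;

	if (nng_aio_result(recv_aio) != 0) {
		return; /* e.g. NNG_ECLOSED on shutdown: stop re-arming */
	}
	msg = nng_aio_get_msg(recv_aio);
	/* ... process msg ... */
	nng_msg_free(msg);
	nng_recv_aio(*sock, recv_aio);
}

static int
start_receiving(nng_socket *sock)
{
	int rv;

	if ((rv = nng_aio_alloc(&recv_aio, recv_cb, sock)) != 0) {
		return rv;
	}
	nng_aio_set_timeout(recv_aio, NNG_DURATION_INFINITE);
	nng_recv_aio(*sock, recv_aio);
	return 0;
}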

@gdamore
Contributor

gdamore commented Dec 3, 2023

I'm probably going to close this unless we can get some confirmation about the above questions.

@gdamore
Contributor

gdamore commented Dec 3, 2023

Or alternatively, a simple reproduction case.

@jritts
Author

jritts commented Dec 3, 2023

Sorry for the wait! Will it suffice for me to give you a prebuilt Linux binary with symbols, statically linked to NNG, along with the NNG source tree used to build it?

Alternatively I could package NNG as a solib.

I also haven't updated NNG since I originally reported this. Perhaps I should do that first to check it hasn't been fixed.

@gdamore
Contributor

gdamore commented Dec 3, 2023 via email

@gdamore gdamore changed the title from "High CPU usage on Linux" to "High CPU usage with may REQ contexts on Linux" Dec 15, 2023
@gdamore gdamore changed the title from "High CPU usage with may REQ contexts on Linux" to "REQ contexts do not scale well" Dec 15, 2023
@gdamore gdamore changed the title from "REQ contexts do not scale well" to "High CPU usage on Linux with mbedTLS" Dec 15, 2023
@gdamore
Contributor

gdamore commented Dec 15, 2023

Sorry, I was mixing this ticket up with a different one.

@gdamore
Contributor

gdamore commented Dec 15, 2023

I'm still wondering about the inordinately high CPU use in the aio_expire loop... this is an area that has had historic problems...

I'd still like a reproduction case if possible. I don't really want a binary -- as I need to review the source to see if there is something I'm missing. If you can strip it down to the minimal parts (I don't want to see your IP) that would help.

@jritts
Author

jritts commented Dec 16, 2023

For sure, and sorry for the delay - I'll work to get you something tonight, if I can reproduce it in an isolated test with tip NNG.

@jritts
Author

jritts commented Dec 16, 2023

Very sorry Garrett, I'm not sure I gave correct information in comments earlier this year; the server actually creates many PAIR1 listeners.

I've attached a simple test that uses the tip as of tonight, 5750ca1. There are some settings at the top of the file for the cert path and how many ports to open.

Below is what I see on an AWS EC2 t2.medium instance running Ubuntu 22.04 LTS. It creates 50 PAIR1 TLS sockets, then sleeps indefinitely with no connections, which hits ~20% CPU. If I switch to plain TCP, usage drops to less than half that. CPU usage seems to scale pretty linearly with the socket count.

Is this what you'd expect?

$ lsb_release -a
Distributor ID: Ubuntu
Description:    Ubuntu 22.04 LTS
Release:    22.04
Codename:   jammy

$ ./nngtest &
NNG v1.6.0, Mbed-TLS v3.4.0

$ top
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1062394 ubuntu    20   0  794592   6400   3840 S  22.9   0.2   0:02.31 nngtest

$ valgrind --tool=callgrind ./nngtest

$ callgrind_annotate callgrind.out.1062645 | less

--------------------------------------------------------------------------------
Ir                   file:function
--------------------------------------------------------------------------------
67,534,125 (28.36%)  fsp-net/thirdparty/nng/src/core/aio.c:nni_aio_expire_loop [/home/ubuntu/fsp-net/build/app/nngtest/nngtest]
46,651,147 (19.59%)  fsp-net/thirdparty/nng/src/core/list.c:nni_list_next [/home/ubuntu/fsp-net/build/app/nngtest/nngtest]
41,196,900 (17.30%)  fsp-net/thirdparty/mbedtls/library/constant_time.c:mbedtls_ct_base64_dec_value [/home/ubuntu/fsp-net/build/app/nngtest/nngtest]
21,974,350 ( 9.23%)  fsp-net/thirdparty/mbedtls/library/base64.c:mbedtls_base64_decode [/home/ubuntu/fsp-net/build/app/nngtest/nngtest]
 6,849,066 ( 2.88%)  ./nptl/./nptl/pthread_mutex_lock.c:pthread_mutex_lock@@GLIBC_2.2.5 [/usr/lib/x86_64-linux-gnu/libc.so.6]
 5,976,360 ( 2.51%)  fsp-net/thirdparty/nng/src/platform/posix/posix_clock.c:nni_clock [/home/ubuntu/fsp-net/build/app/nngtest/nngtest]
 5,846,980 ( 2.46%)  ./nptl/./nptl/pthread_mutex_unlock.c:pthread_mutex_unlock@@GLIBC_2.2.5 [/usr/lib/x86_64-linux-gnu/libc.so.6]
 5,267,867 ( 2.21%)  ./nptl/./nptl/pthread_cond_broadcast.c:pthread_cond_broadcast@@GLIBC_2.3.2 [/usr/lib/x86_64-linux-gnu/libc.so.6]
 2,529,560 ( 1.06%)  ./time/../sysdeps/unix/sysv/linux/clock_gettime.c:clock_gettime@@GLIBC_2.17 [/usr/lib/x86_64-linux-gnu/libc.so.6]

nngtest.cpp.txt
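
(This is not the attached file, but a rough sketch of the kind of test described above, assuming one shared server-mode TLS config; the certificate path and port range are placeholders.)

#include <stdio.h>
#include <nng/nng.h>
#include <nng/protocol/pair1/pair.h>
#include <nng/supplemental/tls/tls.h>

#define NUM_SOCKETS 50

int
main(void)
{
	nng_tls_config *tls;

	/* One server-mode TLS config shared by all listeners; the
	 * combined cert+key path is a placeholder. */
	if (nng_tls_config_alloc(&tls, NNG_TLS_MODE_SERVER) != 0 ||
	    nng_tls_config_cert_key_file(tls, "/path/to/cert.pem", NULL) != 0) {
		fprintf(stderr, "TLS config failed\n");
		return 1;
	}

	/* 50 PAIR1 listeners, no connections: just listen and idle. */
	for (int i = 0; i < NUM_SOCKETS; i++) {
		nng_socket   s;
		nng_listener l;
		char         url[64];

		snprintf(url, sizeof(url), "tls+tcp://0.0.0.0:%d", 5000 + i);
		if (nng_pair1_open(&s) != 0 ||
		    nng_listener_create(&l, s, url) != 0 ||
		    nng_listener_set_ptr(l, NNG_OPT_TLS_CONFIG, tls) != 0 ||
		    nng_listener_start(l, 0) != 0) {
			fprintf(stderr, "listener %d failed\n", i);
			return 1;
		}
	}

	/* Sleep indefinitely and watch CPU usage in top. */
	for (;;) {
		nng_msleep(60000);
	}
}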

@gdamore
Contributor

gdamore commented Dec 16, 2023

I think I might know what is going on. I think the aio expiration queue might spin when it goes from having items with a timeout to being empty. And if there is nothing timing out, then it will indeed go empty. I'm going to test some things to find out.

@gdamore
Contributor

gdamore commented Dec 16, 2023

Well, nni_time is unsigned... so maybe not. Hm...

@gdamore gdamore changed the title from "High CPU usage on Linux with mbedTLS" to "High CPU usage with TLS" Dec 16, 2023
gdamore added a commit that referenced this issue Dec 16, 2023
The aio for connections was meant to have an infinite sleep (no
timeout), but was getting an initial value of zero, so we were spinning
on accept.
@gdamore
Contributor

gdamore commented Dec 16, 2023

You can try this: https://github.com/nanomsg/nng/pull/1725/files

I suspect strongly that this was the problem. We were almost certainly spinning in accept().

@gdamore
Contributor

gdamore commented Dec 16, 2023

If you confirm that this fixes the problem, I will cut a release with it. Because that's clearly a pretty serious defect.

@gdamore
Contributor

gdamore commented Dec 16, 2023

I merged the change, and while this "closed" the ticket, I'd like to know if you can verify that it actually resolves it. I think it will. If you find otherwise, please let me know and we'll open another ticket.

@gdamore gdamore removed the feedback label Dec 17, 2023
@jritts
Author

jritts commented Dec 17, 2023

It actually slightly increased CPU usage by a few percentage points - from 23 to 25.5.

Just to be sure the change was built properly, I added a log to tls_alloc() and it's printed 50 times, once per socket.


diff --git a/src/supplemental/tls/tls_common.c b/src/supplemental/tls/tls_common.c
index d05a289d..826eb139 100644
--- a/src/supplemental/tls/tls_common.c
+++ b/src/supplemental/tls/tls_common.c
@@ -837,6 +837,9 @@ tls_get(void *arg, const char *name, void *buf, size_t *szp, nni_type t)
 static int
 tls_alloc(tls_conn **conn_p, nng_tls_config *cfg, nng_aio *user_aio)
 {
+       static int counter = 0;
+       fprintf(stderr, "tls_alloc %d\n", counter++);
+
        tls_conn *            conn;
        const nng_tls_engine *eng;
        size_t                size;
@@ -867,6 +870,7 @@ tls_alloc(tls_conn **conn_p, nng_tls_config *cfg, nng_aio *user_aio)
        nni_aio_list_init(&conn->send_queue);
        nni_aio_list_init(&conn->recv_queue);
        nni_mtx_init(&conn->lock);
+       nni_aio_set_timeout(&conn->conn_aio, NNG_DURATION_INFINITE);
        nni_aio_set_timeout(&conn->tcp_send, NNG_DURATION_INFINITE);
        nni_aio_set_timeout(&conn->tcp_recv, NNG_DURATION_INFINITE);

@gdamore
Contributor

gdamore commented Dec 17, 2023

Ooof. This is kind of crazy. (Previous comment edited because that was something else.)

@gdamore
Contributor

gdamore commented Dec 17, 2023

Does the server start in that state, or does it only happen if you're connected to it? What if all the clients disconnect? Does it return to idle?

@gdamore
Contributor

gdamore commented Dec 17, 2023

Your comments mention starting sockets, but not actually connecting anything...

@gdamore
Contributor

gdamore commented Dec 17, 2023

So I'm running the test now... You have code that is timing out every 50 milliseconds -- for each socket. You have 50 sockets. That is potentially 1000 wakeups per second. That's where the time is going.

Now I have a theory as to why this is worse with TLS.

Essentially, if you're using TCP, then the setup is very cheap and fast, and all these timeouts will basically land in the same tick. In a perfect world they all set up within a millisecond, and then you're only doing 20 wakeups per second. You're probably not in the fully pathological case for TLS, but some of the activity is going to take longer... reading the certificate file from disk is probably going to affect things, and we don't really have a clear picture of how long mbedTLS takes to process the certificate and key.

A trivial way to test for this is to bump your read timeouts up quite a bit. Try 5 seconds instead.
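
(For instance, assuming the test applies the 50 ms timeout through the NNG_OPT_RECVTIMEO socket option, the change amounts to something like this sketch, not the actual test file:)

#include <nng/nng.h>

/* 50 sockets each expiring a receive every 50 ms is ~1000 timer wakeups
 * per second; a 5 s timeout on idle sockets cuts that to ~10. */
static int
relax_recv_timeout(nng_socket sock)
{
	return nng_socket_set_ms(sock, NNG_OPT_RECVTIMEO, 5000); /* was 50 */
}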

Having lots of short timeouts is a scalability problem for NNG, and I have some ideas on how to address that in the future, but ultimately it's always going to be the case that the fast code path is reserved for operations that do not time out. In a healthy system, that's what we expect to see.

Conversely, if you're just posting a read here because you want to have a blocking read -- then make it a blocking read. :-)

One thing that you should not do is use short timeouts so you can go do other work. If you need to do that, it's better to have a single nng_sleep_aio() -- which can be used to run a periodic callback -- but have just one of those, and let all other socket I/O either sleep or use longer, sensible timeouts.
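
(A sketch of that suggestion: one nng_sleep_aio drives the periodic work, re-arming itself in its callback, while socket reads block or use long timeouts. The 1-second period and callback body are placeholders.)

#include <nng/nng.h>

static nng_aio *tick_aio;

/* Fires once per interval and re-arms the same aio, so the expiration
 * list holds exactly one timer regardless of how many sockets exist. */
static void
tick_cb(void *arg)
{
	(void) arg;
	if (nng_aio_result(tick_aio) != 0) {
		return; /* cancelled during shutdown */
	}
	/* ... periodic housekeeping: check shutdown flag, stats, etc. ... */
	nng_sleep_aio(1000, tick_aio);
}

static int
start_ticker(void)
{
	int rv;

	if ((rv = nng_aio_alloc(&tick_aio, tick_cb, NULL)) != 0) {
		return rv;
	}
	nng_sleep_aio(1000, tick_aio);
	return 0;
}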

@gdamore
Contributor

gdamore commented Dec 17, 2023

Actually, the printf probably exacerbated the problem by spreading out the timeouts, too! (By slightly extending the setup time, thereby making it less likely that any two sockets would wind up getting initialized with the same final timeout.)

@jritts
Author

jritts commented Dec 17, 2023

It starts in that state, no connections - I'm just building and running that single source file, then running 'top'. I'll share a profile first thing in the morning.

I can also write a new makefile for it within a Docker container so it's easy to reproduce, and to rule out (or not) some sort of toolchain or dependency issue as the root cause.

@jritts
Author

jritts commented Dec 17, 2023

I'll share a profile with TLS disabled too, in case it's useful.

Should I try any socket options (non-blocking, etc.)?

@gdamore
Contributor

gdamore commented Dec 17, 2023

Please see above ^^ You should just change the timeout for the recvmsg to something a lot larger than 50 ms.

@jritts
Author

jritts commented Dec 17, 2023

Yep, got it now (I didn't see your new comment until after I submitted mine). Makes sense, and thanks for all the detail. Will report back.

@jritts
Author

jritts commented Dec 17, 2023

Messages are queued to the main thread, so the timeout was essentially just setting the polling period for shutting down the server - there's no need at all for a small value.

Is there a way to unblock nng_recvmsg, short of spoofing a message?

Thanks very much again for your help.

@gdamore
Contributor

gdamore commented Dec 17, 2023

Closing the socket will wake the receivers.
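
(For example, a sketch with a hypothetical receive loop: a thread blocked in nng_recvmsg() with an infinite timeout returns NNG_ECLOSED once another thread closes the socket, so no spoofed message is needed.)

#include <nng/nng.h>

/* Blocks until a message arrives or the socket is closed elsewhere. */
static void
recv_loop(nng_socket sock)
{
	for (;;) {
		nng_msg *msg;
		int      rv = nng_recvmsg(sock, &msg, 0);

		if (rv == NNG_ECLOSED) {
			break; /* woken by nng_close() from the shutdown path */
		}
		if (rv != 0) {
			continue;
		}
		/* ... process msg ... */
		nng_msg_free(msg);
	}
}

static void
shutdown_server(nng_socket sock)
{
	nng_close(sock); /* wakes any thread blocked in nng_recvmsg() */
}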
