High CPU usage with TLS #1673

Closed
jritts opened this issue Jul 27, 2023 · 32 comments
@jritts

jritts commented Jul 27, 2023

We're building and running a server application on both Linux and Windows which accepts a single PAIR1 client with TLS enabled via mbedtls.

Minimal data is being sent while profiling - just periodic keep-alive messages.

We are running this commit from earlier this year, at the recommendation of @shikokuchuo from another thread (#1661):

commit 8e1836f
Date: Thu Feb 9 00:48:17 2023 +0100

On Linux, we see unexpectedly high CPU usage (20-30% single core) in NNG, given the small workload. Windows CPU usage is near 0 for the same server code and use case. The gprof output for Linux is below.

Is there a known performance problem as of this older commit? Or do you have any suggestions for how to investigate further? Thank you for your time.

NNG version: 8e1836f
Operating system and version: 5.15.0-1028-aws
Compiler and language used: clang / C++
Shared or static library: static

@jritts
Author

jritts commented Jul 27, 2023

Flat profile:

Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
85.87 0.79 0.79 nni_aio_expire_loop
7.61 0.86 0.07 nni_list_next
2.17 0.88 0.02 nni_clock
1.09 0.89 0.01 mbedtls_mpi_core_mla
1.09 0.90 0.01 nni_cv_wake
1.09 0.91 0.01 nni_http_server_init
1.09 0.92 0.01 nni_list_first
0.00 0.92 0.00 838 0.00 0.00 statusAddString(char const*, int, char const*, ...)
0.00 0.92 0.00 817 0.00 0.00 getUptimeSecs()
0.00 0.92 0.00 817 0.00 0.00 getWallTimeMillis()
0.00 0.92 0.00 806 0.00 0.00 DataServer::recvFspMsgTimed(DataClient**, long)
0.00 0.92 0.00 806 0.00 0.00 DataServer::tick()
0.00 0.92 0.00 802 0.00 0.00 moodycamel::spsc_sema::LightweightSemaphore::waitWithPartialSpinning(long)

@jritts jritts changed the title from "Slow performance of PAIR1" to "High CPU usage of PAIR1 server on Linux" Jul 27, 2023
@jritts
Author

jritts commented Jul 27, 2023

Here is another profile output, this time under what is a normal load for us: 16 users, each streaming approximately 220 kB/s, on the order of 100-200 messages per second per user.

Flat profile:

Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls us/call us/call name
84.71 24.65 24.65 nni_aio_expire_loop
9.14 27.31 2.66 nni_list_next
1.10 27.63 0.32 nni_clock
0.79 27.86 0.23 nni_list_first
0.76 28.08 0.22 chacha20_block
0.55 28.24 0.16 _init
0.38 28.35 0.11 nni_cv_wake
0.38 28.46 0.11 nni_plat_cv_wake
0.34 28.56 0.10 poly1305_process
0.31 28.65 0.09 mbedtls_mpi_core_mla
0.14 28.69 0.04 mbedtls_mpi_core_shift_r
0.07 28.71 0.02 mbedtls_base64_decode
0.07 28.73 0.02 mbedtls_internal_sha256_process
0.07 28.75 0.02 mbedtls_mpi_core_sub
0.07 28.77 0.02 nni_list_append
0.03 28.78 0.01 27774 0.36 0.54 DataServer::recvFspMsgTimed(DataClient**, long)
0.03 28.79 0.01 18645 0.54 0.54 FspMsg2::parse(unsigned int)
0.03 28.80 0.01 9263 1.08 1.08 TakeRecorder::append(FspMsg2 const&)
0.03 28.81 0.01 ecp_modp
0.03 28.82 0.01 ecp_mul_restartable_internal
0.03 28.83 0.01 mbedtls_chacha20_starts
0.03 28.84 0.01 mbedtls_chacha20_update
0.03 28.85 0.01 mbedtls_chachapoly_finish
0.03 28.86 0.01 mbedtls_chachapoly_update
0.03 28.87 0.01 mbedtls_cipher_auth_encrypt_ext
0.03 28.88 0.01 mbedtls_debug_print_msg
0.03 28.89 0.01 mbedtls_mpi_add_abs
0.03 28.90 0.01 mbedtls_mpi_cmp_abs
0.03 28.91 0.01 mbedtls_mpi_cmp_mpi
0.03 28.92 0.01 mbedtls_mpi_core_montmul
0.03 28.93 0.01 mbedtls_mpi_mul_mpi
0.03 28.94 0.01 mbedtls_mpi_sub_abs
0.03 28.95 0.01 mbedtls_platform_zeroize
0.03 28.96 0.01 mbedtls_poly1305_update
0.03 28.97 0.01 mbedtls_ssl_fetch_input
0.03 28.98 0.01 mbedtls_ssl_get_mode_from_transform
0.03 28.99 0.01 nng_tls_engine_recv
0.03 29.00 0.01 nni_aio_begin
0.03 29.01 0.01 nni_aio_set_iov
0.03 29.02 0.01 nni_list_node_remove
0.03 29.03 0.01 nni_mtx_unlock
0.03 29.04 0.01 nni_plat_cv_wake1
0.03 29.05 0.01 nni_plat_init
0.03 29.06 0.01 nni_plat_mtx_lock
0.03 29.07 0.01 nni_plat_mtx_unlock
0.03 29.08 0.01 tls_tcp_recv_cb
0.03 29.09 0.01 tlstran_pipe_send
0.03 29.10 0.01 tlstran_pipe_send_cb
0.00 29.10 0.00 47947 0.00 0.00 getWallTimeMillis()
0.00 29.10 0.00 37143 0.00 0.00 getUptimeSecs()

@shikokuchuo
Contributor

@jritts apologies about the commit information - it should be fine to use the latest. See my comments at #1662

As you are using TLS, perhaps you made the same mistake of building mbedtls without pthread support? That would explain why the load is high on Linux vs. Windows.
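
(For reference, a quick thing to check in the mbedTLS build - a sketch of the mbedTLS 3.x configuration, not NNG code: threading support is governed by these two macros in mbedtls_config.h, which can also be enabled via scripts/config.py.)

/* In mbedTLS's include/mbedtls/mbedtls_config.h -- both must be defined
 * so that mbedTLS protects its shared state with pthread mutexes: */
#define MBEDTLS_THREADING_C
#define MBEDTLS_THREADING_PTHREAD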

@jritts
Author

jritts commented Jul 27, 2023

Ah, that's definitely possible, I'll take a look. And thanks for letting me know about using the latest.

By the way, I really appreciate this library and especially the support from the main devs like you. Thanks very much for all of your help with the (very few) issues I've hit.

@jritts
Author

jritts commented Jul 27, 2023

I rebuilt mbedtls with pthread and updated NNG to the latest, but I'm still seeing high CPU usage.

This time, there are no active connections at all. I have a REP socket open and listening, with no incoming REQ connections and no other sockets, and I'm seeing ~30% CPU usage on a t2.medium EC2 node.

I set a breakpoint on the hot path from the earlier gprof output, and it does seem like NNG is basically tight-looping on this stack in one of the background threads. The name of the thread is "nng:aio:expire".

#0 nni_list_next (list=list@entry=0x555555c48ce0, item=item@entry=0x7fffee2bf7f0) at /home/ubuntu/fsp-net/thirdparty/nng/src/core/list.c:113
#1 0x0000555555644c13 in nni_aio_expire_loop (arg=0x555555c48c80) at /home/ubuntu/fsp-net/thirdparty/nng/src/core/aio.c:583
#2 0x000055555564ebfa in nni_thr_wrap (arg=0x555555c48cf8) at /home/ubuntu/fsp-net/thirdparty/nng/src/core/thread.c:94
#3 0x0000555555650803 in nni_plat_thr_main (arg=0x555555c48cf8) at /home/ubuntu/fsp-net/thirdparty/nng/src/platform/posix/posix_thread.c:266
#4 0x00007ffff7a57b43 in start_thread (arg=) at ./nptl/pthread_create.c:442
#5 0x00007ffff7ae9a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb)

@shikokuchuo
Contributor

Just thinking about this - are you sure NNG is linking against your updated mbedtls? You can specify the path directly through cmake (see cmake/FindmbedTLS.cmake), but you need to check that it picks it up correctly. Best is just to eliminate any other versions on the system.

If that is not it, then is it PAIR1 that is causing the trouble, as per the issue title, or is it REQ/REP as you mention in your replies? I don't have any experience with PAIR1 myself, so this would have to be something for gdamore.

@jritts jritts changed the title from "High CPU usage of PAIR1 server on Linux" to "High CPU usage on Linux" Jul 28, 2023
@gdamore
Contributor

gdamore commented Sep 14, 2023

The only reasons I can think of for the nni_aio_expire_loop to be spinning like this are a serious defect that I don't understand, busted timekeeping, or a lot of unexpired timers (aios). The code does a somewhat painful linear walk, but if you're not spamming aios it shouldn't be too bad.

Are you creating aios on demand?
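
(For context, a minimal sketch of the pattern being asked about: allocate one aio up front and reuse it, rather than creating aios on demand for each receive. The callback and helper names here are illustrative, not taken from the reporter's code.)

#include <nng/nng.h>

static nng_aio *recv_aio;

/* Completion callback: handle the message and re-arm the same aio.
 * One long-lived aio with an infinite timeout keeps the expiration
 * list short, so nni_aio_expire_loop has little to walk. */
static void
recv_cb(void *arg)
{
	nng_socket *sock = arg;
	nng_msg    *msg;

	if (nng_aio_result(recv_aio) != 0) {
		return; /* e.g. NNG_ECLOSED on shutdown: stop re-arming */
	}
	msg = nng_aio_get_msg(recv_aio);
	/* ... process msg ... */
	nng_msg_free(msg);
	nng_recv_aio(*sock, recv_aio);
}

static int
start_receiving(nng_socket *sock)
{
	int rv;

	if ((rv = nng_aio_alloc(&recv_aio, recv_cb, sock)) != 0) {
		return rv;
	}
	nng_aio_set_timeout(recv_aio, NNG_DURATION_INFINITE);
	nng_recv_aio(*sock, recv_aio);
	return 0;
}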

@gdamore
Contributor

gdamore commented Dec 3, 2023

I'm probably going to close this unless we can get some confirmation about the above questions.

@gdamore
Contributor

gdamore commented Dec 3, 2023

Or alternatively, a simple reproduction case.

@jritts
Author

jritts commented Dec 3, 2023

Sorry for the wait! Will it suffice for me to give you a prebuilt Linux binary with symbols, statically linked to NNG, along with the NNG source tree used to build it?

Alternatively I could package NNG as a solib.

I also haven't updated NNG since I originally reported this. Perhaps I should do that first to check it hasn't been fixed.

@gdamore
Contributor

gdamore commented Dec 3, 2023 via email

@gdamore gdamore changed the title from "High CPU usage on Linux" to "High CPU usage with may REQ contexts on Linux" Dec 15, 2023
@gdamore gdamore changed the title from "High CPU usage with may REQ contexts on Linux" to "REQ contexts do not scale well" Dec 15, 2023
@gdamore gdamore changed the title from "REQ contexts do not scale well" to "High CPU usage on Linux with mbedTLS" Dec 15, 2023
@gdamore
Contributor

gdamore commented Dec 15, 2023

Sorry, I was mixing this ticket up with a different one.

@gdamore
Contributor

gdamore commented Dec 15, 2023

I'm still wondering about the inordinately high CPU use in the aio_expire loop... this is an area that has had historic problems...

I'd still like a reproduction case if possible. I don't really want a binary -- as I need to review the source to see if there is something I'm missing. If you can strip it down to the minimal parts (I don't want to see your IP) that would help.

@jritts
Author

jritts commented Dec 16, 2023

For sure, and sorry for the delay - I'll work to get you something tonight, if I can reproduce it in an isolated test with tip NNG.

@jritts
Author

jritts commented Dec 16, 2023

Very sorry Garrett, I'm not sure I gave correct information in comments earlier this year; the server actually creates many PAIR1 listeners.

I've attached a simple test that uses the tip as of tonight, 5750ca1. There are some settings at the top of the file for the cert path and how many ports to open.

Below is what I see on an AWS EC2 t2.medium instance running Ubuntu 22.04 LTS. It creates 50 PAIR1 TLS sockets, then sleeps indefinitely with no connections, which hits ~20% CPU. If I switch to plain TCP, usage drops to less than half that. CPU usage seems to scale pretty linearly with the socket count.

Is this what you'd expect?

$ lsb_release -a
Distributor ID: Ubuntu
Description:    Ubuntu 22.04 LTS
Release:    22.04
Codename:   jammy

$ ./nngtest &
NNG v1.6.0, Mbed-TLS v3.4.0

$ top
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1062394 ubuntu    20   0  794592   6400   3840 S  22.9   0.2   0:02.31 nngtest

$ valgrind --tool=callgrind ./nngtest

$ callgrind_annotate callgrind.out.1062645 | less

--------------------------------------------------------------------------------
Ir                   file:function
--------------------------------------------------------------------------------
67,534,125 (28.36%)  fsp-net/thirdparty/nng/src/core/aio.c:nni_aio_expire_loop [/home/ubuntu/fsp-net/build/app/nngtest/nngtest]
46,651,147 (19.59%)  fsp-net/thirdparty/nng/src/core/list.c:nni_list_next [/home/ubuntu/fsp-net/build/app/nngtest/nngtest]
41,196,900 (17.30%)  fsp-net/thirdparty/mbedtls/library/constant_time.c:mbedtls_ct_base64_dec_value [/home/ubuntu/fsp-net/build/app/nngtest/nngtest]
21,974,350 ( 9.23%)  fsp-net/thirdparty/mbedtls/library/base64.c:mbedtls_base64_decode [/home/ubuntu/fsp-net/build/app/nngtest/nngtest]
 6,849,066 ( 2.88%)  ./nptl/./nptl/pthread_mutex_lock.c:pthread_mutex_lock@@GLIBC_2.2.5 [/usr/lib/x86_64-linux-gnu/libc.so.6]
 5,976,360 ( 2.51%)  fsp-net/thirdparty/nng/src/platform/posix/posix_clock.c:nni_clock [/home/ubuntu/fsp-net/build/app/nngtest/nngtest]
 5,846,980 ( 2.46%)  ./nptl/./nptl/pthread_mutex_unlock.c:pthread_mutex_unlock@@GLIBC_2.2.5 [/usr/lib/x86_64-linux-gnu/libc.so.6]
 5,267,867 ( 2.21%)  ./nptl/./nptl/pthread_cond_broadcast.c:pthread_cond_broadcast@@GLIBC_2.3.2 [/usr/lib/x86_64-linux-gnu/libc.so.6]
 2,529,560 ( 1.06%)  ./time/../sysdeps/unix/sysv/linux/clock_gettime.c:clock_gettime@@GLIBC_2.17 [/usr/lib/x86_64-linux-gnu/libc.so.6]

nngtest.cpp.txt
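
(This is not the attached file, but a rough sketch of the kind of test described above, assuming one shared server-mode TLS config; the certificate path and port range are placeholders.)

#include <stdio.h>
#include <nng/nng.h>
#include <nng/protocol/pair1/pair.h>
#include <nng/supplemental/tls/tls.h>

#define NUM_SOCKETS 50

int
main(void)
{
	nng_tls_config *tls;

	/* One server-mode TLS config shared by all listeners; the
	 * combined cert+key path is a placeholder. */
	if (nng_tls_config_alloc(&tls, NNG_TLS_MODE_SERVER) != 0 ||
	    nng_tls_config_cert_key_file(tls, "/path/to/cert.pem", NULL) != 0) {
		fprintf(stderr, "TLS config failed\n");
		return 1;
	}

	/* 50 PAIR1 listeners, no connections: just listen and idle. */
	for (int i = 0; i < NUM_SOCKETS; i++) {
		nng_socket   s;
		nng_listener l;
		char         url[64];

		snprintf(url, sizeof(url), "tls+tcp://0.0.0.0:%d", 5000 + i);
		if (nng_pair1_open(&s) != 0 ||
		    nng_listener_create(&l, s, url) != 0 ||
		    nng_listener_set_ptr(l, NNG_OPT_TLS_CONFIG, tls) != 0 ||
		    nng_listener_start(l, 0) != 0) {
			fprintf(stderr, "listener %d failed\n", i);
			return 1;
		}
	}

	/* Sleep indefinitely and watch CPU usage in top. */
	for (;;) {
		nng_msleep(60000);
	}
}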

@gdamore
Contributor

gdamore commented Dec 16, 2023

I think I might know what is going on. I think the aio expiration queue might spin when it goes from having items with a timeout to being empty. And if there is nothing timing out, then it will indeed go empty. I'm going to test some things to find out.

@gdamore
Contributor

gdamore commented Dec 16, 2023

Well, nni_time is unsigned... so maybe not. Hm...

@gdamore gdamore changed the title from "High CPU usage on Linux with mbedTLS" to "High CPU usage with TLS" Dec 16, 2023
gdamore added a commit that referenced this issue Dec 16, 2023
The aio for connections was meant to have an infinite sleep (no
timeout), but was getting an initial value of zero, so we were spinning
on accept.
@gdamore
Contributor

gdamore commented Dec 16, 2023

You can try this: https://github.com/nanomsg/nng/pull/1725/files

I suspect strongly that this was the problem. We were almost certainly spinning in accept().

@gdamore
Contributor

gdamore commented Dec 16, 2023

If you confirm that this fixes the problem, I will cut a release with it. Because that's clearly a pretty serious defect.

@gdamore
Contributor

gdamore commented Dec 16, 2023

I merged the change, and while this "closed" the ticket, I'd like to know if you can verify that it actually resolves it. I think it will. If you find otherwise, please let me know and we'll open another ticket.

@gdamore gdamore removed the feedback label Dec 17, 2023
@jritts
Author

jritts commented Dec 17, 2023

It actually slightly increased CPU usage by a few percentage points - from 23 to 25.5.

Just to be sure the change was built properly, I added a log to tls_alloc() and it's printed 50 times, once per socket.


diff --git a/src/supplemental/tls/tls_common.c b/src/supplemental/tls/tls_common.c
index d05a289d..826eb139 100644
--- a/src/supplemental/tls/tls_common.c
+++ b/src/supplemental/tls/tls_common.c
@@ -837,6 +837,9 @@ tls_get(void *arg, const char *name, void *buf, size_t *szp, nni_type t)
 static int
 tls_alloc(tls_conn **conn_p, nng_tls_config *cfg, nng_aio *user_aio)
 {
+       static int counter = 0;
+       fprintf(stderr, "tls_alloc %d\n", counter++);
+
        tls_conn *            conn;
        const nng_tls_engine *eng;
        size_t                size;
@@ -867,6 +870,7 @@ tls_alloc(tls_conn **conn_p, nng_tls_config *cfg, nng_aio *user_aio)
        nni_aio_list_init(&conn->send_queue);
        nni_aio_list_init(&conn->recv_queue);
        nni_mtx_init(&conn->lock);
+       nni_aio_set_timeout(&conn->conn_aio, NNG_DURATION_INFINITE);
        nni_aio_set_timeout(&conn->tcp_send, NNG_DURATION_INFINITE);
        nni_aio_set_timeout(&conn->tcp_recv, NNG_DURATION_INFINITE);

@gdamore
Contributor

gdamore commented Dec 17, 2023

Ooof. This is kind of crazy. (Previous comment edited because that was something else.)

@gdamore
Contributor

gdamore commented Dec 17, 2023

Does the server start in that state, or does it only happen if you're connected to it? What if all the clients disconnect? Does it return to idle?

@gdamore
Contributor

gdamore commented Dec 17, 2023

Your comments mention starting sockets, but not actually connecting anything...

@gdamore
Contributor

gdamore commented Dec 17, 2023

So I'm running the test now... You have code that is timing out every 50 milliseconds -- for each socket. You have 50 sockets. That is potentially 1000 wakeups per second. That's where the time is going.

Now I have a theory as to why this is worse with TLS.

Essentially, if you're using TCP, then the setup is very cheap and fast, and all these timeouts will basically land in the same tick. In a perfect world they all set up within a millisecond, and then you're only doing 20 wakeups per second. You're probably not in the fully pathological case for TLS, but some of the activity is going to take longer... reading the certificate file from disk is probably going to affect things, and we don't really have a clear picture of how long mbedTLS takes to process the certificate and key.

A trivial way to test for this is to bump your read timeouts up quite a bit. Try 5 seconds instead.
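
(For instance, assuming the test applies the 50 ms timeout through the NNG_OPT_RECVTIMEO socket option, the change amounts to something like this sketch, not the actual test file:)

#include <nng/nng.h>

/* 50 sockets each expiring a receive every 50 ms is ~1000 timer wakeups
 * per second; a 5 s timeout on idle sockets cuts that to ~10. */
static int
relax_recv_timeout(nng_socket sock)
{
	return nng_socket_set_ms(sock, NNG_OPT_RECVTIMEO, 5000); /* was 50 */
}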

Having lots of short timeouts is a scalability problem for NNG, and I have some ideas on how to address that in the future, but ultimately it's always going to be the case that the fast code path is reserved for operations that do not time out. In a healthy system, that's what we expect to see.

Conversely, if you're just posting a read here because you want to have a blocking read -- then make it a blocking read. :-)

One thing that you should not do is use short timeouts so you can go do other work. If you need to do that, it's better to have a single nng_sleep_aio() -- which can be used to run a periodic callback -- but have just one of those, and let all other socket I/O either sleep or use longer, sensible timeouts.
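
(A sketch of that suggestion: one nng_sleep_aio drives the periodic work, re-arming itself in its callback, while socket reads block or use long timeouts. The 1-second period and callback body are placeholders.)

#include <nng/nng.h>

static nng_aio *tick_aio;

/* Fires once per interval and re-arms the same aio, so the expiration
 * list holds exactly one timer regardless of how many sockets exist. */
static void
tick_cb(void *arg)
{
	(void) arg;
	if (nng_aio_result(tick_aio) != 0) {
		return; /* cancelled during shutdown */
	}
	/* ... periodic housekeeping: check shutdown flag, stats, etc. ... */
	nng_sleep_aio(1000, tick_aio);
}

static int
start_ticker(void)
{
	int rv;

	if ((rv = nng_aio_alloc(&tick_aio, tick_cb, NULL)) != 0) {
		return rv;
	}
	nng_sleep_aio(1000, tick_aio);
	return 0;
}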

@gdamore
Contributor

gdamore commented Dec 17, 2023

Actually, the printf probably exacerbated the problem by spreading out the timeouts, too! (By slightly extending the setup time, thereby making it less likely that any two sockets would wind up getting initialized with the same final timeout.)

@jritts
Author

jritts commented Dec 17, 2023

It starts in that state, no connections - I'm just building and running that single source file, then running 'top'. I'll share a profile first thing in the morning.

I can also write a new makefile for it within a Docker container so it's easy to reproduce, and to rule out (or not) some sort of toolchain or dependency issue as the root cause.

@jritts
Author

jritts commented Dec 17, 2023

I'll share a profile with TLS disabled too, in case it's useful.

Should I try any socket options (non-blocking, etc.)?

@gdamore
Contributor

gdamore commented Dec 17, 2023

Please see above ^^ You should just change the timeout for the recvmsg to something a lot larger than 50 ms.

@jritts
Author

jritts commented Dec 17, 2023

Yep, got it now (I didn't see your new comment until after I submitted mine). Makes sense, and thanks for all the detail. Will report back.

@jritts
Author

jritts commented Dec 17, 2023

Messages are queued to the main thread, so the timeout was essentially just setting the polling period for shutting down the server - there's no need at all for a small value.

Is there a way to unblock nng_recvmsg, short of spoofing a message?

Thanks very much again for your help.

@gdamore
Contributor

gdamore commented Dec 17, 2023

Closing the socket will wake the receivers.
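
(For example, a sketch with a hypothetical receive loop: a thread blocked in nng_recvmsg() with an infinite timeout returns NNG_ECLOSED once another thread closes the socket, so no spoofed message is needed.)

#include <nng/nng.h>

/* Blocks until a message arrives or the socket is closed elsewhere. */
static void
recv_loop(nng_socket sock)
{
	for (;;) {
		nng_msg *msg;
		int      rv = nng_recvmsg(sock, &msg, 0);

		if (rv == NNG_ECLOSED) {
			break; /* woken by nng_close() from the shutdown path */
		}
		if (rv != 0) {
			continue;
		}
		/* ... process msg ... */
		nng_msg_free(msg);
	}
}

static void
shutdown_server(nng_socket sock)
{
	nng_close(sock); /* wakes any thread blocked in nng_recvmsg() */
}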
