C/C++ executable is SEGVing sporadically in nni_list_remove. #1523

Closed
WJCFerguson opened this issue Nov 15, 2021 · 16 comments

@WJCFerguson

Description

This is not currently very actionable by nng devs as I cannot present a simple way to reproduce it. But I'm reporting it in case someone can offer any helpful insight as I continue to try to track it down.

I'm using the direct C bindings for nng 1.5.2, with straightforward, modest use of PUB/SUB only so far. Sadly it's something of a Heisenbug, as instrumenting the code reduces its frequency.

Here are two SEGV stack traces I've observed:

#0  0x0000005572ba6240 in nni_list_remove ()
#1  0x0000005572bb6400 in sub0_recv_cb ()
#2  0x0000005572bb0584 in nni_task_exec ()
#3  0x0000005572ba2da0 in nni_aio_finish_impl ()
#4  0x0000005572ba2e20 in nni_aio_finish_sync ()
#5  0x0000005572bb8b10 in ipc_pipe_recv_cb ()
#6  0x0000005572bb0114 in nni_taskq_thread ()
#7  0x0000005572bb0b40 in nni_thr_wrap ()
#8  0x0000005572bb41bc in nni_plat_thr_main ()
#9  0x0000007f88277088 in start_thread (arg=0x7fd0cde9df) at pthread_create.c:463
#10 0x0000007f87e64ffc in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
#0  0x0000005570b752d0 in nni_list_remove ()
#1  0x0000005570b836c4 in sub0_ctx_cancel ()
#2  0x0000005570b721d8 in nni_aio_expire_loop ()
#3  0x0000005570b7e994 in nni_thr_wrap ()
#4  0x0000005570b81e1c in nni_plat_thr_main ()
#5  0x0000007f790d8088 in start_thread (arg=0x7fe75ba76f) at pthread_create.c:463
#6  0x0000007f78cc5ffc in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

Since I found no other reports mentioning nni_list_remove, I initially suspected issues outside nng, e.g. other code trampling over memory, so I ran under valgrind with our code compiled DEBUG and found no problems, though the frequency of crashes dropped when running the DEBUG-built code.

I have now built nng itself DEBUG; no crashes so far, but execution time has been limited.

I only get crashes when running our build on our target NVidia Jetson Nano board (ARMv8 Processor rev 1 (v8l)), not when running on x86. It is a low-powered board, in the ballpark of a Raspberry Pi.

The consumer of the sub channels is a thread in a while loop receiving messages, and the socket is configured with NNG_OPT_RECVTIMEO set to 500 ms.
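
For reference, the receive loop described above follows roughly this pattern (a minimal sketch only; the URL, the subscribe-to-everything topic, the choice to dial rather than listen, and the error handling are illustrative placeholders, not the reporter's actual code):

#include <stdio.h>
#include <nng/nng.h>
#include <nng/protocol/pubsub0/sub.h>

/* Minimal sketch of the subscriber thread described above. */
static void
sub_loop(const char *url)
{
    nng_socket sock;
    nng_msg *  msg;
    int        rv;

    if ((rv = nng_sub0_open(&sock)) != 0) {
        fprintf(stderr, "nng_sub0_open: %s\n", nng_strerror(rv));
        return;
    }
    /* Subscribe to everything, and apply the 500 ms receive timeout
       (NNG_OPT_RECVTIMEO) mentioned in the report. */
    nng_socket_set(sock, NNG_OPT_SUB_SUBSCRIBE, "", 0);
    nng_socket_set_ms(sock, NNG_OPT_RECVTIMEO, 500);

    if ((rv = nng_dial(sock, url, NULL, 0)) != 0) {
        fprintf(stderr, "nng_dial: %s\n", nng_strerror(rv));
        nng_close(sock);
        return;
    }

    for (;;) {
        rv = nng_recvmsg(sock, &msg, 0);
        if (rv == NNG_ETIMEDOUT) {
            continue; /* periodic timeout; loop around and try again */
        }
        if (rv != 0) {
            break;    /* socket closed or other error */
        }
        /* ... process nng_msg_body(msg) / nng_msg_len(msg) here ... */
        nng_msg_free(msg);
    }
    nng_close(sock);
}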

Environment Details

static nng 1.5.2 (also tried 1.4.0, got same crash)
gcc 9.4
Ubuntu 18.04
NVidia Jetson Nano board 4 core ARMv8 Processor rev 1 (v8l). Not observed on X86, but code is exercised less there, so could be chance.

@WJCFerguson
Author

Got a crash with a debug build. The stack goes further when built debug.

(gdb) where
#0  0x0000007fb16454f8 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x0000007fb16468d4 in __GI_abort () at abort.c:79
#2  0x0000005571083974 in nni_plat_abort () at /home/volta/libraries/nng-1.5.2/src/platform/posix/posix_debug.c:23
#3  0x0000005571079688 in nni_panic (fmt=0x557157b110 "pthread_mutex_unlock: %s") at /home/volta/libraries/nng-1.5.2/src/core/panic.c:66
#4  0x0000005571083c8c in nni_pthread_mutex_unlock (m=0x7f8c0038a0) at /home/volta/libraries/nng-1.5.2/src/platform/posix/posix_thread.c:94
#5  0x0000005571083e24 in nni_plat_mtx_unlock (mtx=0x7f8c0038a0) at /home/volta/libraries/nng-1.5.2/src/platform/posix/posix_thread.c:152
#6  0x000000557107fbb8 in nni_mtx_unlock (mtx=0x7f8c0038a0) at /home/volta/libraries/nng-1.5.2/src/core/thread.c:33
#7  0x000000557107f774 in nni_task_exec (task=0x7f8c003870) at /home/volta/libraries/nng-1.5.2/src/core/taskq.c:138
#8  0x00000055710730fc in nni_aio_finish_impl (aio=0x7f8c003850, rv=0, count=9, msg=0x0, sync=true) at /home/volta/libraries/nng-1.5.2/src/core/aio.c:450
#9  0x000000557107317c in nni_aio_finish_sync (aio=0x7f8c003850, result=0, count=9) at /home/volta/libraries/nng-1.5.2/src/core/aio.c:465
#10 0x0000005571086450 in sub0_recv_cb (arg=0x7f44001510) at /home/volta/libraries/nng-1.5.2/src/sp/protocol/pubsub0/sub.c:417
#11 0x000000557107f798 in nni_task_exec (task=0x7f44001540) at /home/volta/libraries/nng-1.5.2/src/core/taskq.c:141
#12 0x00000055710730fc in nni_aio_finish_impl (aio=0x7f44001520, rv=0, count=9, msg=0x0, sync=true) at /home/volta/libraries/nng-1.5.2/src/core/aio.c:450
#13 0x000000557107317c in nni_aio_finish_sync (aio=0x7f44001520, result=0, count=9) at /home/volta/libraries/nng-1.5.2/src/core/aio.c:465
#14 0x000000557108b80c in ipc_pipe_recv_cb (arg=0x7f540022f0) at /home/volta/libraries/nng-1.5.2/src/sp/transport/ipc/ipc.c:422
#15 0x000000557107f328 in nni_taskq_thread (self=0x559e718510) at /home/volta/libraries/nng-1.5.2/src/core/taskq.c:47
#16 0x000000557107fd54 in nni_thr_wrap (arg=0x559e718518) at /home/volta/libraries/nng-1.5.2/src/core/thread.c:94
#17 0x00000055710841dc in nni_plat_thr_main (arg=0x559e718518) at /home/volta/libraries/nng-1.5.2/src/platform/posix/posix_thread.c:266
#18 0x0000007fb1af5088 in start_thread (arg=0x7fd8a4535f) at pthread_create.c:463
#19 0x0000007fb16e2ffc in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
(gdb) up
#10 0x0000005571086450 in sub0_recv_cb (arg=0x7f44001510) at /home/volta/libraries/nng-1.5.2/src/sp/protocol/pubsub0/sub.c:417
(gdb) p *aio
$9 = {a_count = 0, a_expire = 1637050488521, a_timeout = 500, a_result = 0, a_stop = false, a_sleep = false, a_expire_ok = false, a_expiring = false, a_task = {task_node = {ln_next = 0x0, ln_prev = 0x0}, task_arg = 0x0, task_cb = 0x0, task_tq = 0x559e71aef0, task_busy = 1, task_prep = true, task_mtx = {
      mtx = pthread_mutex_t = {Type = Error check, Status = Not acquired, Robust = No, Shared = No, Protocol = None}}, task_cv = {cv = pthread_cond_t = {Threads known to still execute a wait function = 1, Clock ID = CLOCK_REALTIME, Shared = No}, mtx = 0x7f8c0038a0}}, a_iov = {{iov_buf = 0x0, iov_len = 0}, {
      iov_buf = 0x0, iov_len = 0}, {iov_buf = 0x0, iov_len = 0}, {iov_buf = 0x0, iov_len = 0}, {iov_buf = 0x0, iov_len = 0}, {iov_buf = 0x0, iov_len = 0}, {iov_buf = 0x0, iov_len = 0}, {iov_buf = 0x0, iov_len = 0}}, a_nio = 0, a_msg = 0x0, a_inputs = {0x0, 0x0, 0x0, 0x0}, a_outputs = {0x0, 0x0, 0x0, 0x0}, 
  a_cancel_fn = 0x5571085a28 <sub0_ctx_cancel>, a_cancel_arg = 0x7f8c0010c8, a_prov_node = {ln_next = 0x7f8c0010f8, ln_prev = 0x7f8c0010f8}, a_prov_extra = {0x0, 0x0}, a_expire_q = 0x559e719160, a_expire_node = {ln_next = 0x559e7191c8, ln_prev = 0x7f680039d0}, a_reap_node = {rn_next = 0x0}}

From my cursory understanding it looks like it's trying to release a mutex that isn't acquired? What other info might be instructive?
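
For context, here is a standalone illustration (plain pthreads, not nng code) of the failure mode suggested by that struct dump: with an error-checking mutex, pthread_mutex_unlock fails with EPERM when the calling thread does not hold the lock, and the nni_pthread_mutex_unlock wrapper seen in the trace panics on any unlock failure.

#include <errno.h>
#include <pthread.h>
#include <stdio.h>

/* Standalone illustration: unlocking an error-checking mutex that the
   calling thread has not locked fails with EPERM rather than silently
   corrupting state. */
int
main(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t     mtx;
    int                 rv;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&mtx, &attr);

    rv = pthread_mutex_unlock(&mtx); /* never locked by this thread */
    printf("unlock without lock: rv=%d (EPERM=%d)\n", rv, EPERM);

    pthread_mutex_destroy(&mtx);
    pthread_mutexattr_destroy(&attr);
    return 0;
}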

I'm afraid the most expedient thing for our project may be for me to switch our code to using ZMQ. But I'm happy to try something if there are suggestions.

@WJCFerguson
Author

Apologies, but for expediency I've switched our project to ZMQ. Given that I cannot provide a simple way to reproduce this issue, please feel free to reject or drop this issue as per whatever your policy/desire is.

It could be some interaction with other libraries we're using. We are using gRPC, which is frankly a monster of a library and has already caused us problems, with its own SSL implementation breaking things in other libraries.

@gdamore
Contributor

gdamore commented Nov 19, 2021

Ok. I will investigate as I have time and leave this open for now.

@gdamore
Contributor

gdamore commented Dec 8, 2021

Can you please test what's in master? I'm looking back at the changes, and some things changed rather significantly.

If this reproduces in master, it would be good to know; otherwise I'd just close it.

@gdamore
Contributor

gdamore commented Dec 27, 2021

I think I may know what is causing this. It's slightly possible (a narrow race) for sub aio objects to be on a list that isn't the one we mostly care about. The problem is that our reliance on nni_list_active is not sufficient to the cause.

@gdamore gdamore added bug and removed feedback labels Dec 27, 2021
@ghost

ghost commented May 16, 2022

Thread 55 "nng:aio:expire" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff2dff3640 (LWP 7611)]
nni_list_remove (list=0x7ffe80003770, item=0x7ffee00083d0) at external/nng/src/core/list.c:139
139	in external/nng/src/core/list.c
(gdb) bt
#0  nni_list_remove (list=0x7ffe80003770, item=0x7ffee00083d0) at external/nng/src/core/list.c:139
#1  0x00005555565e7ddc in sub0_ctx_cancel (aio=0x7ffee00083d0, arg=0x7ffe80003740, rv=5) at external/nng/src/sp/protocol/pubsub0/sub.c:86
#2  0x00005555565d67b6 in nni_aio_expire_loop (arg=0x5555574540a0) at external/nng/src/core/aio.c:585
#3  0x00005555565e33f5 in nni_thr_wrap (arg=0x555557454120) at external/nng/src/core/thread.c:94
#4  0x00005555565e64ed in nni_plat_thr_main (arg=0x555557454120) at external/nng/src/platform/posix/posix_thread.c:218
#5  0x00007ffff477bb43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#6  0x00007ffff480da00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

We are also seeing this crash intermittently.
tag: v1.4.0.

@cadaker

cadaker commented May 24, 2022

I'm also seeing this crash intermittently.

Ubuntu 20.04, running on an Intel NUC, Intel(R) Core(TM) i3-10110U.

@JoshSalitSonos
Contributor

It's VERY difficult to reproduce, but we have seen this on current master (commit b428d51):

#0  nni_list_remove (list=list@entry=0x555556064dc8, item=item@entry=0x555556065760) at ../git/src/core/list.c:139
#1  0x00007ffff79c6cd2 in sub0_ctx_cancel (aio=aio@entry=0x555556065760, arg=arg@entry=0x555556064d98, rv=rv@entry=20) at ../git/src/sp/protocol/pubsub0/sub.c:86
#2  0x00007ffff79b4d02 in nni_aio_stop (aio=0x555556065760) at ../git/src/core/aio.c:217
#3  0x00007ffff79b3fb4 in nng_aio_stop (aio=<optimized out>) at ../git/src/nng.c:1877

@ntakouris

Having this issue after ~30-60min, using pynng.

@ntakouris

ntakouris commented Oct 12, 2022

This goes away with the latest release, at least for Linux x86-64 (Ubuntu desktop) and aarch64 (Denver, TX2 NX).
UPDATE 17/10/2022: No, it does not go away.

@ntakouris

I think I may know what is causing this. It's slightly possible (a narrow race) for sub aio objects to be on a list that isn't the one we mostly care about. The problem is that our reliance on nni_list_active is not sufficient to the cause.

Care to elaborate on how this is possible? Every single nni_list op is properly guarded by some sort of acquire/unlock mechanism.

@JoshSalitSonos
Contributor

@gdamore - based on your earlier analysis that there might be a race dealing with sub aio objects on lists, does it follow that there might be an equivalent race seen on the nni_list_append side? e.g.:

0   libc.so.6                       0x7fa2d9b9ac        gsignal (raise.c:51)
1   libc.so.6                       0x7fa2d89a64        abort (abort.c:79)
2   libnng.so.1                     0x7fa2b7cab8        nni_plat_abort (posix_debug.c:23)
3   libnng.so.1                     0x7fa2b84958        nni_panic (panic.c:66)
4   libnng.so.1                     0x7fa2b81f38        nni_list_append (list.c:59)
5   libnng.so.1                     0x7fa2b88f4c        nni_task_dispatch (taskq.c:172)
6   libnng.so.1                     0x7fa2b95800        rep0_cancel_recv (rep.c:410)
7   libnng.so.1                     0x7fa2b7fc78        nni_aio_expire_loop (aio.c:616)
8   libnng.so.1                     0x7fa2b89adc        nni_thr_wrap (thread.c:94)
9   libnng.so.1                     0x7fa2b90080        nni_plat_thr_main (posix_thread.c:266)
10  libpthread.so.0                 0x7fa2ee2918        start_thread (pthread_create.c:486)
11  libc.so.6                       0x7fa2e39ac8        thread_start

@gdamore
Contributor

gdamore commented Apr 24, 2023

So there is a possibility that other corruption (use after free) might have been impacting this -- although it seems unlikely that it would be so specific. Please see #1657 and #1658.

Meanwhile I'm going to look at this a bit more closely.

@JoshSalitSonos
Contributor

We're still hitting this on commit c5e9d8a, with request/reply and pub/sub. I thought I'd add a little more context after catching a request/reply instance in GDB: single listener, single dialer.

the thread that segfaults on the reply side:

#0  nni_list_node_remove (node=node@entry=0x198) at ../git/src/core/list.c:170
#1  0x00007ffff7e3c15b in nni_aio_list_remove (aio=aio@entry=0x0) at ../git/src/core/aio.c:502
#2  0x00007ffff7e504f4 in ipc_pipe_send_cb (arg=0x7fffe4003b90) at ../git/src/sp/transport/ipc/ipc.c:328
#3  0x00007ffff7e42c67 in nni_taskq_thread (self=<optimized out>) at ../git/src/core/taskq.c:47
#4  0x00007ffff7e43aa9 in nni_thr_wrap (arg=0x5555561e05a8) at ../git/src/core/thread.c:94
#5  0x00007ffff7e498ff in nni_plat_thr_main (arg=0x5555561e05a8) at ../git/src/platform/posix/posix_thread.c:266
#6  0x00007ffff7a8c412 in start_thread (arg=<optimized out>) at pthread_create.c:442
#7  0x00007ffff7b0dab0 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

companion reply-side thread, waiting to lock the pipe mutex on a close:

#0  futex_wait (private=0, expected=2, futex_word=0x7fffe4004230) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x7fffe4004230, private=0) at lowlevellock.c:49
#2  0x00007ffff7a8f722 in lll_mutex_lock_optimized (mutex=0x7fffe4004230) at pthread_mutex_lock.c:48
#3  ___pthread_mutex_lock (mutex=mutex@entry=0x7fffe4004230) at pthread_mutex_lock.c:93
#4  0x00007ffff7e4996e in nni_pthread_mutex_lock (m=m@entry=0x7fffe4004230) at ../git/src/platform/posix/posix_thread.c:84
#5  nni_plat_mtx_lock (mtx=mtx@entry=0x7fffe4004230) at ../git/src/platform/posix/posix_thread.c:146
#6  0x00007ffff7e43ae2 in nni_mtx_lock (mtx=mtx@entry=0x7fffe4004230) at ../git/src/core/thread.c:27
#7  0x00007ffff7e4f950 in ipc_pipe_close (arg=0x7fffe4003b90) at ../git/src/sp/transport/ipc/ipc.c:110
#8  0x00007ffff7e4ffd1 in ipc_ep_close (arg=0x5555561f09b0) at ../git/src/sp/transport/ipc/ipc.c:674
#9  0x00007ffff7e41d9b in nni_listener_shutdown (l=l@entry=0x5555561f2cd0) at ../git/src/core/socket.c:1667
#10 0x00007ffff7e3decc in nni_listener_close (l=l@entry=0x5555561f2cd0) at ../git/src/core/listener.c:331
#11 0x00007ffff7e408b9 in nni_sock_shutdown (sock=sock@entry=0x5555561f1480) at ../git/src/core/socket.c:696
#12 0x00007ffff7e40a44 in nni_sock_close (s=0x5555561f1480) at ../git/src/core/socket.c:780
#13 0x00007ffff7e393f7 in nng_close (s=...) at ../git/src/nng.c:45

the pipe's send_q is definitely empty:

gdb$ p &p.send_q
$16 = (nni_list *) 0x7fffe4003cb8
gdb$ p p.send_q
$17 = {
  ll_head = {
    ln_next = 0x7fffe4003cb8,
    ln_prev = 0x7fffe4003cb8
  },
  ll_offset = 408
}

This is happening after nng_close has been called on the dialer side's socket. @gdamore, referencing your earlier mention of nni_list_active, I see that it only checks whether ln_next == NULL, unlike nni_list_empty, which checks for the condition seen above. Is that what you meant by "nni_list_active is not sufficient to the cause"?
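
To make the distinction concrete, here is a simplified sketch (illustrative only, not the actual nng source), modeled on the node and list layout visible in the gdb output above:

#include <stdbool.h>
#include <stddef.h>

/* Intrusive doubly-linked list shapes modeled on the gdb output above. */
typedef struct lnode {
    struct lnode *ln_next;
    struct lnode *ln_prev;
} lnode;

typedef struct llist {
    lnode ll_head; /* sentinel: when the list is empty it points back at itself */
} llist;

/* "empty" inspects the list itself, like the send_q printed above. */
static bool
llist_empty(llist *l)
{
    return (l->ll_head.ln_next == &l->ll_head);
}

/* "active" inspects only the node: it can tell that the node is linked
   on *some* list, but not which one, which is the gap in question. */
static bool
lnode_active(lnode *n)
{
    return (n->ln_next != NULL);
}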

@Bartel-C8

Seems related to the proposed fix in #1695?

@willwu1217

Seems related to the proposed fix in #1695?

Yes, the call stack looks the same.

gdamore added a commit that referenced this issue Nov 26, 2023
Credit goes to Wu Xuan (@willwu1217) for diagnosing and proposing
a fix as part of #1695. This revised approach avoids adding extra
memory, and it is also slightly faster, as by reusing the reap node
we do not need to update both pointers in the linked list.

As part of this a new internal API, nni_aio_completions, is introduced.
In all likelihood we will be able to use this to solve some similar
crashes in other areas of the code.
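
The general shape of that pattern, sketched here with hypothetical names rather than the real nni_aio_completions API: completed operations are queued on a local singly-linked list (reusing an existing link in the object, analogous to the reap node) while the protocol lock is held, and their callbacks are only run after the lock has been dropped.

#include <stddef.h>

/* Conceptual sketch with hypothetical names; not the actual internal API. */
typedef struct my_aio {
    struct my_aio *next;                /* reused single link, like the reap node */
    void (*done)(struct my_aio *, int); /* completion callback */
    int            result;
} my_aio;

typedef struct {
    my_aio *head;
} my_completions;

/* Call with the protocol lock held: only one pointer in the aio is written. */
static void
completions_add(my_completions *c, my_aio *aio, int result)
{
    aio->result = result;
    aio->next   = c->head;
    c->head     = aio;
}

/* Call after the protocol lock is released, so the callbacks cannot race
   with the list manipulation that happened under the lock. */
static void
completions_run(my_completions *c)
{
    my_aio *aio;

    while ((aio = c->head) != NULL) {
        c->head   = aio->next;
        aio->next = NULL;
        aio->done(aio, aio->result);
    }
}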