C/C++ executable is SEGVing sporadically in nni_list_remove
#1523
Comments
Got a crash with a debug build. The stack trace goes further when built debug.
(gdb) where
#0 0x0000007fb16454f8 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x0000007fb16468d4 in __GI_abort () at abort.c:79
#2 0x0000005571083974 in nni_plat_abort () at /home/volta/libraries/nng-1.5.2/src/platform/posix/posix_debug.c:23
#3 0x0000005571079688 in nni_panic (fmt=0x557157b110 "pthread_mutex_unlock: %s") at /home/volta/libraries/nng-1.5.2/src/core/panic.c:66
#4 0x0000005571083c8c in nni_pthread_mutex_unlock (m=0x7f8c0038a0) at /home/volta/libraries/nng-1.5.2/src/platform/posix/posix_thread.c:94
#5 0x0000005571083e24 in nni_plat_mtx_unlock (mtx=0x7f8c0038a0) at /home/volta/libraries/nng-1.5.2/src/platform/posix/posix_thread.c:152
#6 0x000000557107fbb8 in nni_mtx_unlock (mtx=0x7f8c0038a0) at /home/volta/libraries/nng-1.5.2/src/core/thread.c:33
#7 0x000000557107f774 in nni_task_exec (task=0x7f8c003870) at /home/volta/libraries/nng-1.5.2/src/core/taskq.c:138
#8 0x00000055710730fc in nni_aio_finish_impl (aio=0x7f8c003850, rv=0, count=9, msg=0x0, sync=true) at /home/volta/libraries/nng-1.5.2/src/core/aio.c:450
#9 0x000000557107317c in nni_aio_finish_sync (aio=0x7f8c003850, result=0, count=9) at /home/volta/libraries/nng-1.5.2/src/core/aio.c:465
#10 0x0000005571086450 in sub0_recv_cb (arg=0x7f44001510) at /home/volta/libraries/nng-1.5.2/src/sp/protocol/pubsub0/sub.c:417
#11 0x000000557107f798 in nni_task_exec (task=0x7f44001540) at /home/volta/libraries/nng-1.5.2/src/core/taskq.c:141
#12 0x00000055710730fc in nni_aio_finish_impl (aio=0x7f44001520, rv=0, count=9, msg=0x0, sync=true) at /home/volta/libraries/nng-1.5.2/src/core/aio.c:450
#13 0x000000557107317c in nni_aio_finish_sync (aio=0x7f44001520, result=0, count=9) at /home/volta/libraries/nng-1.5.2/src/core/aio.c:465
#14 0x000000557108b80c in ipc_pipe_recv_cb (arg=0x7f540022f0) at /home/volta/libraries/nng-1.5.2/src/sp/transport/ipc/ipc.c:422
#15 0x000000557107f328 in nni_taskq_thread (self=0x559e718510) at /home/volta/libraries/nng-1.5.2/src/core/taskq.c:47
#16 0x000000557107fd54 in nni_thr_wrap (arg=0x559e718518) at /home/volta/libraries/nng-1.5.2/src/core/thread.c:94
#17 0x00000055710841dc in nni_plat_thr_main (arg=0x559e718518) at /home/volta/libraries/nng-1.5.2/src/platform/posix/posix_thread.c:266
#18 0x0000007fb1af5088 in start_thread (arg=0x7fd8a4535f) at pthread_create.c:463
#19 0x0000007fb16e2ffc in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
(gdb) up
#10 0x0000005571086450 in sub0_recv_cb (arg=0x7f44001510) at /home/volta/libraries/nng-1.5.2/src/sp/protocol/pubsub0/sub.c:417
(gdb) p *aio
$9 = {a_count = 0, a_expire = 1637050488521, a_timeout = 500, a_result = 0, a_stop = false, a_sleep = false, a_expire_ok = false, a_expiring = false, a_task = {task_node = {ln_next = 0x0, ln_prev = 0x0}, task_arg = 0x0, task_cb = 0x0, task_tq = 0x559e71aef0, task_busy = 1, task_prep = true, task_mtx = {
mtx = pthread_mutex_t = {Type = Error check, Status = Not acquired, Robust = No, Shared = No, Protocol = None}}, task_cv = {cv = pthread_cond_t = {Threads known to still execute a wait function = 1, Clock ID = CLOCK_REALTIME, Shared = No}, mtx = 0x7f8c0038a0}}, a_iov = {{iov_buf = 0x0, iov_len = 0}, {
iov_buf = 0x0, iov_len = 0}, {iov_buf = 0x0, iov_len = 0}, {iov_buf = 0x0, iov_len = 0}, {iov_buf = 0x0, iov_len = 0}, {iov_buf = 0x0, iov_len = 0}, {iov_buf = 0x0, iov_len = 0}, {iov_buf = 0x0, iov_len = 0}}, a_nio = 0, a_msg = 0x0, a_inputs = {0x0, 0x0, 0x0, 0x0}, a_outputs = {0x0, 0x0, 0x0, 0x0},
a_cancel_fn = 0x5571085a28 <sub0_ctx_cancel>, a_cancel_arg = 0x7f8c0010c8, a_prov_node = {ln_next = 0x7f8c0010f8, ln_prev = 0x7f8c0010f8}, a_prov_extra = {0x0, 0x0}, a_expire_q = 0x559e719160, a_expire_node = {ln_next = 0x559e7191c8, ln_prev = 0x7f680039d0}, a_reap_node = {rn_next = 0x0}}
From my cursory understanding, it looks like it's trying to release a mutex that isn't acquired. What other info might be instructive? I'm afraid the most expedient thing for our project may be for me to switch our code to ZMQ, but I'm happy to try something if there are suggestions.
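For context on the trace above: frame #3 shows nni_panic being reached from nni_pthread_mutex_unlock with the format string "pthread_mutex_unlock: %s", and gdb reports the mutex as "Type = Error check, Status = Not acquired". The following is a minimal sketch of that failure mode in isolation, using plain POSIX threads rather than any nng code. The function name unlock_unheld_errorcheck_mutex is made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical helper (not part of nng): unlock an error-checking mutex
 * that no thread holds, and return what pthread_mutex_unlock() reports. */
int unlock_unheld_errorcheck_mutex(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t     mtx;
    int                 rv;

    rv = pthread_mutexattr_init(&attr);
    assert(rv == 0);
    rv = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    assert(rv == 0);
    rv = pthread_mutex_init(&mtx, &attr);
    assert(rv == 0);
    pthread_mutexattr_destroy(&attr);

    /* The mutex was never locked, so this unlock is an error. */
    rv = pthread_mutex_unlock(&mtx);

    pthread_mutex_destroy(&mtx);
    return rv;
}
```

POSIX specifies EPERM for this case, so calling unlock_unheld_errorcheck_mutex() and comparing the result against EPERM should succeed. A wrapper that aborts on any nonzero return, as frames #2 to #4 suggest, would turn that error into the abort seen here. With a default mutex the same unlock would be undefined behavior rather than a reported error, which may be why the debug build fails in a different place.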
Apologies, but for expediency I've switched our project to ZMQ. Given that I cannot provide a simple way to reproduce this issue, please feel free to reject or drop it as per whatever your policy/desire is. It could be some interaction with other libraries we're using. We are using gRPC, which is frankly a monster of a library and has already caused us problems in other libraries, with its own SSL implementation breaking other things.
Ok. I will investigate as I have time and leave this open for now.
Can you please test what's in master? I'm looking back at changes, and some stuff changed rather significantly. If this reproduces in master it would be good to know; otherwise I'd just close it.
I think I may know what is causing this. It's slightly possible (narrow race) for sub aio objects to be on a list that isn't the one we mostly care about. The problem is that our reliance on nni_list_active is not sufficient.
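To illustrate the kind of problem described in the comment above, here is a small single-threaded sketch of an intrusive doubly linked list where "is this node linked?" cannot tell which list the node is on. This is not nng's list implementation. All names (node, list, node_active, list_remove, active_check_hits_wrong_list) are hypothetical, and the race itself is replaced by a plain sequence of calls so that the outcome is deterministic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical intrusive list; names are illustrative, not nng's. */
typedef struct node {
    struct node *next;
    struct node *prev;
} node;

typedef struct list {
    node head; /* sentinel of a circular list */
} list;

static void list_init(list *l) { l->head.next = &l->head; l->head.prev = &l->head; }
static void node_init(node *n) { n->next = NULL; n->prev = NULL; }
static bool list_empty(const list *l) { return l->head.next == &l->head; }

/* True if the node is linked into *some* list; it cannot say which one. */
static bool node_active(const node *n) { return n->next != NULL; }

static void list_append(list *l, node *n)
{
    n->prev            = l->head.prev;
    n->next            = &l->head;
    l->head.prev->next = n;
    l->head.prev       = n;
}

/* Unlinks n from whatever list it is on; `l` is never consulted. */
static void list_remove(list *l, node *n)
{
    (void) l;
    assert(node_active(n));
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->next       = NULL;
    n->prev       = NULL;
}

bool active_check_hits_wrong_list(void)
{
    list waiting, done;
    node n;

    list_init(&waiting);
    list_init(&done);
    node_init(&n);

    /* Stand-in for the race: another code path put n on `done`. */
    list_append(&done, &n);

    /* The caller assumes "active" means "on waiting" and removes it there. */
    if (node_active(&n)) {
        list_remove(&waiting, &n);
    }

    /* n has silently vanished from `done`, a list the caller never meant to touch. */
    return list_empty(&done) && list_empty(&waiting);
}
```

In this sketch the damage is only a lost entry. With real concurrency and a different lock protecting each list, the same mistake would modify a list without holding its lock, which is one plausible route to a crash inside a list-removal routine.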
We are also seeing this crash intermittently.
I'm also seeing this crash intermittently. Ubuntu 20.04, running on an Intel NUC, Intel(R) Core(TM) i3-10110U.
it's VERY difficult to reproduce, but we have seen this on current master (commit b428d51):
Having this issue after ~30-60min, using pynng.
This goes away with the latest release, at least for Linux x86-64 (Ubuntu desktop) and aarch64 (denver, tx2 nx).
Care to elaborate on how this is possible? Every single |
@gdamore - based on your earlier analysis that there might be a race dealing with sub aio objects on lists, does it follow that there might be an equivalent race seen on the
we're still hitting this on commit c5e9d8a, with request/reply and pub/sub. I thought I'd add a little more context after catching a request/reply instance in GDB. single listener, single dialer. the thread that segfaults on the reply side:
companion reply-side thread, waiting to lock the pipe mutex on a close:
the pipe's
This is happening after
Seems related to the proposed fix of: #1695 ?
Yes, the call stack looks the same.
Credit goes to Wu Xuan (@willwu1217) for diagnosing and proposing a fix as part of #1695. This change takes a revised approach: by reusing the reap node, it avoids adding extra memory and is also slightly faster, as we do not need to update both pointers in the linked list. As part of this a new internal API, nni_aio_completions, is introduced. In all likelihood we will be able to use this to solve some similar crashes in other areas of the code.
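The fix description above mentions a singly linked node and a completions list. The sketch below shows the general pattern such an API might follow: finished operations are collected on a singly linked list while a lock is held, and their callbacks run only after the lock is released. It does not reproduce nni_aio_completions. The types and functions here (op, completions, completions_add, completions_run, deferred_completion_demo) are hypothetical stand-ins.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical operation with a single link pointer (illustrative only). */
typedef struct op {
    struct op *next;
    void (*cb)(void *);
    void *arg;
} op;

typedef struct {
    op *head;
} completions;

static void completions_init(completions *c) { c->head = NULL; }

/* Push onto the list: only one pointer per node is written. */
static void completions_add(completions *c, op *o)
{
    assert(o->next == NULL);
    o->next = c->head;
    c->head = o;
}

/* Pop and run every callback; intended to be called without locks held. */
static void completions_run(completions *c)
{
    op *o;
    while ((o = c->head) != NULL) {
        c->head = o->next;
        o->next = NULL;
        o->cb(o->arg);
    }
}

static void count_cb(void *arg) { (*(int *) arg)++; }

int deferred_completion_demo(void)
{
    pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
    completions     done;
    int             calls = 0;
    op              a     = { NULL, count_cb, &calls };
    op              b     = { NULL, count_cb, &calls };

    completions_init(&done);

    pthread_mutex_lock(&lock);
    /* Decide which operations are finished while state is protected... */
    completions_add(&done, &a);
    completions_add(&done, &b);
    pthread_mutex_unlock(&lock);

    /* ...then run their callbacks with the lock released. */
    completions_run(&done);
    return calls;
}
```

Here deferred_completion_demo() returns 2, one per queued callback. Note that this simple push-to-head version runs callbacks in reverse order of insertion. Whether ordering matters, and how the real API handles it, is not something the thread states.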
Description
This is not currently very actionable by nng devs as I cannot present a simple way to reproduce it. But I'm reporting it in case someone can offer any helpful insight as I continue to try to track it down.
I'm using direct C bindings for nng 1.5.2, straightforward modest use of PUB/SUB only so far. Sadly it's something of a Heisenbug, as instrumenting the code reduces its frequency.
Here are two SEGV stack traces I've observed:
Since I found no other reports mentioning nni_list_remove, I initially suspected issues outside nng, e.g. other code trampling over memory. So I ran under valgrind with our code compiled DEBUG and found no problems, though the frequency of crashes dropped when running our DEBUG-built code. I have now built nng itself DEBUG; so far no crashes, but execution time has been limited.
I only get crashes with our build on our target NVidia Jetson Nano board (ARMv8 Processor rev 1 (v8l)), not when running on X86. This is a low-powered board, in the ballpark of a Raspberry Pi. The user of the sub channels is a thread in a while loop receiving messages, and the socket is configured with NNG_OPT_RECVTIMEO at 500.
Environment Details
static nng 1.5.2 (also tried 1.4.0, got same crash)
gcc 9.4
Ubuntu 18.04
NVidia Jetson Nano board, 4 core, ARMv8 Processor rev 1 (v8l). Not observed on X86, but code is exercised less there, so could be chance.