You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I've encountered a definite dead-lock in the close logic -- this is reproducible at low frequency/ likelihood just using the ws_test test tool.
Essentially we have two threads:
First:
frame #0: 0x00000001864bf710 libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x00000001864fc574 libsystem_pthread.dylib`_pthread_cond_wait + 1232
frame #2: 0x000000010501b324 ws_test`nni_pthread_cond_wait(c=0x00000001550051e0, m=0x00000001550051a0) at posix_thread.c:121:12
frame #3: 0x000000010501b2f8 ws_test`nni_plat_cv_wait(cv=0x00000001550051e0) at posix_thread.c:229:2
frame #4: 0x00000001050106d0 ws_test`nni_cv_wait(cv=0x00000001550051e0) at thread.c:51:2
* frame #5: 0x000000010502fe38 ws_test`http_server_stop(s=0x00000001550050d0) at http_server.c:1074:3
frame #6: 0x000000010502fd40 ws_test`nni_http_server_stop(s=0x00000001550050d0) at http_server.c:1084:3
frame #7: 0x00000001050354e4 ws_test`ws_listener_close(arg=0x0000000155004d80) at websocket.c:1757:3
frame #8: 0x000000010500d884 ws_test`nng_stream_listener_close(l=0x0000000155004d80) at stream.c:213:2
frame #9: 0x0000000105027a08 ws_test`wstran_listener_close(arg=0x0000000155004d00) at websocket.c:447:2
frame #10: 0x00000001050037a0 ws_test`nni_listener_stop(l=0x0000000155808a00) at listener.c:418:2
frame #11: 0x000000010500b4a8 ws_test`nni_listener_shutdown(l=0x0000000155808a00) at socket.c:1667:2
frame #12: 0x000000010500366c ws_test`nni_listener_close(l=0x0000000155808a00) at listener.c:331:2
frame #13: 0x00000001050090c0 ws_test`nni_sock_shutdown(sock=0x000000015480b600) at socket.c:696:3
frame #14: 0x00000001050093c8 ws_test`nni_sock_close(s=0x000000015480b600) at socket.c:780:2
frame #15: 0x0000000104ff8c44 ws_test`nng_close(s=(id = 3)) at nng.c:45:2
frame #16: 0x0000000104ff742c ws_test`test_wild_card_port at ws_test.c:82:2
(So you can see it is waiting on http_server_stop().)
The second thread:
frame #2: 0x000000010501b324 ws_test`nni_pthread_cond_wait(c=0x0000000153f05ac8, m=0x0000000153f05a88) at posix_thread.c:121:12
frame #3: 0x000000010501b2f8 ws_test`nni_plat_cv_wait(cv=0x0000000153f05ac8) at posix_thread.c:229:2
frame #4: 0x00000001050106d0 ws_test`nni_cv_wait(cv=0x0000000153f05ac8) at thread.c:51:2
frame #5: 0x000000010500efe0 ws_test`nni_taskq_thread(self=0x00000001548096b8) at taskq.c:64:3
frame #6: 0x0000000105010980 ws_test`nni_thr_wrap(arg=0x00000001548096c0) at thread.c:94:3
frame #7: 0x000000010501b5f8 ws_test`nni_plat_thr_main(arg=0x00000001548096c0) at posix_thread.c:266:2
frame #8: 0x00000001864fbfa8 libsystem_pthread.dylib`_pthread_start + 148
thread #9, name = 'nng:task'
frame #0: 0x00000001864bebc8 libsystem_kernel.dylib`__psynch_mutexwait + 8
frame #1: 0x00000001864f90c4 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 84
frame #2: 0x00000001864f6a5c libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 248
frame #3: 0x000000010501af20 ws_test`nni_pthread_mutex_lock(m=0x0000000155004dc0) at posix_thread.c:84:12
frame #4: 0x000000010501aefc ws_test`nni_plat_mtx_lock(mtx=0x0000000155004dc0) at posix_thread.c:146:2
frame #5: 0x0000000105010638 ws_test`nni_mtx_lock(mtx=0x0000000155004dc0) at thread.c:27:2
frame #6: 0x0000000105034d4c ws_test`ws_handler(aio=0x0000000155504ba0) at websocket.c:1526:2
frame #7: 0x0000000105032ab8 ws_test`http_sconn_rxdone(arg=0x0000000155504520) at http_server.c:759:2
frame #8: 0x000000010500ef60 ws_test`nni_taskq_thread(self=0x0000000154809770) at taskq.c:47:4
frame #9: 0x0000000105010980 ws_test`nni_thr_wrap(arg=0x0000000154809778) at thread.c:94:3
frame #10: 0x000000010501b5f8 ws_test`nni_plat_thr_main(arg=0x0000000154809778) at posix_thread.c:266:2
frame #11: 0x00000001864fbfa8 libsystem_pthread.dylib`_pthread_start + 148
thread #10, name = 'nng:task'
frame #0: 0x00000001864bf710 libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x00000001864fc574 libsystem_pthread.dylib`_pthread_cond_wait + 1232
This thread is the websocket trying to complete receive. But the lock in question being asked by the second thread is held by the first, and this second thread needs to complete in order for the first to make progress.
THe http_server_stop thread has somethign like this:
while (!nni_list_empty(&s->conns)) {
nni_cv_wait(&s->cv);
}
But it does this while holding the listener lock from websocket -
nni_mtx_lock(&l->mtx);
if (l->closed) {
nni_mtx_unlock(&l->mtx);
return;
}
l->closed = true;
if (l->started) {
nni_http_server_del_handler(l->server, l->handler);
nni_http_server_stop(l->server);
l->started = false;
}
The thing is, it doesn't need to do this. That is, we do not need to wait for the server connections to drop -- and in fact we don't do this for any other transport. When the listener goes away, it does not necessarily take down the connections that have already been established.
So we need a version of http_server_stop that does not wait.
You have to be somewhat unlucky to hit this, but I believe this is one of the sources of the occasional failures in CI/CD. I ran the test in a tight loop on my mac, and it took about 5 minutes to hit it.
The text was updated successfully, but these errors were encountered:
I've encountered a definite dead-lock in the close logic -- this is reproducible at low frequency/ likelihood just using the ws_test test tool.
Essentially we have two threads:
First:
(So you can see it is waiting on http_server_stop().)
The second thread:
This thread is the websocket trying to complete receive. But the lock in question being asked by the second thread is held by the first, and this second thread needs to complete in order for the first to make progress.
THe http_server_stop thread has somethign like this:
But it does this while holding the listener lock from websocket -
The thing is, it doesn't need to do this. That is, we do not need to wait for the server connections to drop -- and in fact we don't do this for any other transport. When the listener goes away, it does not necessarily take down the connections that have already been established.
So we need a version of http_server_stop that does not wait.
You have to be somewhat unlucky to hit this, but I believe this is one of the sources of the occasional failures in CI/CD. I ran the test in a tight loop on my mac, and it took about 5 minutes to hit it.
The text was updated successfully, but these errors were encountered: