Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

deadlock in websocket listener close #1733

Closed
gdamore opened this issue Dec 18, 2023 · 0 comments
Closed

deadlock in websocket listener close #1733

gdamore opened this issue Dec 18, 2023 · 0 comments

Comments

@gdamore
Copy link
Contributor

gdamore commented Dec 18, 2023

I've encountered a definite dead-lock in the close logic -- this is reproducible at low frequency/ likelihood just using the ws_test test tool.

Essentially we have two threads:

First:

    frame #0: 0x00000001864bf710 libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x00000001864fc574 libsystem_pthread.dylib`_pthread_cond_wait + 1232
    frame #2: 0x000000010501b324 ws_test`nni_pthread_cond_wait(c=0x00000001550051e0, m=0x00000001550051a0) at posix_thread.c:121:12
    frame #3: 0x000000010501b2f8 ws_test`nni_plat_cv_wait(cv=0x00000001550051e0) at posix_thread.c:229:2
    frame #4: 0x00000001050106d0 ws_test`nni_cv_wait(cv=0x00000001550051e0) at thread.c:51:2
  * frame #5: 0x000000010502fe38 ws_test`http_server_stop(s=0x00000001550050d0) at http_server.c:1074:3
    frame #6: 0x000000010502fd40 ws_test`nni_http_server_stop(s=0x00000001550050d0) at http_server.c:1084:3
    frame #7: 0x00000001050354e4 ws_test`ws_listener_close(arg=0x0000000155004d80) at websocket.c:1757:3
    frame #8: 0x000000010500d884 ws_test`nng_stream_listener_close(l=0x0000000155004d80) at stream.c:213:2
    frame #9: 0x0000000105027a08 ws_test`wstran_listener_close(arg=0x0000000155004d00) at websocket.c:447:2
    frame #10: 0x00000001050037a0 ws_test`nni_listener_stop(l=0x0000000155808a00) at listener.c:418:2
    frame #11: 0x000000010500b4a8 ws_test`nni_listener_shutdown(l=0x0000000155808a00) at socket.c:1667:2
    frame #12: 0x000000010500366c ws_test`nni_listener_close(l=0x0000000155808a00) at listener.c:331:2
    frame #13: 0x00000001050090c0 ws_test`nni_sock_shutdown(sock=0x000000015480b600) at socket.c:696:3
    frame #14: 0x00000001050093c8 ws_test`nni_sock_close(s=0x000000015480b600) at socket.c:780:2
    frame #15: 0x0000000104ff8c44 ws_test`nng_close(s=(id = 3)) at nng.c:45:2
    frame #16: 0x0000000104ff742c ws_test`test_wild_card_port at ws_test.c:82:2

(So you can see it is waiting on http_server_stop().)

The second thread:

    frame #2: 0x000000010501b324 ws_test`nni_pthread_cond_wait(c=0x0000000153f05ac8, m=0x0000000153f05a88) at posix_thread.c:121:12
    frame #3: 0x000000010501b2f8 ws_test`nni_plat_cv_wait(cv=0x0000000153f05ac8) at posix_thread.c:229:2
    frame #4: 0x00000001050106d0 ws_test`nni_cv_wait(cv=0x0000000153f05ac8) at thread.c:51:2
    frame #5: 0x000000010500efe0 ws_test`nni_taskq_thread(self=0x00000001548096b8) at taskq.c:64:3
    frame #6: 0x0000000105010980 ws_test`nni_thr_wrap(arg=0x00000001548096c0) at thread.c:94:3
    frame #7: 0x000000010501b5f8 ws_test`nni_plat_thr_main(arg=0x00000001548096c0) at posix_thread.c:266:2
    frame #8: 0x00000001864fbfa8 libsystem_pthread.dylib`_pthread_start + 148
  thread #9, name = 'nng:task'
    frame #0: 0x00000001864bebc8 libsystem_kernel.dylib`__psynch_mutexwait + 8
    frame #1: 0x00000001864f90c4 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 84
    frame #2: 0x00000001864f6a5c libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 248
    frame #3: 0x000000010501af20 ws_test`nni_pthread_mutex_lock(m=0x0000000155004dc0) at posix_thread.c:84:12
    frame #4: 0x000000010501aefc ws_test`nni_plat_mtx_lock(mtx=0x0000000155004dc0) at posix_thread.c:146:2
    frame #5: 0x0000000105010638 ws_test`nni_mtx_lock(mtx=0x0000000155004dc0) at thread.c:27:2
    frame #6: 0x0000000105034d4c ws_test`ws_handler(aio=0x0000000155504ba0) at websocket.c:1526:2
    frame #7: 0x0000000105032ab8 ws_test`http_sconn_rxdone(arg=0x0000000155504520) at http_server.c:759:2
    frame #8: 0x000000010500ef60 ws_test`nni_taskq_thread(self=0x0000000154809770) at taskq.c:47:4
    frame #9: 0x0000000105010980 ws_test`nni_thr_wrap(arg=0x0000000154809778) at thread.c:94:3
    frame #10: 0x000000010501b5f8 ws_test`nni_plat_thr_main(arg=0x0000000154809778) at posix_thread.c:266:2
    frame #11: 0x00000001864fbfa8 libsystem_pthread.dylib`_pthread_start + 148
  thread #10, name = 'nng:task'
    frame #0: 0x00000001864bf710 libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x00000001864fc574 libsystem_pthread.dylib`_pthread_cond_wait + 1232

This thread is the websocket trying to complete receive. But the lock in question being asked by the second thread is held by the first, and this second thread needs to complete in order for the first to make progress.

THe http_server_stop thread has somethign like this:

	while (!nni_list_empty(&s->conns)) {
		nni_cv_wait(&s->cv);
	}

But it does this while holding the listener lock from websocket -

	nni_mtx_lock(&l->mtx);
	if (l->closed) {
		nni_mtx_unlock(&l->mtx);
		return;
	}
	l->closed = true;
	if (l->started) {
		nni_http_server_del_handler(l->server, l->handler);
		nni_http_server_stop(l->server);
		l->started = false;
	}

The thing is, it doesn't need to do this. That is, we do not need to wait for the server connections to drop -- and in fact we don't do this for any other transport. When the listener goes away, it does not necessarily take down the connections that have already been established.

So we need a version of http_server_stop that does not wait.

You have to be somewhat unlucky to hit this, but I believe this is one of the sources of the occasional failures in CI/CD. I ran the test in a tight loop on my mac, and it took about 5 minutes to hit it.

@gdamore gdamore changed the title deadlock in websocket close deadlock in websocket listener close Dec 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant