Skip to content

Commit

Permalink
iothread: fix iothread hang when stop too soon
Browse files Browse the repository at this point in the history
Lukas reported an hard to reproduce QMP iothread hang on s390 that
QEMU might hang at pthread_join() of the QMP monitor iothread before
quitting:

  Thread 1
  #0  0x000003ffad10932c in pthread_join
  #1  0x0000000109e95750 in qemu_thread_join
      at /home/thuth/devel/qemu/util/qemu-thread-posix.c:570
  #2  0x0000000109c95a1c in iothread_stop
  qemu#3  0x0000000109bb0874 in monitor_cleanup
  qemu#4  0x0000000109b55042 in main

While the iothread is still in the main loop:

  Thread 4
  #0  0x000003ffad0010e4 in ??
  #1  0x000003ffad553958 in g_main_context_iterate.isra.19
  #2  0x000003ffad553d90 in g_main_loop_run
  qemu#3  0x0000000109c9585a in iothread_run
      at /home/thuth/devel/qemu/iothread.c:74
  qemu#4  0x0000000109e94752 in qemu_thread_start
      at /home/thuth/devel/qemu/util/qemu-thread-posix.c:502
  qemu#5  0x000003ffad10825a in start_thread
  qemu#6  0x000003ffad00dcf2 in thread_start

IMHO it's because there's a race between the main thread and iothread
when stopping the thread in following sequence:

    main thread                       iothread
    ===========                       ==============
                                      aio_poll()
    iothread_get_g_main_context
      set iothread->worker_context
    iothread_stop
      schedule iothread_stop_bh
                                        execute iothread_stop_bh [1]
                                          set iothread->running=false
                                          (since main_loop==NULL so
                                           skip to quit main loop.
                                           Note: although main_loop is
                                           NULL but worker_context is
                                           not!)
                                      atomic_read(&iothread->worker_context) [2]
                                        create main_loop object
                                        g_main_loop_run() [3]
    pthread_join() [4]

We can see that when execute iothread_stop_bh() at [1] it's possible
that main_loop is still NULL because it's only created until the first
check of the worker_context later at [2].  Then the iothread will hang
in the main loop [3] and it'll starve the main thread too [4].

Here the simple solution should be that we check again the "running"
variable before check against worker_context.

CC: Thomas Huth <[email protected]>
CC: Dr. David Alan Gilbert <[email protected]>
CC: Stefan Hajnoczi <[email protected]>
CC: Lukáš Doktor <[email protected]>
CC: Markus Armbruster <[email protected]>
CC: Eric Blake <[email protected]>
CC: Paolo Bonzini <[email protected]>
Reported-by: Lukáš Doktor <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
Tested-by: Thomas Huth <[email protected]>
Message-id: [email protected]
Signed-off-by: Stefan Hajnoczi <[email protected]>
  • Loading branch information
xzpeter authored and stefanhaRH committed Feb 12, 2019
1 parent 22c5f44 commit 6c95363
Showing 1 changed file with 5 additions and 1 deletion.
6 changes: 5 additions & 1 deletion iothread.c
Original file line number Diff line number Diff line change
Expand Up @@ -63,7 +63,11 @@ static void *iothread_run(void *opaque)
while (iothread->running) {
aio_poll(iothread->ctx, true);

if (atomic_read(&iothread->worker_context)) {
/*
* We must check the running state again in case it was
* changed in previous aio_poll()
*/
if (iothread->running && atomic_read(&iothread->worker_context)) {
GMainLoop *loop;

g_main_context_push_thread_default(iothread->worker_context);
Expand Down

0 comments on commit 6c95363

Please sign in to comment.