Automated reboot loop test gets stuck during reset transition #167

Closed
gjcolombo opened this issue Jul 25, 2022 · 1 comment · Fixed by #168
Comments

@gjcolombo
Contributor

Repro steps: Write a test/script that launches a Debian 11 no-cloud test VM and runs sudo reboot in the guest in a loop. (I have a private fork of the PHD branch that I used to do this.)

Expected: The guest reliably reboots as many times as you ask it to.

Observed: After some iterations, the server logs INFO Instance transition, state_target: Some(Reset), state_prev: Run, state: Quiesce, task: instance-driver, component: vmm and then does nothing else.

This is similar to omicron#1126.


Some initial debugging of a wedged Propolis server shows that the dispatcher thread coordinating the quiesce is stuck waiting for one of its children to quiesce:

#0  0xfffffc7fee6b1207 in __lwp_park () from /lib/64/libc.so.1
#1  0xfffffc7fee6aa460 in cond_sleep_queue () from /lib/64/libc.so.1
#2  0xfffffc7fee6aa6bb in cond_wait_queue () from /lib/64/libc.so.1
#3  0xfffffc7fee6aacfa in __cond_wait () from /lib/64/libc.so.1
#4  0xfffffc7fee6aad64 in cond_wait () from /lib/64/libc.so.1
#5  0xfffffc7fee6aad9b in pthread_cond_wait () from /lib/64/libc.so.1
#6  0x0000000003f925e6 in std::sys::unix::locks::pthread_condvar::Condvar::wait (self=0x6b22c30, mutex=0x6b24fa0)
    at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/std/src/sys/unix/locks/pthread_condvar.rs:82
#7  0x0000000004142ee6 in std::sys_common::condvar::Condvar::wait (self=0x6b15940, mutex=0x6b15930)
    at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/std/src/sys_common/condvar.rs:43
#8  0x0000000004293d3c in std::sync::condvar::Condvar::wait<propolis::dispatch::sync_tasks::WorkerCtrlState> (self=0x6b15940,
    guard=...) at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/std/src/sync/condvar.rs:189
#9  0x000000000429325e in std::sync::condvar::Condvar::wait_while<propolis::dispatch::sync_tasks::WorkerCtrlState, propolis::dispatch::sync_tasks::{impl#2}::wait_until_held::{closure_env#0}> (self=0x6b15940, guard=..., condition=...)
    at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/std/src/sync/condvar.rs:247
#10 0x0000000004119a13 in propolis::dispatch::sync_tasks::WorkerCtrl::wait_until_held (self=0x6b15930)
    at lib/propolis/src/dispatch/sync_tasks.rs:354
#11 0x0000000004118b70 in propolis::dispatch::sync_tasks::SyncDispatch::quiesce (self=0x6afc6d8, shared=...)
    at lib/propolis/src/dispatch/sync_tasks.rs:191

The locals in frame 11 show that this thread is waiting for 10 tasks. Searching through the output of thread apply all bt reveals only nine threads in WorkerCtrl::check_yield, suggesting that a tenth thread is out to lunch somewhere.

A little digging shows that seven of the block device control threads are in check_yield, but the eighth is doing this instead:

#7  0x0000000004142ee6 in std::sys_common::condvar::Condvar::wait (self=0x6ac9730, mutex=0x6ac9700)
    at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/std/src/sys_common/condvar.rs:43
#8  0x0000000004293e1c in std::sync::condvar::Condvar::wait<alloc::collections::vec_deque::VecDeque<propolis::block::Request, alloc::alloc::Global>> (self=0x6ac9730, guard=...) at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/std/src/sync/condvar.rs:189
#9  0x0000000004292fa2 in std::sync::condvar::Condvar::wait_while<alloc::collections::vec_deque::VecDeque<propolis::block::Request, alloc::alloc::Global>, propolis::block::{impl#4}::blocking_loop::{closure_env#0}> (self=0x6ac9730, guard=..., condition=...)
    at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/std/src/sync/condvar.rs:247
#10 0x0000000004222b1b in propolis::block::Driver::blocking_loop (self=0x6ac96e0, sctx=0xfffffc7fe59ff960)
    at lib/propolis/src/block/mod.rs:471

Dumping the value of sctx in frame 10 shows that its interior Arc points to 0x6b15920, which lines up with the control state addresses in frames 9 and 10 of the dispatcher thread.

This part of the blocking loop is here:

let _guard = self
    .cv
    .wait_while(guard, |g| {
        // While `sctx.check_yield()` is tempting here, it will
        // block if this thread goes into a quiesce state,
        // excluding all others from the queue lock.
        g.is_empty() && !sctx.pending_reqs()
    })
    .unwrap();

At first glance, this seems fishy to me. The value of sctx.pending_reqs() depends on flags in the WorkerCtrl that are synchronized with a lock and condition variable in the control block, but that isn't the condition variable the worker is waiting on: cv in this snippet is the propolis::block::Driver CV, and I suspect the dispatcher doesn't know how to signal that one.

(There is a separate possible issue here where sync_tasks::WorkerCtrl::req_hold doesn't signal the controller's CV either, but amending that routine to signal it doesn't fix this problem.)
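
To make the hazard concrete: with std::sync::Condvar, a wait_while predicate is only re-evaluated when the condvar the thread is sleeping on gets signaled (or on a spurious wakeup). The following is a minimal sketch, not Propolis code, with made-up names (hold stands in for the control flags, queue/cv for the driver's queue and CV); a timeout keeps the demo bounded where a plain wait_while would sleep indefinitely.

use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::{Duration, Instant};

fn main() {
    // Request queue guarded by its own mutex + condvar; a separate "hold"
    // flag standing in for state owned by a different lock/CV pair.
    let queue = Arc::new((Mutex::new(Vec::<u32>::new()), Condvar::new()));
    let hold = Arc::new(Mutex::new(false));

    let worker = {
        let (queue, hold) = (Arc::clone(&queue), Arc::clone(&hold));
        thread::spawn(move || {
            let (lock, cv) = &*queue;
            let start = Instant::now();
            // The predicate reads `hold`, but nothing that flips `hold`
            // signals `cv`, so the change typically goes unnoticed until the
            // timeout forces a recheck (forever, with a plain `wait_while`).
            let _res = cv
                .wait_timeout_while(lock.lock().unwrap(), Duration::from_secs(2), |q| {
                    q.is_empty() && !*hold.lock().unwrap()
                })
                .unwrap();
            println!("worker noticed the hold after {:?}", start.elapsed());
        })
    };

    thread::sleep(Duration::from_millis(50));
    *hold.lock().unwrap() = true; // flag flipped, condvar never notified
    worker.join().unwrap();
}

On a typical run the worker only reports the hold after the full two seconds, even though the flag changed about 50 ms in.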

@gjcolombo
Contributor Author

The theory above is mostly wrong, though it's right enough to help point to a possible real root cause.

The dispatcher does know how to signal the queue CV because the block device code submits a callback to tell the dispatcher how to wake it:

let wake = if i == 0 {
    let notify_self = Arc::downgrade(self);
    Some(Box::new(move |_ctx: &DispCtx| {
        if let Some(this) = notify_self.upgrade() {
            let _guard = this.queue.lock().unwrap();
            this.cv.notify_all();
        }
    }) as Box<WakeFn>)
} else {
    None
};

The catch here is that the block driver spawns worker_count worker threads, each with its own quiesce registration, but only the first one registers a wake callback. This seems to allow the following race (a rough model in code follows the list):

  1. World begins; workers 0 and 1 start.
  2. Worker 0 reaches the wait_while at line 473; worker 1 is context-switched out in the middle of the worker loop.
  3. Quiesce begins.
  4. The quiescing thread takes worker 0's context lock, sees that req_hold is not set, sets it, and then returns that it did so.
  5. Because worker 0's req_hold went from false to true, the quiesce thread calls its wake callback, which signals the block driver's condition variable.
  6. Worker 1 finally reaches the wait_while and blocks.
  7. The quiescing thread takes worker 1's context lock, sees that req_hold is not set, sets it, and returns that it did so. But this time the driver CV doesn't get signaled, because only worker 0 has a wake routine attached.
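
Here is a rough, self-contained model of that ordering using only std, with hypothetical names (Ctrl, hold, mk_waker); these are not the Propolis types, just the shape of the race. Two workers share one driver-style condvar, each has its own hold flag, and only worker 0 gets a wake callback. A timeout bounds the demo; with a plain wait_while, worker 1 would block forever, which is the wedge seen above.

use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::{Duration, Instant};

// Stand-in for the per-worker control state: a quiesce request flag plus an
// optional wake callback (only worker 0 gets one in this model).
struct Ctrl {
    hold: Mutex<bool>,
    waker: Option<Box<dyn Fn() + Send + Sync>>,
}

fn main() {
    // One driver-style request queue + condvar shared by both workers.
    let driver = Arc::new((Mutex::new(Vec::<u32>::new()), Condvar::new()));

    // Build a waker that locks the queue and signals the shared condvar,
    // mirroring the callback in the snippet above.
    let mk_waker = || {
        let driver = Arc::clone(&driver);
        Box::new(move || {
            let (lock, cv) = &*driver;
            let _g = lock.lock().unwrap();
            cv.notify_all();
        }) as Box<dyn Fn() + Send + Sync>
    };

    // Mirror the `i == 0` check: only worker 0 registers a waker.
    let ctrls: Vec<Arc<Ctrl>> = (0..2)
        .map(|i| {
            Arc::new(Ctrl {
                hold: Mutex::new(false),
                waker: if i == 0 { Some(mk_waker()) } else { None },
            })
        })
        .collect();

    let workers: Vec<_> = ctrls
        .iter()
        .enumerate()
        .map(|(i, ctrl)| {
            let (driver, ctrl) = (Arc::clone(&driver), Arc::clone(ctrl));
            thread::spawn(move || {
                if i == 1 {
                    // Step 2: worker 1 is switched out before it can wait.
                    thread::sleep(Duration::from_millis(200));
                }
                let (lock, cv) = &*driver;
                let start = Instant::now();
                // The 2s timeout only bounds the demo; with a plain
                // `wait_while`, worker 1 would block forever.
                let _res = cv
                    .wait_timeout_while(lock.lock().unwrap(), Duration::from_secs(2), |q| {
                        q.is_empty() && !*ctrl.hold.lock().unwrap()
                    })
                    .unwrap();
                println!("worker {i} noticed its hold request after {:?}", start.elapsed());
            })
        })
        .collect();

    // Steps 3-5: quiesce begins; worker 0's hold is set and its waker fires
    // while worker 1 is still off-CPU, so the notify only reaches worker 0.
    thread::sleep(Duration::from_millis(50));
    *ctrls[0].hold.lock().unwrap() = true;
    if let Some(wake) = &ctrls[0].waker {
        wake();
    }

    // Steps 6-7: worker 1 is now blocked on the condvar. Its hold flag gets
    // set, but it has no waker, so nothing re-signals the condvar.
    thread::sleep(Duration::from_millis(250));
    *ctrls[1].hold.lock().unwrap() = true;

    for w in workers {
        w.join().unwrap();
    }
}

On a typical run, worker 0 reports noticing its hold within tens of milliseconds, while worker 1 only returns when the two-second timeout forces a recheck (or a spurious wakeup intervenes).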

If I change the code to register a waker for every worker instead of just worker 0, the reboot test gets through 100 reboots without sticking, whereas it previously got through no more than 20 or so.
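
Concretely, that change amounts to the registration snippet above without the i == 0 branch, i.e. something along these lines (a sketch only; see #168 for the actual change):

// Every worker gets a waker, so whichever worker's req_hold transition
// triggers a wake, the driver CV is signaled and the blocking loop
// re-evaluates its predicate.
let notify_self = Arc::downgrade(self);
let wake = Some(Box::new(move |_ctx: &DispCtx| {
    if let Some(this) = notify_self.upgrade() {
        let _guard = this.queue.lock().unwrap();
        this.cv.notify_all();
    }
}) as Box<WakeFn>);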
