Automated reboot loop test gets stuck during reset transition #167

Closed
gjcolombo opened this issue Jul 25, 2022 · 1 comment · Fixed by #168
Comments

@gjcolombo
Contributor

Repro steps: Write a test/script that launches a Debian 11 no-cloud test VM and runs sudo reboot in the guest in a loop. (I have a private fork of the PHD branch that I used to do this.)

Expected: The guest reliably reboots as many times as you ask it to.

Observed: After some iterations, the server logs INFO Instance transition, state_target: Some(Reset), state_prev: Run, state: Quiesce, task: instance-driver, component: vmm and then does nothing else.

This is similar to omicron#1126.


Some initial debugging of a wedged Propolis server shows that the dispatcher thread coordinating the quiesce is stuck waiting for one of its children to quiesce:

#0  0xfffffc7fee6b1207 in __lwp_park () from /lib/64/libc.so.1
#1  0xfffffc7fee6aa460 in cond_sleep_queue () from /lib/64/libc.so.1
#2  0xfffffc7fee6aa6bb in cond_wait_queue () from /lib/64/libc.so.1
#3  0xfffffc7fee6aacfa in __cond_wait () from /lib/64/libc.so.1
#4  0xfffffc7fee6aad64 in cond_wait () from /lib/64/libc.so.1
#5  0xfffffc7fee6aad9b in pthread_cond_wait () from /lib/64/libc.so.1
#6  0x0000000003f925e6 in std::sys::unix::locks::pthread_condvar::Condvar::wait (self=0x6b22c30, mutex=0x6b24fa0)
    at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/std/src/sys/unix/locks/pthread_condvar.rs:82
#7  0x0000000004142ee6 in std::sys_common::condvar::Condvar::wait (self=0x6b15940, mutex=0x6b15930)
    at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/std/src/sys_common/condvar.rs:43
#8  0x0000000004293d3c in std::sync::condvar::Condvar::wait<propolis::dispatch::sync_tasks::WorkerCtrlState> (self=0x6b15940,
    guard=...) at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/std/src/sync/condvar.rs:189
#9  0x000000000429325e in std::sync::condvar::Condvar::wait_while<propolis::dispatch::sync_tasks::WorkerCtrlState, propolis::dispatch::sync_tasks::{impl#2}::wait_until_held::{closure_env#0}> (self=0x6b15940, guard=..., condition=...)
    at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/std/src/sync/condvar.rs:247
#10 0x0000000004119a13 in propolis::dispatch::sync_tasks::WorkerCtrl::wait_until_held (self=0x6b15930)
    at lib/propolis/src/dispatch/sync_tasks.rs:354
#11 0x0000000004118b70 in propolis::dispatch::sync_tasks::SyncDispatch::quiesce (self=0x6afc6d8, shared=...)
    at lib/propolis/src/dispatch/sync_tasks.rs:191

The locals in frame 11 show that this thread is waiting for 10 tasks. Searching through the output of thread apply all bt reveals only nine threads in WorkerCtrl::check_yield, suggesting that a tenth thread is out to lunch somewhere.

A little digging shows that seven of the block device control threads are in check_yield, but the eighth is doing this instead:

#7  0x0000000004142ee6 in std::sys_common::condvar::Condvar::wait (self=0x6ac9730, mutex=0x6ac9700)
    at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/std/src/sys_common/condvar.rs:43
#8  0x0000000004293e1c in std::sync::condvar::Condvar::wait<alloc::collections::vec_deque::VecDeque<propolis::block::Request, alloc::alloc::Global>> (self=0x6ac9730, guard=...) at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/std/src/sync/condvar.rs:189
#9  0x0000000004292fa2 in std::sync::condvar::Condvar::wait_while<alloc::collections::vec_deque::VecDeque<propolis::block::Request, alloc::alloc::Global>, propolis::block::{impl#4}::blocking_loop::{closure_env#0}> (self=0x6ac9730, guard=..., condition=...)
    at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/std/src/sync/condvar.rs:247
#10 0x0000000004222b1b in propolis::block::Driver::blocking_loop (self=0x6ac96e0, sctx=0xfffffc7fe59ff960)
    at lib/propolis/src/block/mod.rs:471

Dumping the value of sctx in frame 10 shows that its interior Arc points to 0x6b15920, which lines up with the control state addresses in frames 9 and 10 of the dispatcher thread.

This part of the blocking loop is here:

let _guard = self
    .cv
    .wait_while(guard, |g| {
        // While `sctx.check_yield()` is tempting here, it will
        // block if this thread goes into a quiesce state,
        // excluding all others from the queue lock.
        g.is_empty() && !sctx.pending_reqs()
    })
    .unwrap();

At first glance, this seems fishy to me. The value of sctx.pending_reqs() depends on flags in the WorkerCtrl that are synchronized with a lock and condition variable in the control block, but that isn't the condition variable the worker is waiting on: cv in this snippet is the propolis::block::Driver CV, and I suspect the dispatcher doesn't know how to signal that one.

(There is a separate possible issue here where sync_tasks::WorkerCtrl::req_hold doesn't signal the controller's CV either, but amending that routine to signal it doesn't fix this problem.)
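
To make the hazard concrete: with std::sync::Condvar, a wait_while predicate is only re-evaluated when the condvar the thread is sleeping on gets signaled (or on a spurious wakeup). The following is a minimal sketch, not Propolis code, with made-up names (hold stands in for the control flags, queue/cv for the driver's queue and CV); a timeout keeps the demo bounded where a plain wait_while would sleep indefinitely.

use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::{Duration, Instant};

fn main() {
    // Request queue guarded by its own mutex + condvar; a separate "hold"
    // flag standing in for state owned by a different lock/CV pair.
    let queue = Arc::new((Mutex::new(Vec::<u32>::new()), Condvar::new()));
    let hold = Arc::new(Mutex::new(false));

    let worker = {
        let (queue, hold) = (Arc::clone(&queue), Arc::clone(&hold));
        thread::spawn(move || {
            let (lock, cv) = &*queue;
            let start = Instant::now();
            // The predicate reads `hold`, but nothing that flips `hold`
            // signals `cv`, so the change typically goes unnoticed until the
            // timeout forces a recheck (forever, with a plain `wait_while`).
            let _res = cv
                .wait_timeout_while(lock.lock().unwrap(), Duration::from_secs(2), |q| {
                    q.is_empty() && !*hold.lock().unwrap()
                })
                .unwrap();
            println!("worker noticed the hold after {:?}", start.elapsed());
        })
    };

    thread::sleep(Duration::from_millis(50));
    *hold.lock().unwrap() = true; // flag flipped, condvar never notified
    worker.join().unwrap();
}

On a typical run the worker only reports the hold after the full two seconds, even though the flag changed about 50 ms in.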

@gjcolombo
Contributor Author

The theory above is mostly wrong, though it's right enough to help point to a possible real root cause.

The dispatcher does know how to signal the queue CV because the block device code submits a callback to tell the dispatcher how to wake it:

let wake = if i == 0 {
    let notify_self = Arc::downgrade(self);
    Some(Box::new(move |_ctx: &DispCtx| {
        if let Some(this) = notify_self.upgrade() {
            let _guard = this.queue.lock().unwrap();
            this.cv.notify_all();
        }
    }) as Box<WakeFn>)
} else {
    None
};

The catch here is that the block driver spawns worker_count worker threads, each with its own quiesce registration, but only the first one registers a wake callback. This seems to allow the following race (a rough model in code follows the list):

  1. World begins; workers 0 and 1 start.
  2. Worker 0 reaches the wait_while at line 473; worker 1 is context-switched out in the middle of the worker loop.
  3. Quiesce begins.
  4. The quiescing thread takes worker 0's context lock, sees that req_hold is not set, sets it, and then returns that it did so.
  5. Because worker 0's req_hold went from false to true, the quiesce thread calls its wake callback, which signals the block driver's condition variable.
  6. Worker 1 finally reaches the wait_while and blocks.
  7. The quiescing thread takes worker 1's context lock, sees that req_hold is not set, sets it, and returns that it did so. But this time the driver CV doesn't get signaled, because only worker 0 has a wake routine attached.
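
Here is a rough, self-contained model of that ordering using only std, with hypothetical names (Ctrl, hold, mk_waker); these are not the Propolis types, just the shape of the race. Two workers share one driver-style condvar, each has its own hold flag, and only worker 0 gets a wake callback. A timeout bounds the demo; with a plain wait_while, worker 1 would block forever, which is the wedge seen above.

use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::{Duration, Instant};

// Stand-in for the per-worker control state: a quiesce request flag plus an
// optional wake callback (only worker 0 gets one in this model).
struct Ctrl {
    hold: Mutex<bool>,
    waker: Option<Box<dyn Fn() + Send + Sync>>,
}

fn main() {
    // One driver-style request queue + condvar shared by both workers.
    let driver = Arc::new((Mutex::new(Vec::<u32>::new()), Condvar::new()));

    // Build a waker that locks the queue and signals the shared condvar,
    // mirroring the callback in the snippet above.
    let mk_waker = || {
        let driver = Arc::clone(&driver);
        Box::new(move || {
            let (lock, cv) = &*driver;
            let _g = lock.lock().unwrap();
            cv.notify_all();
        }) as Box<dyn Fn() + Send + Sync>
    };

    // Mirror the `i == 0` check: only worker 0 registers a waker.
    let ctrls: Vec<Arc<Ctrl>> = (0..2)
        .map(|i| {
            Arc::new(Ctrl {
                hold: Mutex::new(false),
                waker: if i == 0 { Some(mk_waker()) } else { None },
            })
        })
        .collect();

    let workers: Vec<_> = ctrls
        .iter()
        .enumerate()
        .map(|(i, ctrl)| {
            let (driver, ctrl) = (Arc::clone(&driver), Arc::clone(ctrl));
            thread::spawn(move || {
                if i == 1 {
                    // Step 2: worker 1 is switched out before it can wait.
                    thread::sleep(Duration::from_millis(200));
                }
                let (lock, cv) = &*driver;
                let start = Instant::now();
                // The 2s timeout only bounds the demo; with a plain
                // `wait_while`, worker 1 would block forever.
                let _res = cv
                    .wait_timeout_while(lock.lock().unwrap(), Duration::from_secs(2), |q| {
                        q.is_empty() && !*ctrl.hold.lock().unwrap()
                    })
                    .unwrap();
                println!("worker {i} noticed its hold request after {:?}", start.elapsed());
            })
        })
        .collect();

    // Steps 3-5: quiesce begins; worker 0's hold is set and its waker fires
    // while worker 1 is still off-CPU, so the notify only reaches worker 0.
    thread::sleep(Duration::from_millis(50));
    *ctrls[0].hold.lock().unwrap() = true;
    if let Some(wake) = &ctrls[0].waker {
        wake();
    }

    // Steps 6-7: worker 1 is now blocked on the condvar. Its hold flag gets
    // set, but it has no waker, so nothing re-signals the condvar.
    thread::sleep(Duration::from_millis(250));
    *ctrls[1].hold.lock().unwrap() = true;

    for w in workers {
        w.join().unwrap();
    }
}

On a typical run, worker 0 reports noticing its hold within tens of milliseconds, while worker 1 only returns when the two-second timeout forces a recheck (or a spurious wakeup intervenes).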

If I change the code to register a waker for every worker instead of just worker 0, the reboot test gets through 100 reboots without sticking, whereas it previously got through no more than 20 or so.
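
Concretely, that change amounts to the registration snippet above without the i == 0 branch, i.e. something along these lines (a sketch only; see #168 for the actual change):

// Every worker gets a waker, so whichever worker's req_hold transition
// triggers a wake, the driver CV is signaled and the blocking loop
// re-evaluates its predicate.
let notify_self = Arc::downgrade(self);
let wake = Some(Box::new(move |_ctx: &DispCtx| {
    if let Some(this) = notify_self.upgrade() {
        let _guard = this.queue.lock().unwrap();
        this.cv.notify_all();
    }
}) as Box<WakeFn>);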
