Automated reboot loop test gets stuck during reset transition #167
Comments
The theory in the issue description is mostly wrong, though it's right enough to help point to a possible real root cause. The dispatcher does know how to signal the queue CV because the block device code submits a callback to tell the dispatcher how to wake it:
propolis/lib/propolis/src/block/mod.rs
Lines 410 to 420 in bcea4d3
The catch here is that the block driver spawns
If I change the code to register a waker for every worker instead of just worker 0, the reboot test gets through 100 reboots without sticking, where it previously got through no more than 20 or so.
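To make the waker-registration difference concrete, here is a toy model of it. It is not the propolis code: `WorkerCtrl`, `State`, `Waker`, and `holds_acked` are hypothetical names invented for this sketch, and each worker sleeps on its own condition variable for simplicity. A dispatcher requests a hold from every worker and calls that worker's waker if one was registered. Workers without a waker stay asleep until the harness shuts them down.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::{Duration, Instant};

// Hypothetical per-worker control state (not the real propolis type).
#[derive(Default)]
struct State {
    parked: bool,   // worker has gone to sleep at least once
    hold: bool,     // dispatcher wants this worker to pause
    shutdown: bool, // test harness wants the thread to exit
}

struct WorkerCtrl {
    state: Mutex<State>,
    cv: Condvar,
}

type Waker = Box<dyn Fn() + Send>;

/// Returns how many workers acknowledged a hold request within `patience`.
fn holds_acked(workers: usize, register_all: bool, patience: Duration) -> usize {
    let (ack_tx, ack_rx) = mpsc::channel::<usize>();
    let mut ctrls = Vec::new();
    let mut wakers: Vec<Option<Waker>> = Vec::new();
    let mut handles = Vec::new();

    for id in 0..workers {
        let ctrl = Arc::new(WorkerCtrl { state: Mutex::new(State::default()), cv: Condvar::new() });
        // The behavior under test: register a waker for worker 0 only, or for all.
        if register_all || id == 0 {
            let c = Arc::clone(&ctrl);
            let waker: Waker = Box::new(move || c.cv.notify_all());
            wakers.push(Some(waker));
        } else {
            wakers.push(None);
        }
        let c = Arc::clone(&ctrl);
        let tx = ack_tx.clone();
        handles.push(thread::spawn(move || {
            let mut st = c.state.lock().unwrap();
            loop {
                if st.shutdown {
                    return;
                }
                if st.hold {
                    let _ = tx.send(id); // acknowledge the hold
                    return;
                }
                st.parked = true;
                st = c.cv.wait(st).unwrap();
            }
        }));
        ctrls.push(ctrl);
    }
    drop(ack_tx);

    // Make sure every worker is asleep before any hold is requested.
    for ctrl in &ctrls {
        while !ctrl.state.lock().unwrap().parked {
            thread::yield_now();
        }
    }
    // Request the hold; only workers with a registered waker get notified.
    for (ctrl, waker) in ctrls.iter().zip(&wakers) {
        ctrl.state.lock().unwrap().hold = true;
        if let Some(wake) = waker {
            wake();
        }
    }
    // Count acknowledgements until all arrive or patience runs out.
    let deadline = Instant::now() + patience;
    let mut acked = 0;
    while acked < workers {
        let left = deadline.saturating_duration_since(Instant::now());
        match ack_rx.recv_timeout(left) {
            Ok(_) => acked += 1,
            Err(_) => break,
        }
    }
    // Release any stuck workers so the threads can be joined.
    for ctrl in &ctrls {
        ctrl.state.lock().unwrap().shutdown = true;
        ctrl.cv.notify_all();
    }
    for h in handles {
        h.join().unwrap();
    }
    acked
}

fn main() {
    let patience = Duration::from_millis(500);
    println!("waker for worker 0 only: {} of 4 acked", holds_acked(4, false, patience));
    println!("waker for every worker:  {} of 4 acked", holds_acked(4, true, patience));
}
```

In this model a worker with no waker can only notice the hold if something else wakes it. That is at least consistent with a hang that shows up on some reboots rather than all of them.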
Repro steps: Write a test/script that launches a Debian 11 no-cloud test VM and runs `sudo reboot` in the guest in a loop. (I have a private fork of the PHD branch that I used to do this.)
Expected: The guest reliably reboots as many times as you ask it to.
Observed: After some iterations, the server logs
INFO Instance transition, state_target: Some(Reset), state_prev: Run, state: Quiesce, task: instance-driver, component: vmm
and then does nothing else.
This is similar to omicron#1126.
Some initial debugging of a wedged Propolis server shows that the dispatcher thread coordinating the quiesce is stuck waiting for one of its children to quiesce:
The locals in frame 11 show that this thread is waiting for 10 tasks. Searching through the results of `thread apply all bt` reveals only nine threads in `WorkerCtrl::check_yield`, suggesting that there's a tenth thread that is out to lunch somewhere.
A little digging yields that seven of the block device control threads are in `check_yield`, but the eighth is doing this instead:
Dumping the value of `sctx` in frame 10 shows that its interior `Arc` pointer is to `0x6b15920`, which lines up with the control state address in frames 9/10 of the dispatcher thread.
This part of the blocking loop is here:
propolis/lib/propolis/src/block/mod.rs
Lines 471 to 479 in bcea4d3
At first glance, this seems fishy to me. The value of `sctx.pending_reqs()` depends on flags in the `WorkerCtrl` that are synchronized with a lock and condition variable in the control block, but that's not the condition variable the worker is waiting on: `cv` in this snippet is the `propolis::block::Driver` CV, and I suspect the dispatcher doesn't know how to signal that.
(There is a separate possible issue here where `sync_tasks::WorkerCtrl::req_hold` doesn't signal the controller's CV either, but amending that routine to signal it doesn't fix this problem.)
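The suspected pattern, a thread sleeping on one condition variable while the flag it is waiting for is guarded and signaled through another, can be reproduced in isolation. The sketch below is a toy model and not the propolis code: `Control`, `Driver`, and `worker_notices_hold` are hypothetical stand-ins, and a `wait_timeout` is used only so the demonstration terminates instead of hanging.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

// Hypothetical stand-in for the control block: the hold flag plus its own CV.
struct Control {
    hold: Mutex<bool>,
    cv: Condvar,
}

// Hypothetical stand-in for the driver: the lock and CV the worker sleeps on.
// The bool only records that the worker has gone to sleep.
struct Driver {
    worker_asleep: Mutex<bool>,
    cv: Condvar,
}

/// Returns true if the sleeping worker was woken and saw the hold flag,
/// false if it slept for the full `patience` without being signaled.
fn worker_notices_hold(signal_driver_cv: bool, patience: Duration) -> bool {
    let ctrl = Arc::new(Control { hold: Mutex::new(false), cv: Condvar::new() });
    let drv = Arc::new(Driver { worker_asleep: Mutex::new(false), cv: Condvar::new() });

    let (c, d) = (Arc::clone(&ctrl), Arc::clone(&drv));
    let worker = thread::spawn(move || {
        let mut asleep = d.worker_asleep.lock().unwrap();
        loop {
            // The predicate lives in the control block...
            if *c.hold.lock().unwrap() {
                return true;
            }
            // ...but the worker sleeps on the driver's CV.
            *asleep = true;
            let (guard, res) = d.cv.wait_timeout(asleep, patience).unwrap();
            asleep = guard;
            if res.timed_out() {
                return false; // nobody signaled the CV we were waiting on
            }
        }
    });

    // Wait until the worker is definitely asleep on the driver CV.
    while !*drv.worker_asleep.lock().unwrap() {
        thread::yield_now();
    }

    // Set the flag and signal the control block's CV, which has no waiters.
    *ctrl.hold.lock().unwrap() = true;
    ctrl.cv.notify_all();

    // Signaling the CV the worker is actually waiting on is what wakes it.
    if signal_driver_cv {
        let _guard = drv.worker_asleep.lock().unwrap();
        drv.cv.notify_all();
    }
    worker.join().unwrap()
}

fn main() {
    let patience = Duration::from_millis(300);
    println!("signal control CV only: noticed = {}", worker_notices_hold(false, patience));
    println!("also signal driver CV:  noticed = {}", worker_notices_hold(true, patience));
}
```

In the real code there is no timeout on the wait, so the first case would show up as a thread that never leaves its blocking loop.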