Occasionally qubes core gets deadlocked #1636
Comments
This is a duplicate of #1389, already fixed; I've just uploaded the package to current-testing.
Sadly this still affects me on the latest rc3 :/ It's easily reproducible in an environment where lots of VMs are running and memory is therefore short, which apparently causes the DispVM launch to fail. In that case the whole qubes core gets deadlocked, as described above. FWIW, this is the fragment of the log from the DispVM that got started but whose vchan didn't connect:
Manually shutting down the half-started VM helped, but otherwise the whole system is unusable because qubes core stays locked.
What exact versions of Qubes packages do you have (in both dom0 and that template)?
Also, how long have you waited? There is a 60s timeout on the qrexec connection (configurable using the qrexec_timeout VM preference).
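If the default feels too long, that preference can be lowered from dom0. Here is a minimal sketch, assuming the R3.x core-admin Python API (the QubesVmCollection class and the qrexec_timeout property names are my recollection of that API, so verify them against your installed version; "work" is just a placeholder VM name):

```python
#!/usr/bin/env python2
# Sketch: lower the qrexec connection timeout for one VM from dom0.
# Assumes the R3.x core-admin API (qubes.qubes.QubesVmCollection) and the
# qrexec_timeout VM property -- check both against your version.
from qubes.qubes import QubesVmCollection

qvc = QubesVmCollection()
qvc.lock_db_for_writing()   # takes the same qubes.xml lock discussed in this issue
try:
    qvc.load()
    vm = qvc.get_vm_by_name("work")   # placeholder VM name
    vm.qrexec_timeout = 10            # seconds to wait for the qrexec connection
    qvc.save()
finally:
    qvc.unlock_db()                   # always release the lock, even on error
```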
OK, I found one case where the timeout wasn't enforced. In R3.0+ there are two vchan connections in qrexec: the control connection and the per-service data connection.
The missing timeout check was in the second one. In the normal case it should connect instantly (since the control connection is already set up), so any unusual delay here is an error. All of this is about handling the error condition in a user-friendly way, regardless of what the actual error is. The actual DispVM startup problem is a separate issue, IMHO with much lower priority (not a release blocker).
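This is not the actual qrexec code (the daemon side is written in C), but the missing check amounts to putting a deadline on that second connection, along these lines; wait_for_data_vchan is a hypothetical callable standing in for the real polling logic:

```python
import time

class QrexecTimeoutError(Exception):
    pass

def connect_with_timeout(wait_for_data_vchan, timeout=60, poll_interval=0.1):
    """Poll for the data vchan, giving up after `timeout` seconds.

    `wait_for_data_vchan` is a hypothetical callable that returns the
    connection once the VM side has connected, or None if it hasn't yet.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        conn = wait_for_data_vchan()
        if conn is not None:
            return conn
        time.sleep(poll_interval)
    # The control connection is already up, so a long wait here means the
    # VM side failed to connect the data channel -- report it instead of
    # blocking forever (and holding everything else up).
    raise QrexecTimeoutError("data vchan did not connect within %ds" % timeout)
```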
Yeah, I think any timeout longer than 3-5 seconds doesn't make sense, because it feels like an eternity to the user, who will be more inclined to restart the system than to wait that long. Especially since this effectively locks down all other operations, such as starting other VMs or even listing them (which of course is a serious UX bug even if we reduce the timeout to 3s or so).
DispVM startup can take 10-20s depending on hardware, so even a 10s timeout doesn't look suspicious. I have a fairly slow machine and will test a 3s timeout. Anyway, I agree that this locking mechanism isn't ideal. It is going to improve in R4.1 (with the introduction of "qubesd" and the management API), but in R4.0 we will also need to solve it somehow: #1729.
Release the lock even in case of some exception (in which case it theoretically should be released at qfile-daemon-dvm exit, but the script may be waiting for something). QubesOS/qubes-issues#1636
If getting memory for a new VM fails for any reason, make sure that the global lock is released. Otherwise qmemman will stop functioning entirely. QubesOS/qubes-issues#1636
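The pattern both of those commits describe is releasing the global lock on every exit path, including the failure path. A minimal sketch, with global_lock, balance_memory and start_vm as hypothetical stand-ins for the real qmemman internals:

```python
import threading

global_lock = threading.Lock()   # stand-in for qmemman's global lock

def request_memory_and_start(vm, balance_memory, start_vm):
    """Acquire the global lock, try to free memory for `vm`, then start it.

    `balance_memory` and `start_vm` are hypothetical callables; the point is
    that the lock is released on *every* exit path, so one failed memory
    request cannot wedge qmemman (and with it every later VM start).
    """
    global_lock.acquire()
    try:
        balance_memory(vm)   # may raise if not enough memory can be freed
        return start_vm(vm)
    finally:
        global_lock.release()   # released even when balance_memory() fails
```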
Automated announcement from builder-github: The package
Automated announcement from builder-github: The package
Or update dom0 via Qubes Manager.
Release the lock even in case of some exception (in which case it theoretically should be released at qfile-daemon-dvm exit, but the script may be waiting for something). QubesOS/qubes-issues#1636 (cherry picked from commit 5546d67)
I cannot find a reliable way to reproduce it, but I have already run into this bug on 2 different, freshly installed 3.1-rc2 systems within just 2 days(!).
I suspect this happens after an unsuccessful attempt to start a DispVM (see #1621) -- the system is then left with a process keeping a lock on qubes.xml, preventing any other qvm-* tools, the manager, etc., from working until the misbehaving process gets manually killed. Here are some snippets from the last session, which I encountered today:
I then manually killed just the 16007 process, which restored the system to normal (e.g. it then executed all the scheduled qvm-run's), although a few other processes related to that DispVM were still hanging:
For hygienic reasons I also xl destroy'ed disp2, which killed these remaining processes as well; after that, lsof | grep qubes.xml showed nothing.
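Until a fix lands, here is a rough dom0 sketch of the same diagnostic without lsof: walk /proc and list the processes that hold qubes.xml open, so the stuck one can be inspected or killed (the /var/lib/qubes/qubes.xml path is assumed; adjust if your install differs):

```python
#!/usr/bin/env python
# List PIDs that have qubes.xml open, so the stuck lock holder can be found.
import os

QUBES_XML = "/var/lib/qubes/qubes.xml"   # assumed default dom0 location

def holders(path):
    pids = []
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        fd_dir = "/proc/%s/fd" % pid
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue   # process exited or permission denied
        for fd in fds:
            try:
                if os.readlink(os.path.join(fd_dir, fd)) == path:
                    pids.append(int(pid))
                    break
            except OSError:
                continue
    return pids

if __name__ == "__main__":
    for pid in holders(QUBES_XML):
        with open("/proc/%d/cmdline" % pid) as f:
            print("%d %s" % (pid, f.read().replace("\0", " ").strip()))
```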
I think it's very important to track down and fix this bug, because a less advanced user would have no choice other than to reboot their system, possibly losing all unsaved work. I'm thus giving it top priority.