
Occasionally qubes core gets deadlocked #1636

Closed
rootkovska opened this issue Jan 15, 2016 · 9 comments
Labels

  • C: core
  • P: blocker (Priority: blocker. Prevents release or would have prevented release if known prior to release.)
  • R: duplicate (Resolution: Another issue exists that is very similar to or subsumes this one.)
  • r3.1-dom0-stable
  • T: bug (Type: bug report. A problem or defect resulting in unintended behavior in something that exists.)
Milestone
Release 3.1

Comments

@rootkovska (Member)

I cannot find a reliable way to reproduce it, but I have already run into this bug on 2 different, freshly installed 3.1-rc2 systems within just 2 days(!).

I suspect this happens after an unsuccessful attempt to start a DispVM (see #1621) -- the system is then left with a process holding a lock on qubes.xml, preventing any other qvm-* tools, the manager, etc. from working until the misbehaving process is manually killed.

Here are some snippets from the session I encountered today:

[joanna@dom0 qubes]$ lsof | grep qubes.xml
qfile-dae 16007        joanna    3uW     REG              253,2     25490   12061490 /var/lib/qubes/qubes.xml
qubesdb-d 16347        joanna    3u      REG              253,2     25490   12061490 /var/lib/qubes/qubes.xml
qrexec-cl 16364        joanna    3u      REG              253,2     25490   12061490 /var/lib/qubes/qubes.xml
qrexec-cl 16365        joanna    3u      REG              253,2     25490   12061490 /var/lib/qubes/qubes.xml
qvm-run   16367        joanna    4r      REG              253,2     25490   12061490 /var/lib/qubes/qubes.xml
qvm-run   16373        joanna    4r      REG              253,2     25490   12061490 /var/lib/qubes/qubes.xml
qvm-run   16521        joanna    4r      REG              253,2     25490   12061490 /var/lib/qubes/qubes.xml
qvm-run   16801        joanna    4r      REG              253,2     25490   12061490 /var/lib/qubes/qubes.xml
qvm-run   16810        joanna    4r      REG              253,2     25490   12061490 /var/lib/qubes/qubes.xml
qrexec-po 16866        joanna    3r      REG              253,2     25490   12061490 /var/lib/qubes/qubes.xml
 5227 ?        SLs    0:01 /usr/lib/qubes/qrexec-daemon -q 6 work-mutt user
16002 ?        SL     0:00  \_ qrexec-client -d dom0 -c SOCKET7 work-mutt 6 /usr/lib/qubes/qfile-daemon-dvm qubes.
16007 ?        S      0:00      \_ /usr/bin/python2 /usr/lib/qubes/qfile-daemon-dvm qubes.OpenInVM work-mutt DEFAU
16364 ?        SL     0:00          \_ /usr/lib/qubes/qrexec-client -d disp2 user:QUBESRPC qubes.SetMonitorLayout 
16365 ?        SL     0:00          \_ /usr/lib/qubes/qrexec-client -d disp2 root:QUBESRPC qubes.WaitForSession no
16347 ?        SLs    0:00 /usr/sbin/qubesdb-daemon 15 disp2
16350 ?        SLs    0:00 /usr/lib/qubes/qrexec-daemon -q 15 disp2 user
16356 ?        SLs    0:00 /usr/bin/qubes-guid -d 15 -N disp2 -c 0x73d216 -i /usr/share/icons/hicolor/128x128/devi

I then manually killed just the 16007 process, which restored the system to normal (e.g. it then executed all the queued qvm-run commands), although a few other processes related to that DispVM were still hanging:

[joanna@dom0 qubes]$ kill 16007
[joanna@dom0 qubes]$ lsof | grep qubes.xml                                                                           
qubesdb-d 16347        joanna    3u      REG              253,2     25490   12061490 /var/lib/qubes/qubes.xml
qrexec-cl 16364        joanna    3u      REG              253,2     25490   12061490 /var/lib/qubes/qubes.xml
qrexec-cl 16365        joanna    3u      REG              253,2     25490   12061490 /var/lib/qubes/qubes.xml

For hygienic reasons I also xl destroy'ed disp2, which also killed the remaining processes; afterwards lsof | grep qubes.xml showed nothing.

I think it's very important to track down and fix this bug, because a less advanced user would have no choice but to reboot their system, possibly losing all unsaved work. I'm thus giving it top priority.
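
For context on the failure mode: access to qubes.xml is serialized with an advisory file lock (the "3uW" on the qfile-dae line in the lsof output above indicates an exclusive whole-file write lock), so a single stuck holder blocks every other tool. A minimal sketch of that blocking behavior in Python -- illustrative only, not the actual core code:

import fcntl

def lock_qubes_xml_for_writing(path="/var/lib/qubes/qubes.xml"):
    f = open(path, "r+")
    # lockf() blocks indefinitely while any other process holds the
    # exclusive lock -- exactly the hang seen here while the stuck
    # qfile-daemon-dvm (PID 16007) kept holding it.
    fcntl.lockf(f, fcntl.LOCK_EX)
    return f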

@rootkovska added the T: bug, C: core, and P: blocker labels on Jan 15, 2016
@rootkovska added this to the Release 3.1 milestone on Jan 15, 2016
@marmarek added the R: duplicate label on Jan 15, 2016
@marmarek (Member)

This is a duplicate of #1389, already fixed; I've just uploaded the package to current-testing.

@rootkovska (Member, Author)

Sadly, this still affects me on the latest rc3 :/ It is easily reproducible in an environment where lots of VMs are running and there is therefore a shortage of available memory, which apparently causes the DispVM launch to fail. In that case the whole qubes core gets deadlocked, as described above.

FWIW, here is a fragment of the log from the DispVM that started but whose vchan never connected:

Feb 18 09:39:40 fedora-23-ylw-dvm prepare-dvm.sh[374]: Closing windows...
Feb 18 09:39:41 fedora-23-ylw-dvm prepare-dvm.sh[374]: USER        PID ACCESS COMMAND
Feb 18 09:39:41 fedora-23-ylw-dvm prepare-dvm.sh[374]: /rw:                 root     kernel mount /rw
Feb 18 09:39:41 fedora-23-ylw-dvm prepare-dvm.sh[374]: done.
Feb 18 09:39:41 fedora-23-ylw-dvm su[679]: pam_unix(su-l:session): session closed for user user
Feb 18 09:39:41 fedora-23-ylw-dvm audit[679]: USER_END pid=679 uid=0 auid=4294967295 ses=4294967295 msg='op=PAM:session_close grantors=pam_keyinit,pam_keyinit,pam_limits,pam_systemd,pam_unix acct="user" exe="/usr/bin/su" hostname=? addr=? terminal=? res=success'
Feb 18 09:39:41 fedora-23-ylw-dvm audit[679]: CRED_DISP pid=679 uid=0 auid=4294967295 ses=4294967295 msg='op=PAM:setcred grantors=pam_rootok acct="user" exe="/usr/bin/su" hostname=? addr=? terminal=? res=success'
Feb 18 09:39:41 fedora-23-ylw-dvm prepare-dvm.sh[374]: Waiting for save/restore...
Feb 18 09:39:41 fedora-23-ylw-dvm prepare-dvm.sh[374]: Failed to read /qubes-restore-complete
Feb 18 09:39:41 fedora-23-ylw-dvm qubes-gui[474]: libvchan_is_eof
Feb 18 09:39:41 fedora-23-ylw-dvm qubesdb-daemon[181]: vchan closed
Feb 18 09:39:41 fedora-23-ylw-dvm qubesdb-daemon[181]: reconnecting
Feb 18 09:40:37 fedora-23-ylw-dvm systemd[405]: Time has been changed
Feb 18 09:40:37 fedora-23-ylw-dvm pulseaudio[802]: vchan module loading
Feb 18 09:40:37 fedora-23-ylw-dvm pulseaudio[802]: play libvchan_fd_for_select=18, ctrl=0x55cc68d2d3a0

Manually shutting down the half-started VM helped, but until then the whole system was unusable due to the locked qubes.xml.

@rootkovska rootkovska reopened this Feb 21, 2016
@marmarek (Member)

What exact versions of Qubes packages do you have (both dom0 and that template)?

@marmarek (Member)

Also: how long did you wait? There is a 60s timeout on the qrexec connection (configurable via the qrexec_timeout VM pref).

@marmarek (Member)

Ok, found one case where the timeout wasn't enforced. In R3.0+ there are two vchan connections in qrexec:

  • control connection (always between dom0 and VM)
  • data connection (between VMs directly in most cases)

The missing timeout check was in the second one. In the normal case it should connect almost instantly (the control connection is already set up at that point), so any unusual delay here indicates an error.
During DispVM startup there are two consecutive data connections: qubes.WaitForSession, then qubes.OpenInVM (or whichever other service was requested in that DispVM). This means that a 60s timeout per data connection results in a 2-minute "freeze" when DispVM startup fails. Maybe this timeout should be much smaller? Like 10s (which is already way above the normal connection time)?
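
A minimal sketch of the kind of deadline enforcement described above, waiting on a connection file descriptor with select() -- illustrative only, since the real qrexec daemon is written in C and its internals differ:

import select
import time

def wait_for_data_connection(fd, timeout=10):
    # Enforce a deadline on the data-connection setup. In the normal
    # case this returns almost instantly (the control connection is
    # already up), so any wait approaching the timeout is an error.
    deadline = time.time() + timeout
    while True:
        remaining = deadline - time.time()
        if remaining <= 0:
            raise RuntimeError("data connection not established within %ds" % timeout)
        readable, _, _ = select.select([fd], [], [], remaining)
        if readable:
            return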

This is all about handling the error condition in a user-friendly way, regardless of what the actual error is. The actual DispVM startup problem is a separate issue, IMHO with much lower priority (not a release blocker).

@rootkovska (Member, Author)

Yeah, I think any timeout longer than 3-5 seconds doesn't make sense, because it feels like an eternity to the user, who will be more inclined to restart the system than to wait that long. Especially since this effectively locks up all other operations, such as starting other VMs or even listing them (which is of course a serious UX bug even if we reduce the timeouts to ~3s).

@marmarek (Member)

DispVM startup can take 10-20s depending on hardware, so even a 10s timeout doesn't look suspicious. I have a slow machine here and will test a 3s timeout.

Anyway, I agree that this locking mechanism isn't ideal. It is going to improve in R4.1 (with the introduction of "qubesd" and the management API), but we will also need to solve it somehow in R4.0: #1729.

marmarek added a commit to marmarek/old-qubes-core-admin that referenced this issue Feb 23, 2016
Even in case of some exception (in which case theoretically it should be
unlocked at qfile-daemon-dvm exit, but the script may wait for
something).

QubesOS/qubes-issues#1636
marmarek added a commit to marmarek/old-qubes-core-admin that referenced this issue Feb 23, 2016
If getting memory for a new VM fails for any reason, make sure that the global lock is released. Otherwise qmemman will stop functioning entirely.

QubesOS/qubes-issues#1636
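
Both commits apply the same pattern: release the lock on the error path, not only on success. A minimal sketch of that pattern, assuming an R3.x-style QubesVmCollection with lock_db_for_writing()/unlock_db() -- illustrative, not the actual diff; launch_dispvm() is a hypothetical stand-in:

def start_dispvm(qvm_collection):
    # Take the exclusive qubes.xml lock, but guarantee its release even
    # when the start fails (e.g. qmemman cannot find enough memory).
    qvm_collection.lock_db_for_writing()
    try:
        qvm_collection.load()
        launch_dispvm()  # hypothetical stand-in; may raise on failure
        qvm_collection.save()
    finally:
        # Without this, a failed start left the lock held and every
        # qvm-* tool blocked, as described at the top of this issue.
        qvm_collection.unlock_db()
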
@marmarek (Member)

Automated announcement from builder-github

The package qubes-core-dom0-linux-3.1.8-1.fc20 has been pushed to the r3.1 testing repository for dom0.
To test this update, please install it with the following command:

sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing

Changes included in this update

@marmarek (Member)

Automated announcement from builder-github

The package qubes-core-dom0-linux-3.1.8-1.fc20 has been pushed to the r3.1 stable repository for dom0.
To install this update, please use the standard update command:

sudo qubes-dom0-update

Or update dom0 via Qubes Manager.

Changes included in this update

marmarek added a commit to QubesOS/qubes-core-admin that referenced this issue Feb 29, 2016
Even in case of some exception (in which case theoretically it should be
unlocked at qfile-daemon-dvm exit, but the script may wait for
something).

QubesOS/qubes-issues#1636

(cherry picked from commit 5546d67)