
Occasionally qubes core gets deadlocked #1636

Closed
rootkovska opened this issue Jan 15, 2016 · 9 comments
Labels

  • C: core
  • P: blocker (Priority: blocker. Prevents release or would have prevented release if known prior to release.)
  • R: duplicate (Resolution: Another issue exists that is very similar to or subsumes this one.)
  • r3.1-dom0-stable
  • T: bug (Type: bug report. A problem or defect resulting in unintended behavior in something that exists.)
Milestone
Release 3.1

Comments

@rootkovska (Member)

I cannot find a reliable way to reproduce it, but I have already run into this bug on 2 different, freshly installed 3.1-rc2 systems within just 2 days(!).

I suspect this happens after an unsuccessful attempt to start a DispVM (see #1621) -- the system is then left with a process holding a lock on qubes.xml, preventing any other qvm-* tools, the manager, etc. from working until the misbehaving process is manually killed.

Here are some snippets from the session I encountered today:

[joanna@dom0 qubes]$ lsof | grep qubes.xml
qfile-dae 16007        joanna    3uW     REG              253,2     25490   12061490 /var/lib/qubes/qubes.xml
qubesdb-d 16347        joanna    3u      REG              253,2     25490   12061490 /var/lib/qubes/qubes.xml
qrexec-cl 16364        joanna    3u      REG              253,2     25490   12061490 /var/lib/qubes/qubes.xml
qrexec-cl 16365        joanna    3u      REG              253,2     25490   12061490 /var/lib/qubes/qubes.xml
qvm-run   16367        joanna    4r      REG              253,2     25490   12061490 /var/lib/qubes/qubes.xml
qvm-run   16373        joanna    4r      REG              253,2     25490   12061490 /var/lib/qubes/qubes.xml
qvm-run   16521        joanna    4r      REG              253,2     25490   12061490 /var/lib/qubes/qubes.xml
qvm-run   16801        joanna    4r      REG              253,2     25490   12061490 /var/lib/qubes/qubes.xml
qvm-run   16810        joanna    4r      REG              253,2     25490   12061490 /var/lib/qubes/qubes.xml
qrexec-po 16866        joanna    3r      REG              253,2     25490   12061490 /var/lib/qubes/qubes.xml
 5227 ?        SLs    0:01 /usr/lib/qubes/qrexec-daemon -q 6 work-mutt user
16002 ?        SL     0:00  \_ qrexec-client -d dom0 -c SOCKET7 work-mutt 6 /usr/lib/qubes/qfile-daemon-dvm qubes.
16007 ?        S      0:00      \_ /usr/bin/python2 /usr/lib/qubes/qfile-daemon-dvm qubes.OpenInVM work-mutt DEFAU
16364 ?        SL     0:00          \_ /usr/lib/qubes/qrexec-client -d disp2 user:QUBESRPC qubes.SetMonitorLayout 
16365 ?        SL     0:00          \_ /usr/lib/qubes/qrexec-client -d disp2 root:QUBESRPC qubes.WaitForSession no
16347 ?        SLs    0:00 /usr/sbin/qubesdb-daemon 15 disp2
16350 ?        SLs    0:00 /usr/lib/qubes/qrexec-daemon -q 15 disp2 user
16356 ?        SLs    0:00 /usr/bin/qubes-guid -d 15 -N disp2 -c 0x73d216 -i /usr/share/icons/hicolor/128x128/devi

I then manually killed just the 16007 process, which restored the system to normal (e.g. it then executed all the queued qvm-run commands), although a few other processes related to that DispVM were still hanging:

[joanna@dom0 qubes]$ kill 16007
[joanna@dom0 qubes]$ lsof | grep qubes.xml                                                                           
qubesdb-d 16347        joanna    3u      REG              253,2     25490   12061490 /var/lib/qubes/qubes.xml
qrexec-cl 16364        joanna    3u      REG              253,2     25490   12061490 /var/lib/qubes/qubes.xml
qrexec-cl 16365        joanna    3u      REG              253,2     25490   12061490 /var/lib/qubes/qubes.xml

For hygienic reasons I also xl destroy'ed disp2, which also killed the remaining processes; afterwards lsof | grep qubes.xml showed nothing.

I think it's very important to track down and fix this bug, because a less advanced user would have no choice but to reboot their system, possibly losing all unsaved work. I'm thus giving it top priority.
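
For context on the failure mode: access to qubes.xml is serialized with an advisory file lock (the "3uW" on the qfile-dae line in the lsof output above indicates an exclusive whole-file write lock), so a single stuck holder blocks every other tool. A minimal sketch of that blocking behavior in Python -- illustrative only, not the actual core code:

import fcntl

def lock_qubes_xml_for_writing(path="/var/lib/qubes/qubes.xml"):
    f = open(path, "r+")
    # lockf() blocks indefinitely while any other process holds the
    # exclusive lock -- exactly the hang seen here while the stuck
    # qfile-daemon-dvm (PID 16007) kept holding it.
    fcntl.lockf(f, fcntl.LOCK_EX)
    return f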

@rootkovska added the T: bug, C: core, and P: blocker labels on Jan 15, 2016
@rootkovska added this to the Release 3.1 milestone on Jan 15, 2016
@marmarek added the R: duplicate label on Jan 15, 2016
@marmarek (Member)

This is a duplicate of #1389, already fixed; I've just uploaded the package to current-testing.

@rootkovska (Member, Author)

Sadly, this still affects me on the latest rc3 :/ It is easily reproducible in an environment where lots of VMs are running and there is therefore a shortage of available memory, which apparently causes the DispVM launch to fail. In that case the whole qubes core gets deadlocked, as described above.

FWIW, here is a fragment of the log from the DispVM that started but whose vchan never connected:

Feb 18 09:39:40 fedora-23-ylw-dvm prepare-dvm.sh[374]: Closing windows...
Feb 18 09:39:41 fedora-23-ylw-dvm prepare-dvm.sh[374]: USER        PID ACCESS COMMAND
Feb 18 09:39:41 fedora-23-ylw-dvm prepare-dvm.sh[374]: /rw:                 root     kernel mount /rw
Feb 18 09:39:41 fedora-23-ylw-dvm prepare-dvm.sh[374]: done.
Feb 18 09:39:41 fedora-23-ylw-dvm su[679]: pam_unix(su-l:session): session closed for user user
Feb 18 09:39:41 fedora-23-ylw-dvm audit[679]: USER_END pid=679 uid=0 auid=4294967295 ses=4294967295 msg='op=PAM:session_close grantors=pam_keyinit,pam_keyinit,pam_limits,pam_systemd,pam_unix acct="user" exe="/usr/bin/su" hostname=? addr=? terminal=? res=success'
Feb 18 09:39:41 fedora-23-ylw-dvm audit[679]: CRED_DISP pid=679 uid=0 auid=4294967295 ses=4294967295 msg='op=PAM:setcred grantors=pam_rootok acct="user" exe="/usr/bin/su" hostname=? addr=? terminal=? res=success'
Feb 18 09:39:41 fedora-23-ylw-dvm prepare-dvm.sh[374]: Waiting for save/restore...
Feb 18 09:39:41 fedora-23-ylw-dvm prepare-dvm.sh[374]: Failed to read /qubes-restore-complete
Feb 18 09:39:41 fedora-23-ylw-dvm qubes-gui[474]: libvchan_is_eof
Feb 18 09:39:41 fedora-23-ylw-dvm qubesdb-daemon[181]: vchan closed
Feb 18 09:39:41 fedora-23-ylw-dvm qubesdb-daemon[181]: reconnecting
Feb 18 09:40:37 fedora-23-ylw-dvm systemd[405]: Time has been changed
Feb 18 09:40:37 fedora-23-ylw-dvm pulseaudio[802]: vchan module loading
Feb 18 09:40:37 fedora-23-ylw-dvm pulseaudio[802]: play libvchan_fd_for_select=18, ctrl=0x55cc68d2d3a0

Manually shutting down the half-started VM helped, but until then the whole system was unusable due to the locked qubes.xml.

@rootkovska rootkovska reopened this Feb 21, 2016
@marmarek (Member)

What exact versions of Qubes packages do you have (both dom0 and that template)?

@marmarek (Member)

Also: how long did you wait? There is a 60s timeout on the qrexec connection (configurable via the qrexec_timeout VM pref).

@marmarek (Member)

Ok, found one case where the timeout wasn't enforced. In R3.0+ there are two vchan connections in qrexec:

  • control connection (always between dom0 and VM)
  • data connection (between VMs directly in most cases)

The missing timeout check was in the second one. In the normal case it should connect almost instantly (the control connection is already set up at that point), so any unusual delay here indicates an error.
During DispVM startup there are two consecutive data connections: qubes.WaitForSession, then qubes.OpenInVM (or whichever other service was requested in that DispVM). This means that a 60s timeout per data connection results in a 2-minute "freeze" when DispVM startup fails. Maybe this timeout should be much smaller? Like 10s (which is already way above the normal connection time)?
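
A minimal sketch of the kind of deadline enforcement described above, waiting on a connection file descriptor with select() -- illustrative only, since the real qrexec daemon is written in C and its internals differ:

import select
import time

def wait_for_data_connection(fd, timeout=10):
    # Enforce a deadline on the data-connection setup. In the normal
    # case this returns almost instantly (the control connection is
    # already up), so any wait approaching the timeout is an error.
    deadline = time.time() + timeout
    while True:
        remaining = deadline - time.time()
        if remaining <= 0:
            raise RuntimeError("data connection not established within %ds" % timeout)
        readable, _, _ = select.select([fd], [], [], remaining)
        if readable:
            return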

This is all about handling the error condition in a user-friendly way, regardless of what the actual error is. The actual DispVM startup problem is a separate issue, IMHO with much lower priority (not a release blocker).

@rootkovska (Member, Author)

Yeah, I think any timeout longer than 3-5 seconds doesn't make sense, because it feels like an eternity to the user, who will be more inclined to restart the system than to wait that long. Especially since this effectively locks up all other operations, such as starting other VMs or even listing them (which is of course a serious UX bug even if we reduce the timeouts to ~3s).

@marmarek (Member)

DispVM startup can take 10-20s depending on hardware, so even a 10s timeout doesn't look suspicious. I have a slow machine here and will test a 3s timeout.

Anyway, I agree that this locking mechanism isn't ideal. It is going to improve in R4.1 (with the introduction of "qubesd" and the management API), but we will also need to solve it somehow in R4.0: #1729.

marmarek added a commit to marmarek/old-qubes-core-admin that referenced this issue Feb 23, 2016
Even in case of some exception (in which case theoretically it should be
unlocked at qfile-daemon-dvm exit, but the script may wait for
something).

QubesOS/qubes-issues#1636
marmarek added a commit to marmarek/old-qubes-core-admin that referenced this issue Feb 23, 2016
If getting memory for a new VM fails for any reason, make sure that the global lock is released. Otherwise qmemman will stop functioning entirely.

QubesOS/qubes-issues#1636
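
Both commits apply the same pattern: release the lock on the error path, not only on success. A minimal sketch of that pattern, assuming an R3.x-style QubesVmCollection with lock_db_for_writing()/unlock_db() -- illustrative, not the actual diff; launch_dispvm() is a hypothetical stand-in:

def start_dispvm(qvm_collection):
    # Take the exclusive qubes.xml lock, but guarantee its release even
    # when the start fails (e.g. qmemman cannot find enough memory).
    qvm_collection.lock_db_for_writing()
    try:
        qvm_collection.load()
        launch_dispvm()  # hypothetical stand-in; may raise on failure
        qvm_collection.save()
    finally:
        # Without this, a failed start left the lock held and every
        # qvm-* tool blocked, as described at the top of this issue.
        qvm_collection.unlock_db()
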
@marmarek (Member)

Automated announcement from builder-github

The package qubes-core-dom0-linux-3.1.8-1.fc20 has been pushed to the r3.1 testing repository for dom0.
To test this update, please install it with the following command:

sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing

Changes included in this update

@marmarek (Member)

Automated announcement from builder-github

The package qubes-core-dom0-linux-3.1.8-1.fc20 has been pushed to the r3.1 stable repository for dom0.
To install this update, please use the standard update command:

sudo qubes-dom0-update

Or update dom0 via Qubes Manager.

Changes included in this update

marmarek added a commit to QubesOS/qubes-core-admin that referenced this issue Feb 29, 2016
Even in case of some exception (in which case theoretically it should be
unlocked at qfile-daemon-dvm exit, but the script may wait for
something).

QubesOS/qubes-issues#1636

(cherry picked from commit 5546d67)