VM instance of QubesDB (sometimes) crashes on DispVM restore #1389
Comments
This is already a quite fatal situation, but do not make it even worse by spinning in an endless loop. QubesOS/qubes-issues#1389
This is already a quite fatal situation, but do not make it even worse by spinning in an endless loop. QubesOS/qubes-issues#1389 (cherry picked from commit e2da2b8)
The same applies to the qrexec-agent data connection, also on DispVM restore. So it looks like a vchan problem.
It may cause the DispVM IP address to be wrong.
A side effect of this bug seems to be: this happens 100% of the time when I launch a DispVM on a fresh 3.1rc1 installation.
Affects me too. QubesDB crashes. Confirmed via xl console.
Given that this issue is intermittent, there appears to be a small race condition. I found that removing the following lines eliminates the problem:

```python
# close() is not really needed, because the descriptor is close-on-exec
# anyway, the reason to postpone close() is that possibly xl is not done
# constructing the domain after its main process exits
# so we close() when we know the domain is up
# the successful unpause is some indicator of it
if qmemman_present:
    qmemman_client.close()
```

As the comments indicate, closing the Qubes memory manager client is used as a synchronization point; however, this does not seem to be necessary and appears to cause the issue reported in this bug. To prove this, I wrote a simple shell script, available at [1]; after removing the aforementioned two lines, I could no longer reproduce the issue.

I also found another option (but a rather unacceptable one): adding a sleep after resuming the virtual machine in the same file, after the following line, also resolves the issue:

```python
self.libvirt_domain.resume()
```
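For illustration only, here is a minimal sketch of the two workarounds described above, assuming a Python start path similar to the quoted snippets; the function and parameter names (`resume_dispvm`, `vm`, `qmemman_present`, `qmemman_client`) are placeholders, not the actual qubes-core-admin code.

```python
import time

def resume_dispvm(vm, qmemman_present, qmemman_client):
    # Workaround (b) from the comment above: sleep briefly after resuming the
    # domain so it is fully up before anything else proceeds. The delay value
    # is arbitrary, which is why the commenter calls it "rather unacceptable".
    vm.libvirt_domain.resume()
    time.sleep(2)

    # Workaround (a) would instead drop this early close() entirely and rely
    # on the close-on-exec flag mentioned in the quoted comments.
    if qmemman_present:
        qmemman_client.close()
```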
Progress: that
Generally this is a qmemman problem:
When memory allocation is changed, both values are changed, but actual memory usage will change only after the balloon driver balloons the VM memory up or down (gives the memory back to the hypervisor, or takes it from there). Until that happens, memory that is assigned but not yet allocated by the VM is considered "free" from the hypervisor's point of view. Exactly this happens during DispVM startup:
Note that this is nothing specific to DispVM, nor to savefile usage. Any VM with a misbehaving balloon driver could trigger such a problem.
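To make the accounting issue concrete, here is a small illustrative sketch; this is not qmemman's actual code, and the field names `target_kb` and `current_kb` are assumptions. It shows how memory that is assigned but not yet claimed by a balloon driver can be double-counted as free:

```python
# Naive view: whatever the hypervisor currently reports as free.
def naive_free_memory(host_free_kb):
    return host_free_kb

# Safer view: subtract memory already promised to domains whose balloon
# drivers have not caught up yet (target above current usage), so a second
# VM start does not double-book the same pages.
def safe_free_memory(host_free_kb, domains):
    pending = sum(
        max(0, d["target_kb"] - d["current_kb"])  # assigned but not yet taken
        for d in domains
    )
    return host_free_kb - pending

# Example: ~1 GB reported free, but a just-started DispVM still has to
# balloon up by roughly 800 MB -- only ~200 MB is actually available.
domains = [{"target_kb": 1_400_000, "current_kb": 600_000}]
print(naive_free_memory(1_000_000))          # 1000000 -- overestimate
print(safe_free_memory(1_000_000, domains))  # 200000
```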
Debugging hint:
And carefully observe the screen during DispVM startup.
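The exact commands from the original hint are not preserved in this transcript; purely as a hypothetical stand-in, the following sketch polls `xl list` from dom0 so the per-domain memory column can be watched while the DispVM starts:

```python
import subprocess
import time

def watch_domain_memory(interval=1.0):
    # Repeatedly dump `xl list`; the Mem column shows each domain's current
    # allocation, which makes ballooning during DispVM startup visible.
    while True:
        out = subprocess.run(["xl", "list"], capture_output=True, text=True)
        print(out.stdout, flush=True)
        time.sleep(interval)

if __name__ == "__main__":
    watch_domain_memory()
```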
Automated announcement from builder-github: the package has been uploaded.
Automated announcement from builder-github: the package has been uploaded. Or update dom0 via Qubes Manager.
Apparently the race condition mentioned here is more common than I thought.
Currently not needed in practice, but a preparation for the next commit(s). QubesOS/qubes-issues#1389
Retrieve the domain list only after obtaining the global lock. Otherwise an outdated list may be used when a domain was introduced in the meantime (starting a new domain is done with the global lock held), leading to #1389. QubesOS/qubes-issues#1389
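A hedged sketch of the locking order this commit describes; `global_lock`, `get_domain_list`, and `do_balance` are placeholder names, not the real qmemman API:

```python
import threading

global_lock = threading.Lock()

def balance_memory_wrong(get_domain_list, do_balance):
    domains = get_domain_list()      # may become stale before the lock is held
    with global_lock:
        do_balance(domains)

def balance_memory_fixed(get_domain_list, do_balance):
    with global_lock:
        domains = get_domain_list()  # snapshot taken while holding the lock
        do_balance(domains)
```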
Automated announcement from builder-github: the package has been uploaded.
Automated announcement from builder-github: the package has been uploaded. Or update dom0 via Qubes Manager.
I tried to get the updates but it no workie. :-(
Wait, the package is already installed. Nice. I'm a klutz.
When qrexec-agent crashes for any reason (for example QubesOS/qubes-issues#1389), it will never connect back and qrexec-client will wait forever. In the worst case this may happen while holding the qubes.xml write lock (in the case of DispVM startup), effectively locking up the whole system. Fixes QubesOS/qubes-issues#1636
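qrexec itself is written in C; purely as an illustration of the failure mode, here is a Python sketch of bounding the wait for the agent connection so a crashed qrexec-agent cannot leave the caller (and the qubes.xml write lock) blocked forever. The names and the timeout value are assumptions:

```python
import select
import socket

AGENT_CONNECT_TIMEOUT = 60  # seconds; placeholder value

def wait_for_agent(listen_sock: socket.socket):
    # Wait for qrexec-agent to connect back, but give up after a bounded time
    # instead of blocking indefinitely while holding global state locked.
    ready, _, _ = select.select([listen_sock], [], [], AGENT_CONNECT_TIMEOUT)
    if not ready:
        raise TimeoutError("qrexec-agent did not connect back in time")
    conn, _addr = listen_sock.accept()
    return conn
```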
QubesOS/qubes-issues#1389 (cherry picked from commit caa75cb)
... for the next watcher loop iteration. If two VMs are started in parallel, there may be no watcher loop iteration between handling their requests. This means the memory request for the second VM will operate on an outdated list of VMs and may not account for some allocations (assuming memory is free while in fact it is already allocated to another VM). If that happens, the second VM may fail to start due to an out-of-memory error. This is a very similar problem to the one described in QubesOS/qubes-issues#1389, but it affects the actual VM startup, not its auxiliary processes. Fixes QubesOS/qubes-issues#9431
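To illustrate the stale-snapshot problem this commit fixes, here is a hedged sketch; `MemoryBalancer`, `list_domains`, and `balance` are invented placeholder names, not qmemman's real interface:

```python
class MemoryBalancer:
    def __init__(self, list_domains):
        self.list_domains = list_domains
        self.domains = list_domains()   # normally refreshed by the watcher loop

    def handle_request_stale(self, req):
        # Uses whatever snapshot the last watcher iteration left behind; two
        # back-to-back requests can both allocate from the same "free" memory.
        return self.balance(self.domains, req)

    def handle_request_fixed(self, req):
        # Re-read the domain list for every request, as the fix describes, so
        # allocations made since the last watcher iteration are visible.
        self.domains = self.list_domains()
        return self.balance(self.domains, req)

    def balance(self, domains, req):
        ...  # memory balancing logic elided
```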
Additionally, when it happens, gui-agent spins in an endless loop trying to read the QubesDB watch (waiting for DispVM restore).
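gui-agent is C code; as a language-agnostic illustration only, the sketch below shows the bounded-retry pattern the commit refers to, with `qdb.read_watch()` and the attempt limit as assumed placeholders:

```python
def wait_for_dispvm_restore(qdb, max_attempts=100):
    # Poll the QubesDB watch a bounded number of times; if the daemon side is
    # gone (the crash described in this issue), fail instead of spinning.
    for _ in range(max_attempts):
        event = qdb.read_watch()  # placeholder API for "read one watch event"
        if event is not None:
            return event
    raise RuntimeError("QubesDB watch never fired; giving up instead of looping")
```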