
VM instance of QubesDB (sometimes) crashes on DispVM restore #1389

Closed
marmarek opened this issue Nov 7, 2015 · 16 comments
Labels
  • C: core
  • C: Xen
  • P: critical (Priority: critical. Between "major" and "blocker" in severity.)
  • r3.1-dom0-stable
  • T: bug (Type: bug report. A problem or defect resulting in unintended behavior in something that exists.)
Comments

@marmarek
Member

marmarek commented Nov 7, 2015

Nov 07 10:50:05 fedora-21-dvm qubesdb-daemon[155]: vchan closed
Nov 07 10:50:05 fedora-21-dvm qubesdb-daemon[155]: reconnecting
Nov 07 10:51:02 fedora-21-dvm qubesdb-daemon[155]: xc: error: xc_gnttab_map_grant_refs: mmap failed (22 = Invalid argument): Internal error
Nov 07 10:51:02 fedora-21-dvm qubesdb-daemon[155]: vchan reconnection failed

Additionally, when this happens, the gui-agent spins in an endless loop trying to read
a QubesDB watch (waiting for DispVM restore).

@marmarek marmarek added this to the Release 3.1 milestone Nov 7, 2015
@marmarek marmarek added the T: bug (Type: bug report), C: core, and P: minor (Priority: minor) labels Nov 7, 2015
marmarek added a commit to marmarek/old-qubes-gui-agent-linux that referenced this issue Nov 8, 2015
This is already quite fatal situation, but do not make it even worse by
spinning in endless loop.

QubesOS/qubes-issues#1389
marmarek added a commit to QubesOS/qubes-gui-agent-linux that referenced this issue Nov 13, 2015
This is already quite fatal situation, but do not make it even worse by
spinning in endless loop.

QubesOS/qubes-issues#1389

(cherry picked from commit e2da2b8)
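The commits above stop the gui-agent from retrying forever when QubesDB stays unreachable. As a rough illustration of that idea only (the actual gui-agent is C code, and every name and value below is invented for the sketch, not taken from it):

    import time

    MAX_RETRIES = 10          # hypothetical bound on reconnection attempts
    RETRY_DELAY_SEC = 1.0

    def wait_for_dispvm_restore(read_qubesdb_watch):
        # Poll a QubesDB watch a bounded number of times instead of spinning forever.
        for _ in range(MAX_RETRIES):
            value = read_qubesdb_watch()   # assumed to return None while QubesDB is unreachable
            if value is not None:
                return value
            time.sleep(RETRY_DELAY_SEC)    # back off instead of busy-looping
        raise RuntimeError("QubesDB unreachable, giving up instead of spinning")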
@marmarek
Member Author

The same applies to the qrexec-agent data connection, also on DispVM restore. So it looks like a vchan problem.

@marmarek
Member Author

It may cause the DispVM IP address to be wrong, because the /usr/lib/qubes/setup-ip script can't get the right one from QubesDB.

@marmarek marmarek added the P: critical (Priority: critical) label and removed the P: minor (Priority: minor) label Dec 22, 2015
@i7u

i7u commented Dec 24, 2015

A side effect of this bug seems to be:
[user@fedora-23-dvm ~]$ cat /etc/resolv.conf
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB[user@fedora-23-dvm ~]$

This happens 100% of the time when I launch a DispVM on a fresh 3.1rc1 installation.

@Rudd-O

Rudd-O commented Dec 26, 2015

Affects me too. QubesDB crashes. Confirmed via xl console.

@m-v-b

m-v-b commented Jan 4, 2016

Given that this issue is intermittent, there appears to be a small race condition.

I found that removing the following lines in 01QubesDisposableVm.py (in /usr/lib64/python2.7/site-packages/qubes/modules) resolves this issue:

        # close() is not really needed, because the descriptor is close-on-exec
        # anyway, the reason to postpone close() is that possibly xl is not done
        # constructing the domain after its main process exits
        # so we close() when we know the domain is up
        # the successful unpause is some indicator of it
        if qmemman_present:
            qmemman_client.close()

As the comments indicate, the deferred qmemman_client.close() is used as a synchronization point; however, it does not seem to be necessary and appears to cause the issue reported in this bug.

To prove this, I wrote a simple shell script, available at [1].

After removing the aforementioned two lines from 01QubesDisposableVm.py, my script runs to completion, and more importantly, I have never encountered this issue after the proposed modification.

I also found another option (but a rather unacceptable one). Adding a sleep after resuming the virtual machine in the same file, after the following line, also resolves the issue:

    self.libvirt_domain.resume()

[1] https://gist.github.com/m-v-b/5f156ae08d089efd757f
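For reference, a minimal sketch of that second workaround (illustrative only; the helper name and the 5-second value are assumptions, not recommendations):

    import time

    def resume_and_settle(libvirt_domain, settle_seconds=5):
        # Sketch of the workaround: resume the restored DispVM, then sleep before
        # the surrounding code goes on to release qmemman (qmemman_client.close()),
        # so the balloon driver has time to claim the DispVM's assigned memory first.
        libvirt_domain.resume()
        time.sleep(settle_seconds)   # arbitrary delay, hence "rather unacceptable"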

@marmarek
Member Author

marmarek commented Jan 4, 2016

qmemman_client.close() basically resumes qmemman operations (dynamic management of VM memory). So maybe the problem is that the memory used by vchan during connection setup is taken away by qmemman? That would be a kernel bug...

@marmarek
Member Author

marmarek commented Jan 5, 2016

Progress: that xc_gnttab_map_grant_refs: mmap failed is in fact a failed GNTTABOP_map_grant_ref hypercall. The actual error is GNTST_no_device_space (-7) /* Out of space in I/O MMU. */ (I got that using a kernel patch, because it isn't logged anywhere...)

@marmarek
Member Author

marmarek commented Jan 5, 2016

Generally this is a qmemman problem:
When a VM has some memory assigned, it means two things:

  • an upper limit on the VM's memory allocation
  • a target memory size for the balloon driver

When the memory assignment is changed, both values change. But actual memory usage changes only after the balloon driver balloons the VM memory up or down (gives memory back to the hypervisor, or takes it from there). Until that happens, memory assigned but not yet allocated by the VM is considered "free" from the hypervisor's point of view (as shown by xl info; xl list likewise displays actual memory usage, not the "target memory"). In such a case, the VM is free to allocate that memory (up to the assigned limit) at any time.

Exactly this is happening during DispVM startup:

  1. QubesDisposableVM.start requests some memory from qmemman to start the new DispVM: the initial memory size of the DispVM template (the *-dvm VM), e.g. 400MB
  2. The DispVM is restored from a savefile, using only the memory that was allocated at savefile creation time (e.g. 280MB)
  3. Now the DispVM is using some memory (280MB) but is allowed to use the initial size (400MB). The difference (120MB) is considered "free".
  4. qmemman redistributes that free memory among other VMs, leaving a 50MB safety margin
  5. After some time, the DispVM allocates the remaining memory, draining the Xen free pool
  6. Bad Things(tm) happen, in this case grant table operation failures
  7. qmemman adjusts memory assignments, so everything looks OK a moment later (making debugging harder)

Note that this is nothing specific to DispVMs or savefile usage: any VM with a misbehaving balloon driver could trigger such a problem. The core issue is that qmemman doesn't account for memory that is assigned to a VM but not yet used.
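To make the accounting gap concrete, here is a toy sketch (not the actual qmemman code; the names and the host size are invented, while the per-VM numbers follow the example above):

    # Per-VM memory state, in MB. "assigned" is the target/limit given to the VM,
    # "allocated" is what its balloon driver has actually taken so far.
    vms = {
        "fedora-21-dvm": {"assigned": 400, "allocated": 280},   # freshly restored DispVM
        "work":          {"assigned": 2000, "allocated": 2000},
    }
    host_total = 4000
    safety_margin = 50

    # What qmemman effectively did: treat everything not *allocated* as free.
    free_by_allocation = host_total - sum(vm["allocated"] for vm in vms.values())

    # What is actually safe: treat everything not *assigned* as free, because the
    # DispVM may still balloon up to its assignment at any moment.
    free_by_assignment = host_total - sum(vm["assigned"] for vm in vms.values())

    print(free_by_allocation - safety_margin)   # 1670 MB "free": over-commits by 120 MB
    print(free_by_assignment - safety_margin)   # 1550 MB really available to redistribute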

@marmarek
Member Author

marmarek commented Jan 5, 2016

Debugging hint:

watch -n 0.2 xl info\;xl list

And carefully observe the screen during DispVM startup.

@marmarek
Member Author

marmarek commented Jan 7, 2016

Automated announcement from builder-github

The package qubes-core-dom0-3.1.9-1.fc20 has been pushed to the r3.1 testing repository for dom0.
To test this update, please install it with the following command:

sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing

Changes included in this update

@marmarek
Member Author

Automated announcement from builder-github

The package qubes-core-dom0-3.1.10-1.fc20 has been pushed to the r3.1 stable repository for dom0.
To install this update, please use the standard update command:

sudo qubes-dom0-update

Or update dom0 via Qubes Manager.

Changes included in this update

@marmarek
Member Author

Apparently the race condition mentioned here is more common than I thought.

@marmarek marmarek reopened this Jan 14, 2016
marmarek added a commit to marmarek/old-qubes-core-admin that referenced this issue Jan 14, 2016
marmarek added a commit to marmarek/old-qubes-core-admin that referenced this issue Jan 14, 2016
Currently not needed in practice, but a preparation for the next
commit(s).

QubesOS/qubes-issues#1389
marmarek added a commit to marmarek/old-qubes-core-admin that referenced this issue Jan 14, 2016
marmarek added a commit to marmarek/old-qubes-core-admin that referenced this issue Jan 14, 2016
Retrieve a domain list only after obtaining global lock. Otherwise an
outdated list may be used, when a domain was introduced in the meantime
(starting a new domain is done with global lock held), leading to #1389.

QubesOS/qubes-issues#1389
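The ordering described in that commit message, as a minimal sketch (hypothetical helper names, not the actual qubes-core-admin code):

    import threading

    global_lock = threading.Lock()

    def balance_memory(get_domain_list, balance):
        # Wrong order: reading the list first can miss a domain that is being
        # started right now (domain startup happens with the global lock held).
        #
        #     domains = get_domain_list()
        #     with global_lock:
        #         balance(domains)
        #
        # Right order: take the lock first, then read the list, so any domain
        # started before we obtained the lock is already included.
        with global_lock:
            domains = get_domain_list()
            balance(domains)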
@marmarek
Member Author

Automated announcement from builder-github

The package qubes-core-dom0-3.1.11-1.fc20 has been pushed to the r3.1 testing repository for dom0.
To test this update, please install it with the following command:

sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing

Changes included in this update

@marmarek
Member Author

marmarek commented Feb 8, 2016

Automated announcement from builder-github

The package qubes-core-dom0-3.1.11-1.fc20 has been pushed to the r3.1 stable repository for dom0.
To install this update, please use the standard update command:

sudo qubes-dom0-update

Or update dom0 via Qubes Manager.

Changes included in this update

@Rudd-O

Rudd-O commented Feb 10, 2016

I tried to get the updates but it no workie. :-(

@Rudd-O

Rudd-O commented Feb 10, 2016

Wait, the package is already installed. Nice. I'm a klutz.

marmarek added a commit to marmarek/qubes-core-admin-linux that referenced this issue Feb 23, 2016
When qrexec-agent crashes for any reason (for example
QubesOS/qubes-issues#1389), it will never connect back and qrexec-client
will wait forever. In worst case it may happen while holding qubes.xml
write lock (in case of DispVM startup) effectively locking the whole
system.

Fixes QubesOS/qubes-issues#1636
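As a generic illustration of bounding such a wait (not the actual qrexec-client change, which is C code; the socket-based sketch and the 60-second value are assumptions):

    import socket

    def wait_for_agent(listening_sock, timeout_sec=60):
        # Wait for qrexec-agent to connect back, but give up after timeout_sec
        # instead of blocking forever while holding the qubes.xml write lock.
        listening_sock.settimeout(timeout_sec)
        try:
            conn, _ = listening_sock.accept()
            return conn
        except socket.timeout:
            raise RuntimeError("qrexec-agent did not connect back in time")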
marmarek added a commit to QubesOS/qubes-core-admin that referenced this issue Feb 29, 2016
marmarek added a commit to marmarek/qubes-core-admin that referenced this issue Oct 22, 2024
... for the next watcher loop iteration.

If two VMs are started in parallel, there may be no watcher loop
iteration between handling their requests. This means the memory request
for the second VM will operate on outdated list of VMs and may not
account for some allocations (assume memory is free, while in fact it's
already allocated to another VM). If that happens, the second VM may
fail to start due to out of memory error.

This is very similar problem as described in QubesOS/qubes-issues#1389,
but affects actual VM startup, not its auxiliary processes.

Fixes QubesOS/qubes-issues#9431
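A sketch of the synchronization that commit message describes, with invented names (the real qmemman code differs):

    import threading

    domain_list_refreshed = threading.Condition()

    def watcher_loop(refresh_domain_list, stop_event):
        while not stop_event.is_set():
            refresh_domain_list()                   # pick up newly started domains
            with domain_list_refreshed:
                domain_list_refreshed.notify_all()  # wake pending memory requests
            stop_event.wait(0.1)

    def handle_memory_request(do_balance):
        # Block until the next watcher iteration, so the decision is based on a
        # current domain list rather than one that predates a parallel VM start.
        with domain_list_refreshed:
            domain_list_refreshed.wait()
        do_balance()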
marmarek added a commit to marmarek/qubes-core-admin that referenced this issue Oct 22, 2024
marmarek added a commit to marmarek/qubes-core-admin that referenced this issue Oct 23, 2024