
Suspend to RAM randomly crashes VM in R4.1 #7283

Closed
logoerthiner1 opened this issue Feb 18, 2022 · 37 comments
Labels
affects-4.1 This issue affects Qubes OS 4.1. C: kernel diagnosed Technical diagnosis has been performed (see issue comments). hardware support P: default Priority: default. Default priority for new issues, to be replaced given sufficient information. r4.1-bookworm-stable r4.1-bullseye-stable r4.1-buster-stable r4.1-centos-stream8-stable r4.1-dom0-stable T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists.

Comments

@logoerthiner1

logoerthiner1 commented Feb 18, 2022

How to file a helpful issue

I installed R4.1 on my new ThinkPad L15 Gen 2 and found a way to enable Linux S3 sleep. S3 sleep works fine on Windows, but on Qubes OS it runs into several problems (actually, I was surprised that suspend and wake-up even partially work).

Basically, suspend (1) immediately panics my sys-net because of an MT7921 driver NULL pointer dereference inside kernel-latest (5.15.14-1.fc32), (2) randomly kills some of my running appVMs, and (3) causes a long lag when waking up (~15 s from pressing the power button to the xscreensaver screen, with the power-button light blinking as if the machine were still suspending).

I would like to try handling (1) myself, and (3) does not seem easy to figure out, so this issue focuses on (2).

It might be related to #6411 since my CPU is i5-1135G7.

Qubes OS release

R4.1

Brief summary

Suspend randomly shuts down running VMs, with Xen reporting:

d?v0 Unexpected vmexit: reason 3
domain_crash called from vmx.c:4304

Steps to reproduce

Suspend, then wake up.

Expected behavior

After waking up, all VMs that were running before suspension are still running.

Actual behavior

Some random VMs are shut down. It looks as if the VMs are shut down during wake-up, but I cannot rule out that the crash happens during suspend.

Log files for reference

https://pastebin.mozilla.org/84hYCS5v

Edit

I tried disabling Hyper-Threading, and it only cuts the lag time of (3) in half; VMs still randomly exit.

@logoerthiner1 logoerthiner1 added P: default Priority: default. Default priority for new issues, to be replaced given sufficient information. T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists. labels Feb 18, 2022
@andrewdavidwong andrewdavidwong added C: kernel hardware support needs diagnosis Requires technical diagnosis from developer. Replace with "diagnosed" or remove if otherwise closed. labels Feb 18, 2022
@andrewdavidwong andrewdavidwong added this to the Release 4.1 updates milestone Feb 18, 2022
@logoerthiner1
Author

logoerthiner1 commented Feb 20, 2022

An update on my findings.

I no-op'ed the internal.SuspendPost logic (or something like it, in /usr/lib/python3.8/qubes/api/internal.py) and most of internal.SuspendPre (keeping only the suspend part and removing the SuspendPre event broadcast), so that I can pause and unpause the VMs manually before and after suspension; then I suspended and woke up the machine.
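The manual pause/unpause workaround can be scripted. A minimal sketch, assuming the standard dom0 CLI tools qvm-ls, qvm-pause, and qvm-unpause (check the --running/--raw-list flags against qvm-ls --help on your release):

```python
import subprocess

def pause_commands(vms):
    """Build qvm-pause invocations for every running VM except dom0."""
    return [["qvm-pause", vm] for vm in vms if vm != "dom0"]

def unpause_commands(vms):
    """Build the matching qvm-unpause invocations."""
    return [["qvm-unpause", vm] for vm in vms if vm != "dom0"]

def main():
    # Run in dom0: list running VMs, pause them, suspend, then unpause.
    vms = subprocess.check_output(
        ["qvm-ls", "--running", "--raw-list"], text=True).split()
    for cmd in pause_commands(vms):
        subprocess.run(cmd, check=True)
    input("Suspend now; press Enter after resume... ")
    for cmd in unpause_commands(vms):
        subprocess.run(cmd, check=True)
```

This only automates the workaround described above; it does not address the underlying bug.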

The result: waking up still takes too long, and after waking up (all VMs paused except dom0), the dom0 hypervisor log says things like:

[2022-02-17 20:11:38] (XEN) Enabling non-boot CPUs ...
[2022-02-17 20:11:38] (XEN) Stuck ??
[2022-02-17 20:11:38] (XEN) Error bringing CPU1 up: -5
[2022-02-17 20:11:38] (XEN) Stuck ??
[2022-02-17 20:11:38] (XEN) Error bringing CPU2 up: -5
[2022-02-17 20:11:38] (XEN) Stuck ??
[2022-02-17 20:11:38] (XEN) Error bringing CPU3 up: -5
[2022-02-17 20:11:38] (XEN) Stuck ??
[2022-02-17 20:11:38] (XEN) Error bringing CPU4 up: -5
[2022-02-17 20:11:38] (XEN) Stuck ??
[2022-02-17 20:11:38] (XEN) Error bringing CPU5 up: -5
[2022-02-17 20:11:38] (XEN) Stuck ??
[2022-02-17 20:11:38] (XEN) Error bringing CPU6 up: -5
[2022-02-17 20:11:38] (XEN) Stuck ??
[2022-02-17 20:11:38] (XEN) Error bringing CPU7 up: -5

And then, when I manually unpause a VM, the first VM I unpause crashes in Xen. The hypervisor log looks like:

[2022-02-17 20:11:39] (XEN) d1v0 Unexpected vmexit: reason 3
[2022-02-17 20:11:39] (XEN) domain_crash called from vmx.c:4304
[2022-02-17 20:11:39] (XEN) Domain 1 (vcpu#0) crashed on cpu#0:
[2022-02-17 20:11:39] (XEN) ----[ Xen-4.14.3 x86_64 debug=n Not tainted ]----
[2022-02-17 20:11:39] (XEN) CPU: 0
[2022-02-17 20:11:39] (XEN) RIP: 0010:[<ffffffff890023a8>]
[2022-02-17 20:11:39] (XEN) RFLAGS: 0000000000000002 CONTEXT: hvm guest (d1v0)
[2022-02-17 20:11:39] (XEN) rax: 0000000000000001 rbx: ffffa55ac015be74 rcx: 00000000ffffffff
[2022-02-17 20:11:39] (XEN) rdx: 0000000000000000 rsi: ffffa55ac008fe44 rdi: 0000000000000002
[2022-02-17 20:11:39] (XEN) rbp: ffffa55ac015bdf4 rsp: ffffa55ac008fe38 r8: 00000897ac38a9f8
[2022-02-17 20:11:39] (XEN) r9: 0000000000000002 r10: 0000000000000000 r11: 0000000000000000
[2022-02-17 20:11:39] (XEN) r12: 0000000000000000 r13: 0000000000000000 r14: 0000000000000000
[2022-02-17 20:11:39] (XEN) r15: 0000000000000003 cr0: 0000000080050033 cr4: 0000000000770ef0
[2022-02-17 20:11:39] (XEN) cr3: 0000000002748001 cr2: 00006202870c6070
[2022-02-17 20:11:39] (XEN) fsb: 0000000000000000 gsb: ffff94552f600000 gss: 0000000000000000
[2022-02-17 20:11:39] (XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0018 cs: 0010

Pausing and unpausing VMs is not problematic at all when S3 sleep is not involved, so I believe Xen must have a problem dealing with Linux S3 sleep. "Error bringing CPU1 up: -5" and "d1v0 Unexpected vmexit: reason 3" indicate two separate problems inside Xen's sleep logic.

I skimmed the Xen 4.14.4 source code and found that:

-5 == -EIO
3 == EXIT_REASON_INIT
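These values can be sanity-checked from Python: errno numbering matches the kernel's on Linux, and basic exit reason 3 is the "INIT signal" entry in the Intel SDM's VM-exit reason table (the constant name below mirrors Xen's VMX header, for illustration):

```python
import errno

# -5 returned from CPU bring-up is -EIO in the kernel/Xen errno convention.
assert errno.EIO == 5
assert errno.errorcode[errno.EIO] == "EIO"

# VM-exit basic reason 3 is "INIT signal" (Intel SDM); Xen names it
# EXIT_REASON_INIT in its VMX headers.
EXIT_REASON_INIT = 3
```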

I still have no clue why the first VM to be resumed gets killed with a VMExit of EXIT_REASON_INIT. The EIO part, however, seems to originate from xen/arch/x86/smpboot.c:609:

        if ( cpu_state == CPU_STATE_CALLIN )
        {
            /* number CPUs logically, starting from 1 (BSP is 0) */
            Dprintk("OK.\n");
            print_cpu_info(cpu);
            synchronize_tsc_master(cpu);
            Dprintk("CPU has booted.\n");
        }
        else if ( cpu_state == CPU_STATE_DEAD )
        {
            smp_rmb();
            rc = cpu_error;
        }
        else
        {
            boot_error = 1;
            smp_mb();
            if ( bootsym(trampoline_cpu_started) == 0xA5 )
                /* trampoline started but...? */
                printk("Stuck ??\n"); // <= THIS LINE
            else
                /* trampoline code not run */
                printk("Not responding.\n");
        }

so either the AP has timed out, or it never set its state to CPU_STATE_CALLIN.
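To make the three outcomes of that check explicit, here is an illustrative Python transcription of the branch logic (not Xen API; state names mirror smpboot.c):

```python
CPU_STATE_CALLIN = "callin"   # AP checked in successfully
CPU_STATE_DEAD = "dead"       # AP reported an explicit error
TRAMPOLINE_MAGIC = 0xA5       # marker the trampoline writes once it runs

def classify_ap_boot(cpu_state, trampoline_cpu_started):
    """Mirror the three-way check in xen/arch/x86/smpboot.c."""
    if cpu_state == CPU_STATE_CALLIN:
        return "CPU has booted."
    if cpu_state == CPU_STATE_DEAD:
        return "dead (rc = cpu_error)"
    # boot_error path: in the log above this ultimately surfaces as -EIO.
    if trampoline_cpu_started == TRAMPOLINE_MAGIC:
        return "Stuck ??"          # trampoline started but then hung
    return "Not responding."       # trampoline code never ran
```

So the "Stuck ??" lines in the log mean the trampoline did run (the 0xA5 marker was written) but the AP never reached CPU_STATE_CALLIN.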

It is probably better to let a Xen expert diagnose this further.

Again, my CPU is i5-1135G7 and I enabled Linux S3 in my BIOS.

@andyhhp

andyhhp commented Feb 20, 2022

The line of -EIO's to begin with means the APs aren't responding to an INIT-SIPI-SIPI sequence to start them up. Whatever the firmware has done (or not done), they're not in a working state. Edit: Not true, it turns out.

The VM "crashing" actually has nothing to do with the VM, or Xen really. VMEXIT_REASON_INIT means "an INIT IPI has arrived on this CPU at some point since your last VMEntry".

Do you have TXT enabled in firmware? The purpose of the VMEXIT_REASON_INIT comes from TXT and means "scrub your secrets from RAM, then reset". It's not supported by Xen, and if it were, you'd have a (hopefully clean) restart, rather than a resume. Also, in TXT, the rules for booting APs change.

@logoerthiner1
Author

logoerthiner1 commented Feb 20, 2022

The line of -EIO's to begin with means the APs aren't responding to an INIT-SIPI-SIPI sequence to start them up. Whatever the firmware has done (or not done), they're not in a working state.

The VM "crashing" actually has nothing to do with the VM, or Xen really. VMEXIT_REASON_INIT means "an INIT IPI has arrived on this CPU at some point since your last VMEntry".

Do you have TXT enabled in firmware? The purpose of the VMEXIT_REASON_INIT comes from TXT and means "scrub your secrets from RAM, then reset". It's not supported by Xen, and if it were, you'd have a (hopefully clean) restart, rather than a resume. Also, in TXT, the rules for booting APs change.

Finally, a Xen expert! Although I do not fully understand the details, I tried disabling Intel PTT in the BIOS, and something different happens: on wake-up, the power-button light blinks for a shorter time (the time spent trying to wake the other CPU cores), and then the computer immediately collapses (power light on, screen black, HDD halted, no response).

The stuck time is reduced.

Waking up does not write to the log file at all; I misinterpreted the logs before.

(I will try once more later.)

I tried a second time, and the computer appeared bricked - as if Xen had turned my boot procedure into a no-op. Unplugging the power cable brought the computer back to normal.

Anyway, after disabling Intel PTT in the BIOS, suspend-to-RAM broke completely: the computer does not even wake back into Qubes OS.

I believe that when I attempt to wake the computer, Xen first tries to bring up CPU1, then hits something fatal and reboots, which leaves the machine in a very bad state.

@xenixxx

xenixxx commented Feb 20, 2022

Try removing xscreensaver from dom0; if things then work, the problem is the locker. On my X230 and X1 Carbon 5th this happened; I installed light-locker, removed the xscreensaver package, and set general/LockCommand to "dm-tool switch-to-greeter".

@logoerthiner1
Author

Try removing xscreensaver from dom0; if things then work, the problem is the locker. On my X230 and X1 Carbon 5th this happened; I installed light-locker, removed the xscreensaver package, and set general/LockCommand to "dm-tool switch-to-greeter".

I would guess that Ctrl-Alt-F2 should always work if the problem were limited to xscreensaver. I have tried Ctrl-Alt-F2 and it does not work at all.

@xenixxx

xenixxx commented Feb 20, 2022

No, you can't switch to a console, because xscreensaver blocks this. Remove xscreensaver and check, then install another locker. I have been testing light-locker on an X230 for more than a year and it works without freezing Qubes. LockCommand must be set to dm-tool switch-to-greeter (xfconf-query -c xfce4-session -p /general/LockCommand -s "dm-tool switch-to-greeter" in dom0).

@logoerthiner1
Author

logoerthiner1 commented Feb 20, 2022

No, you can't switch to a console, because xscreensaver blocks this. Remove xscreensaver and check, then install another locker. I have been testing light-locker on an X230 for more than a year and it works without freezing Qubes. LockCommand must be set to dm-tool switch-to-greeter (xfconf-query -c xfce4-session -p /general/LockCommand -s "dm-tool switch-to-greeter" in dom0).

I do not think this would work, since I can hear the hardware resetting when I attempt to wake up; even a malfunctioning screen saver does not reset the machine and its hardware. By the way, are you talking about R4.1?

I may try changing the screen saver once I have run out of other options, since I think that removing packages from dom0 without proper planning is dangerous.

@xenixxx

xenixxx commented Feb 20, 2022

No, you can't switch to a console, because xscreensaver blocks this. Remove xscreensaver and check, then install another locker. I have been testing light-locker on an X230 for more than a year and it works without freezing Qubes. LockCommand must be set to dm-tool switch-to-greeter (xfconf-query -c xfce4-session -p /general/LockCommand -s "dm-tool switch-to-greeter" in dom0).

I do not think this would work, since I can hear the hardware resetting when I attempt to wake up; even a malfunctioning screen saver does not reset the machine and its hardware. By the way, are you talking about R4.1?

I may try changing the screen saver once I have run out of other options, since I think that removing packages from dom0 without proper planning is dangerous.

This error occurs in versions 4.0 and 4.1. I only found one solution: remove xscreensaver from dom0, install light-locker as the lock program, and set the lock command in XFCE4. The worst part is the randomness of this error: the laptop wakes up fine for a week, then freezes and doesn't come back up. After the change there was no problem anymore. It definitely occurs on Lenovo laptops; I have not checked others. It does not matter whether it is the coreboot BIOS on the X230 or the Lenovo BIOS on the X1 Carbon. The hardware may go through a random number of sleep/wake cycles, but it was sure to crash at some point. After changing to light-locker, everything has been working reliably for a year.

@xenixxx

xenixxx commented Feb 20, 2022

I may try changing the screen saver once I have run out of other options, since I think that removing packages from dom0 without proper planning is dangerous.

I don't know what is dangerous about it, in your opinion. I do not see a security issue in removing the xscreensaver package from Qubes.

@logoerthiner1
Author

I may try changing the screen saver once I have run out of other options, since I think that removing packages from dom0 without proper planning is dangerous.

I don't know what is dangerous about it, in your opinion. I do not see a security issue in removing the xscreensaver package from Qubes.

Removing any package in dom0 is a risky action, since dom0 may no longer boot afterwards (imagine accidentally uninstalling libc; Linux distributions install packages reliably, but uninstalling is less well tested and can cause problems). I would rather set the lock-screen command to empty in order to find out whether the screen locker is the culprit.

@xenixxx

xenixxx commented Feb 20, 2022

I may try changing the screen saver once I have run out of other options, since I think that removing packages from dom0 without proper planning is dangerous.

I don't know what is dangerous about it, in your opinion. I do not see a security issue in removing the xscreensaver package from Qubes.

Removing any package in dom0 is a risky action, since dom0 may no longer boot afterwards (imagine accidentally uninstalling libc; Linux distributions install packages reliably, but uninstalling is less well tested and can cause problems). I would rather set the lock-screen command to empty in order to find out whether the screen locker is the culprit.

That's funny; the xscreensaver package cannot and will not conflict with anything in dom0. Sure, you can keep looking for a solution: run xscreensaver -v -log /home/user/diaglog.txt and you might be able to diagnose it exactly. In my case, a few weekends of searching did not solve the problem.

@andyhhp

andyhhp commented Feb 20, 2022

@logoerthiner1 What hardware/vendor are you running on? I don't actually see this anywhere in the ticket. Your CPU is a TigerLake, and Intel no longer supports S3 on TigerLake, so this "Linux S3" option is probably an OEM addition.

Also, you said that Windows is fine with this. What about plain Linux?

@DemiMarie

@logoerthiner1 What hardware/vendor are you running on? I don't actually see this anywhere in the ticket. Your CPU is a TigerLake, and Intel no longer supports S3 on TigerLake, so this "Linux S3" option is probably an OEM addition.

Does this mean this is blocked on S0ix support in Xen? Is there any chance that Xen will support S0ix in combination with PCIe pass-through with not-fully-trusted guests?

@logoerthiner1
Author

logoerthiner1 commented Feb 20, 2022

@logoerthiner1 What hardware/vendor are you running on? I don't actually see this anywhere in the ticket. Your CPU is a TigerLake and Intel no longer support S3 on Tigerlake, so this "Linux S3" option is probably an OEM addition.

The device I am talking about is a ThinkPad L15 Gen 2 (this should also be visible in the debug info).

I updated the BIOS, and the updated BIOS provides a "Linux S3" option. This is the BIOS update documentation: https://download.lenovo.com/pccbbs/mobiles/r1juj10w.txt , where "Linux S3" can be found. Once I enabled "Linux S3", both Windows and Qubes OS are able to suspend as expected (the fan stops during Linux S3), but only Windows can wake up correctly.

Personally, I HATE S0ix, like many other people - how is S0ix different from just locking the screen, pausing every application, and turning off the monitor? (Actually, in Qubes OS I find that I can set the screen brightness to absolute zero, which may be deemed either a bug or a feature.)

Also, you said that Windows is fine with this. What about plain Linux?

I have not installed plain Linux there ... If this information is of great importance, I can try an Ubuntu boot medium (dd an Ubuntu 21.10 installer onto a USB drive, for example) or something similar. Any other thoughts on testing or debugging?

Also, it may be useful to test whether Windows works when I disable Intel PTT. I will test that later.

@marmarek
Member

marmarek commented Feb 20, 2022

I have not installed plain Linux there ... If this information is of great importance, I can try an Ubuntu boot medium (dd an Ubuntu 21.10 installer onto a USB drive, for example) or something similar. Any other thoughts on testing or debugging?

You can try plain Linux by... booting Qubes without Xen: in GRUB, drop the multiboot2 line, then replace the subsequent two module2 lines with linux and initrd. Obviously no VM will start, but it should let you test S3 with plain Linux.
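Schematically, the edited GRUB entry would look roughly like this (kernel/initramfs file names are machine-specific; the versions below are the ones mentioned later in this thread, and the elided options stand for whatever your original entry contains):

```
# Original Qubes entry (schematic):
#   multiboot2 /xen-4.14.3.gz ...
#   module2 /vmlinuz-5.10.96-1.fc32.qubes.x86_64 ...
#   module2 /initramfs-5.10.96-1.fc32.qubes.x86_64.img
#
# Xen-less variant: drop the multiboot2 line and boot Linux directly:
linux  /vmlinuz-5.10.96-1.fc32.qubes.x86_64 ...
initrd /initramfs-5.10.96-1.fc32.qubes.x86_64.img
```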

@marmarek
Member

marmarek commented Feb 20, 2022

I have a different system that behaves in a very similar way.

What about plain Linux?

It works here.

[   57.364586] PM: suspend entry (deep)
[   57.366885] Filesystems sync: 0.002 seconds
[   57.390458] Freezing user space processes ... (elapsed 0.001 seconds) done.
[   57.391909] OOM killer disabled.
[   57.391910] Freezing remaining freezable tasks ... (elapsed 0.000 seconds) done.
[   57.392894] printk: Suspending console(s) (use no_console_suspend to debug)
[   57.827843] PM: suspend devices took 0.435 seconds
[   57.864540] ACPI: EC: interrupt blocked
[   57.899168] ACPI: PM: Preparing to enter system sleep state S3
[   57.900093] ACPI: EC: event blocked
[   57.900094] ACPI: EC: EC stopped
[   57.900095] ACPI: PM: Saving platform NVS memory
[   57.900097] Disabling non-boot CPUs ...
[   57.901071] IRQ 137: no longer affine to CPU1
[   57.902105] smpboot: CPU 1 is now offline
[   57.904231] IRQ 138: no longer affine to CPU2
[   57.905260] smpboot: CPU 2 is now offline
[   57.906806] IRQ 139: no longer affine to CPU3
[   57.907828] smpboot: CPU 3 is now offline
[   57.909436] IRQ 140: no longer affine to CPU4
[   57.911189] smpboot: CPU 4 is now offline
[   57.912457] IRQ 141: no longer affine to CPU5
[   57.913469] smpboot: CPU 5 is now offline
[   57.914502] IRQ 142: no longer affine to CPU6
[   57.916231] smpboot: CPU 6 is now offline
[   57.917358] IRQ 143: no longer affine to CPU7
[   57.918371] smpboot: CPU 7 is now offline
[   57.925596] ACPI: PM: Low-level resume complete
[   57.925693] ACPI: EC: EC started
[   57.925693] ACPI: PM: Restoring platform NVS memory
[   57.926746] Enabling non-boot CPUs ...
[   57.926802] x86: Booting SMP configuration:
[   57.926803] smpboot: Booting Node 0 Processor 1 APIC 0x1
[   57.929006] CPU1 is up
[   57.929033] smpboot: Booting Node 0 Processor 2 APIC 0x2
[   57.929973] CPU2 is up
[   57.929993] smpboot: Booting Node 0 Processor 3 APIC 0x3
[   57.930902] CPU3 is up
[   57.930930] smpboot: Booting Node 0 Processor 4 APIC 0x4
[   57.932052] CPU4 is up
[   57.932071] smpboot: Booting Node 0 Processor 5 APIC 0x5
[   57.933033] CPU5 is up
[   57.933053] smpboot: Booting Node 0 Processor 6 APIC 0x6
[   57.934263] CPU6 is up
[   57.934282] smpboot: Booting Node 0 Processor 7 APIC 0x7
[   57.935311] CPU7 is up
[   57.939993] ACPI: PM: Waking up from system sleep state S3
[   57.942288] ACPI: EC: interrupt unblocked
[   57.967645] ACPI: EC: event unblocked
[   57.968000] usb usb1: root hub lost power or was reset
[   57.968005] usb usb2: root hub lost power or was reset
[   58.101944] iwlwifi 0000:00:14.3: RF_KILL bit toggled to enable radio.
[   58.114921] nvme nvme0: Shutdown timeout set to 10 seconds
[   58.116798] nvme nvme0: 8/0/0 default/read/poll queues
[   58.199513] usb 3-10: reset full-speed USB device number 4 using xhci_hcd
[   58.341506] PM: resume devices took 0.374 seconds
[   58.341801] OOM killer enabled.
[   58.341803] Restarting tasks ... done.

@logoerthiner1
Author

logoerthiner1 commented Feb 21, 2022

Also, it may be useful to test whether Windows works when I disable Intel PTT. I will test that later.

With PTT disabled, Windows also cannot wake from suspend, and the symptoms are the same (it cannot wake up; long-pressing the power button to power-cycle does not work; unplugging the AC and powering on sometimes recovers the system). Re-enabling PTT does not help either. Only after resetting my BIOS to the default configuration does suspension work on Windows again. The lesson is that Intel PTT is not something to disable when debugging suspension.

However, when I reset my BIOS to the default configuration, the Qubes OS entry disappeared from the boot order, and I cannot boot from the HDD that Qubes OS is installed on (the SSD holds Windows, the HDD holds Qubes OS; the SSD boots but the HDD does not; earlier I always booted Qubes OS via a dedicated boot entry named "Qubes OS", but resetting the BIOS removed it).

I may need to find a rescue disk and fix Qubes OS booting according to https://www.qubes-os.org/doc/uefi-troubleshooting/ .

Update 1: Fixed. The efibootmgr command in https://www.qubes-os.org/doc/uefi-troubleshooting/ is actually outdated, since Qubes OS now uses GRUB.

@logoerthiner1
Author

logoerthiner1 commented Feb 21, 2022

You can try plain Linux by... booting Qubes without Xen: in GRUB, drop the multiboot2 line, then replace the subsequent two module2 lines with linux and initrd. Obviously no VM will start, but it should let you test S3 with plain Linux.

Could you please explain the steps in detail? I am a newbie at hacking GRUB. @marmarek
I have tried:

  1. In GRUB, press E to edit the selected entry.
  2. Comment out the multiboot2 line and replace the module2 lines with linux /vmlinuz-5.10.96-1.fc32.qubes.x86_64 (with or without the trailing arguments) and initrd /initramfs-5.10.96-1.fc32.qubes.x86_64.img.
     Then I see dracut: FATAL: Cannot unbind PCI devices and dracut: Refusing to continue, and finally reboot: System halted.

@logoerthiner1
Author

logoerthiner1 commented Feb 21, 2022

Also, you said that Windows is fine with this. What about plain Linux?

I tried Ubuntu 21.10 from an ISO and found that Linux S3 works. @andyhhp

Another discovery: the wireless card also works in Ubuntu 21.10, even after suspending and waking up (suspension panics sys-net; Ubuntu 21.10 has kernel 5.13 while Qubes has kernel 5.15 - why would a 5.15 VM crash repeatably while a 5.13 environment is good and stable?)

@logoerthiner1
Author

For issue (1), I have tried the current-testing kernel-latest-qubes-vm and it is still crashing, so I will open another issue for it.

@marmarek
Member

It works here.

Furthermore, on plain Linux after S3 KVM still works.

@logoerthiner1
Author

It works here.

Furthermore, on plain Linux after S3 KVM still works.

Have you tested whether suspension works in Qubes R4.0? I installed R4.1 directly, and I cannot afford to install R4.0 just for testing - it would take too much time on my computer. Also, R4.0 does not have a live USB for testing.

@marmarek
Member

I can try, but I'm pretty sure it won't work - if it even manages to boot there.

@qubesos-bot

Automated announcement from builder-github

The package vmm-xen has been pushed to the r4.1 testing repository for the CentOS centos-stream8 template.
To test this update, please install it with the following command:

sudo yum update --enablerepo=qubes-vm-r4.1-current-testing

Changes included in this update

@qubesos-bot

Automated announcement from builder-github

The component vmm-xen (including package python3-xen-4.14.4-2.fc32) has been pushed to the r4.1 testing repository for dom0.
To test this update, please install it with the following command:

sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing

Changes included in this update

@qubesos-bot

Automated announcement from builder-github

The package xen_4.14.4-2 has been pushed to the r4.1 testing repository for the Debian template.
To test this update, first enable the testing repository in /etc/apt/sources.list.d/qubes-*.list by uncommenting the line containing buster-testing (or appropriate equivalent for your template version), then use the standard update command:

sudo apt-get update && sudo apt-get dist-upgrade

Changes included in this update

@andrewdavidwong andrewdavidwong removed the waiting for upstream This issue is waiting for something from an upstream project to arrive in Qubes. Remove when closed. label Mar 9, 2022
@qubesos-bot

Automated announcement from builder-github

The package vmm-xen has been pushed to the r4.1 stable repository for the CentOS centos-stream8 template.
To install this update, please use the standard update command:

sudo yum update

Changes included in this update

@qubesos-bot

Automated announcement from builder-github

The package xen_4.14.4-2+deb10u1 has been pushed to the r4.1 stable repository for the Debian template.
To install this update, please use the standard update command:

sudo apt-get update && sudo apt-get dist-upgrade

Changes included in this update

@qubesos-bot

Automated announcement from builder-github

The component vmm-xen (including package python3-xen-4.14.4-2.fc32) has been pushed to the r4.1 stable repository for dom0.
To install this update, please use the standard update command:

sudo qubes-dom0-update

Or update dom0 via Qubes Manager.

Changes included in this update

olafhering pushed a commit to olafhering/xen that referenced this issue Mar 27, 2022
The original shadow stack support has an error on S3 resume with very bizarre
fallout.  The BSP comes back up, but APs fail with:

  (XEN) Enabling non-boot CPUs ...
  (XEN) Stuck ??
  (XEN) Error bringing CPU1 up: -5

and then later (on at least two Intel TigerLake platforms), the next HVM vCPU
to be scheduled on the BSP dies with:

  (XEN) d1v0 Unexpected vmexit: reason 3
  (XEN) domain_crash called from vmx.c:4304
  (XEN) Domain 1 (vcpu#0) crashed on cpu#0:

The VMExit reason is EXIT_REASON_INIT, which has nothing to do with the
scheduled vCPU, and will be addressed in a subsequent patch.  It is a
consequence of the APs triple faulting.

The reason the APs triple fault is because we don't tear down the stacks on
suspend.  The idle/play_dead loop is killed in the middle of running, meaning
that the supervisor token is left busy.

On resume, SETSSBSY finds busy bit set, suffers #CP and triple faults because
the IDT isn't configured this early.

Rework the AP bring-up path to (re)create the supervisor token.  This ensures
the primary stack is non-busy before use.

Note: There are potential issues with the IST shadow stacks too, but fixing
      those is more involved.

Fixes: b60ab42 ("x86/shstk: Activate Supervisor Shadow Stacks")
Link: QubesOS/qubes-issues#7283
Reported-by: Thiner Logoer <[email protected]>
Reported-by: Marek Marczykowski-Górecki <[email protected]>
Signed-off-by: Andrew Cooper <[email protected]>
Tested-by: Thiner Logoer <[email protected]>
Tested-by: Marek Marczykowski-Górecki <[email protected]>
Reviewed-by: Jan Beulich <[email protected]>
(cherry picked from commit 7d95892)
olafhering pushed a commit to olafhering/xen that referenced this issue Mar 27, 2022
(same commit message as above)
olafhering pushed a commit to olafhering/xen that referenced this issue Mar 27, 2022
(same commit message as above)
andyhhp added a commit to andyhhp/xen that referenced this issue Mar 24, 2023
In VMX operation, the handling of INIT IPIs is changed.  Instead of the CPU
resetting, the next VMEntry fails with EXIT_REASON_INIT.  From the TXT spec,
the intent of this behaviour is so that an entity which cares can scrub
secrets from RAM before participating in an orderly shutdown.

Right now, Xen's behaviour is that when an INIT arrives, the HVM VM which
schedules next is killed (citing an unknown VMExit), *and* we ignore the INIT
and continue blindly onwards anyway.

This patch addresses only the first of these two problems by ignoring the INIT
and continuing without crashing the VM in question.

The second wants addressing too, just as soon as we've figured out something
better to do...

Discovered as collateral damage from when an AP triple faults on S3 resume on
Intel TigerLake platforms.

Link: QubesOS/qubes-issues#7283
Signed-off-by: Andrew Cooper <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
olafhering pushed a commit to olafhering/xen that referenced this issue Mar 31, 2023
(same commit message as above; master commit: b1f1127, master date: 2023-03-24 22:49:58 +0000)
olafhering pushed a commit to olafhering/xen that referenced this issue Mar 31, 2023
(same commit message as above; master commit: b1f1127, master date: 2023-03-24 22:49:58 +0000)
@andrewdavidwong andrewdavidwong added the affects-4.1 This issue affects Qubes OS 4.1. label Aug 8, 2023

8 participants