Suspend to RAM randomly crashes VM in R4.1 #7283
Comments
An update on what I found. I no-op'ed the internal.SuspendPost logic (something like this, at /usr/lib/python3.8/qubes/api/internal.py) and most of internal.SuspendPre (keeping only the suspend part and removing the SuspendPre message broadcast), so that I could pause and unpause the VMs manually before and after suspension. Then I suspended and woke the machine. The result: waking up still takes too long, and after waking up (with all VMs paused except dom0), the hypervisor log in dom0 said things like:
Then, when I manually unpaused the VMs, the first VM I unpaused crashed in Xen. The hypervisor log looks like:
Pausing and unpausing VMs is not problematic at all when S3 sleep is not involved, so I believe Xen must have some problem dealing with Linux S3 sleep. I skimmed the Xen 4.14.4 source code and found that:
I still have no clue why the first resumed VM got killed at vmexit.
    if ( cpu_state == CPU_STATE_CALLIN )
    {
        /* number CPUs logically, starting from 1 (BSP is 0) */
        Dprintk("OK.\n");
        print_cpu_info(cpu);
        synchronize_tsc_master(cpu);
        Dprintk("CPU has booted.\n");
    }
    else if ( cpu_state == CPU_STATE_DEAD )
    {
        smp_rmb();
        rc = cpu_error;
    }
    else
    {
        boot_error = 1;
        smp_mb();
        if ( bootsym(trampoline_cpu_started) == 0xA5 )
            /* trampoline started but...? */
            printk("Stuck ??\n"); // <= THIS LINE
        else
            /* trampoline code not run */
            printk("Not responding.\n");
    }
So the other CPUs either timed out or did not correctly set the CPU state to CPU_STATE_CALLIN. Maybe it is better to let a Xen expert diagnose further. Again, my CPU is an i5-1135G7 and I enabled Linux S3 in my BIOS. |
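As a reading aid, the three-way branch in the quoted snippet can be restated as a small classifier. This is only a sketch in Python, not Xen code: the `CpuState` enum values and the function name `classify_bringup` are hypothetical stand-ins, and only the `0xA5` check and the message strings are taken from the snippet.

```python
from enum import Enum


class CpuState(Enum):
    # Hypothetical stand-ins for Xen's CPU_STATE_* constants.
    INIT = 0
    CALLIN = 1
    DEAD = 2


# Value the quoted snippet compares trampoline_cpu_started against.
TRAMPOLINE_STARTED_MAGIC = 0xA5


def classify_bringup(cpu_state, trampoline_cpu_started):
    """Return the outcome the quoted branch would report for one AP."""
    if cpu_state is CpuState.CALLIN:
        return "CPU has booted."
    if cpu_state is CpuState.DEAD:
        return "cpu_error"
    # Neither CALLIN nor DEAD: the AP never checked in.
    if trampoline_cpu_started == TRAMPOLINE_STARTED_MAGIC:
        return "Stuck ??"  # trampoline started, but the AP got no further
    return "Not responding."  # trampoline code did not run
```

Read this way, "Stuck ??" says the AP did start executing the trampoline but never reached the call-in state, which narrows the failure to the early part of AP bring-up.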
The VM "crashing" actually has nothing to do with the VM, or Xen really. VMEXIT_REASON_INIT means "an INIT IPI has arrived on this CPU at some point since your last VMEntry". Do you have TXT enabled in firmware? The purpose of the VMEXIT_REASON_INIT comes from TXT and means "scrub your secrets from RAM, then reset". It's not supported by Xen, and if it were, you'd have a (hopefully clean) restart, rather than a resume. Also, in TXT, the rules for booting APs change. |
Finally, a Xen expert! Although I do not completely understand the details, I tried disabling Intel PTT in the BIOS, and different things happen: on wake-up, the time the power button light spends blinking (the time spent attempting to wake the other CPU cores) is shorter, and then the computer immediately hangs (power light on, screen black, HDD halted, no response). Waking up does not write to the log file at all; I misunderstood the logs before. I tried a second time, and the computer bricked. It seemed as if Xen had overwritten my boot procedure with no-ops. Unplugging the power cable brought the computer back to normal. In any case, after I disabled Intel PTT in the BIOS, suspend-to-RAM broke completely, since the computer does not even wake up into Qubes OS. I believe that when I try to wake the computer, Xen first tries to wake CPU1, then hits something unexpected and reboots, which leaves the computer in a very dangerous state. |
Try removing xscreensaver from dom0; if that works, the problem is the locker. On my X230 and X1 Carbon 5th gen this also happened; I installed light-locker, removed the xscreensaver package, and then set general/LockCommand to "dm-tool switch-to-greeter". |
I would guess that Ctrl-Alt-F2 should always work if the problem is limited to xscreensaver. I have tried Ctrl-Alt-F2 and it does not work at all. |
No, you can't switch to a console because xscreensaver locks that. Remove xscreensaver and check, then install another locker. I have been testing light-locker on an X230 for more than a year and it works without freezing Qubes. LockCommand must be set to dm-tool switch-to-greeter (xfconf-query -c xfce4-session -p /general/LockCommand -s "dm-tool switch-to-greeter" in dom0). |
I do not think this would work, since I can hear the hardware resetting when I attempt to wake up; even a malfunctioning screen saver does not reset the machine and its hardware. By the way, are you talking about R4.1? I may try changing the screen saver when I have run out of other options, since I think that removing packages in dom0 without proper planning is dangerous. |
This error occurs in versions 4.0 and 4.1. I have found only one solution: remove xscreensaver from dom0, install light-locker as the lock program, and set the lock command in xfce4. The worst part is the randomness of this error. The laptop wakes up fine for a week, then freezes and does not come back. After that, there was no problem anymore. It definitely occurs on Lenovo laptops; I have not checked others. It does not matter whether the BIOS is coreboot on the X230 or Lenovo's BIOS on the X1 Carbon. The hardware may go through a random number of sleep/wake cycles, and it was certain to crash at some point. Since changing to light-locker, everything has been working steadily for a year. |
I don't know what you consider dangerous. I do not see a security issue in removing the xscreensaver package from Qubes. |
Removing any package in dom0 is dangerous, since dom0 may no longer boot (imagine you accidentally uninstalled libc; Linux distributions install packages reliably, but uninstalling is less tested and may lead to problems). I would rather set the lock screen command to empty in order to find out whether the screen locker is the culprit. |
That's funny. The xscreensaver package cannot and will not conflict in dom0. Sure, you can look for a solution. Run xscreensaver -v -log /home/user/diaglog.txt and you might be able to diagnose it exactly. In my case, several weekends of looking for the problem did not solve it. |
@logoerthiner1 What hardware/vendor are you running on? I don't actually see this anywhere in the ticket. Your CPU is a TigerLake and Intel no longer support S3 on Tigerlake, so this "Linux S3" option is probably an OEM addition. Also, you said that Windows is fine with this. What about plain Linux? |
Does this mean this is blocked on S0ix support in Xen? Is there any chance that Xen will support S0ix in combination with PCIe pass-through with not-fully-trusted guests? |
The device I am talking about is a ThinkPad L15 Gen 2 (this should be visible in the debug info). I have updated the BIOS, and the updated BIOS provides a "Linux S3" option. This is the BIOS update documentation: https://download.lenovo.com/pccbbs/mobiles/r1juj10w.txt , where "Linux S3" can be found. Once I enabled "Linux S3", both Windows and Qubes OS are able to suspend as expected (the fan is not running in Linux S3), but only Windows wakes up correctly. Personally I hate S0ix, like many other people: how is S0ix different from just locking the screen, pausing every application, and turning off the monitor? (Actually, in Qubes OS I find that I can set the screen brightness to absolute zero, which may be deemed either a bug or a feature.)
I have not installed plain Linux on it... If this information is important, I may try an Ubuntu boot medium (dd an Ubuntu 21.10 installer onto a USB drive, for example) or something similar. Any other thoughts on testing or debugging? It may also be useful to test whether Windows works when I disable Intel PTT. I will test that later. |
You can try plain Linux by... booting Qubes without Xen - in grub drop |
I have a different system that behaves in a very similar way.
It works here.
|
With PTT disabled, Windows also cannot wake up from suspension, and the symptoms are the same (it cannot wake up, and long-pressing the power button to power off and back on does not work; unplugging AC power and powering on sometimes recovers the system). Enabling PTT again does not help. Only when I reset my BIOS to the default configuration did suspension on Windows work again. The lesson is that disabling Intel PTT is not something to consider when troubleshooting suspension. However, when I reset my BIOS to the default configuration, the Qubes OS entry disappeared from the boot order and I could not boot from the HDD on which Qubes OS is installed (Windows is installed on the SSD and Qubes OS on the HDD; the SSD boots and the HDD does not; earlier I always booted Qubes OS through a dedicated boot entry named "Qubes OS", but it disappeared when I reset the BIOS). I may need to take time to find a rescue disk and fix Qubes OS booting according to https://www.qubes-os.org/doc/uefi-troubleshooting/ . Update1: Fixed. Actually the |
Could you please explain the steps in detail? I am new to hacking on GRUB. @marmarek
|
I tried Ubuntu 21.10 from the ISO and found that Linux S3 works. @andyhhp Another discovery is that the wireless card also works in Ubuntu 21.10, even after suspending and waking up (in Qubes, suspension panics sys-net; Ubuntu 21.10 has kernel 5.13 while Qubes has kernel 5.15, so why would a 5.15 VM crash repeatably while a 5.13 environment is stable?) |
For issue (1), I have tried kernel-latest-qubes-vm from current-testing and it is still crashing, so I will open another issue for it. |
Furthermore, on plain Linux, KVM still works after S3. |
Have you tested whether suspension works in Qubes R4.0? I installed R4.1 directly and cannot afford to install R4.0 only for testing; it would take too much time on my computer. Also, R4.0 does not have a live USB for testing. |
I can try, but I'm pretty sure it won't work, if it manages to even boot there. |
The original shadow stack support has an error on S3 resume with very bizarre fallout. The BSP comes back up, but APs fail with:

    (XEN) Enabling non-boot CPUs ...
    (XEN) Stuck ??
    (XEN) Error bringing CPU1 up: -5

and then later (on at least two Intel TigerLake platforms), the next HVM vCPU to be scheduled on the BSP dies with:

    (XEN) d1v0 Unexpected vmexit: reason 3
    (XEN) domain_crash called from vmx.c:4304
    (XEN) Domain 1 (vcpu#0) crashed on cpu#0:

The VMExit reason is EXIT_REASON_INIT, which has nothing to do with the scheduled vCPU, and will be addressed in a subsequent patch. It is a consequence of the APs triple faulting.

The reason the APs triple fault is because we don't tear down the stacks on suspend. The idle/play_dead loop is killed in the middle of running, meaning that the supervisor token is left busy.

On resume, SETSSBSY finds busy bit set, suffers #CP and triple faults because the IDT isn't configured this early.

Rework the AP bring-up path to (re)create the supervisor token. This ensures the primary stack is non-busy before use.

Note: There are potential issues with the IST shadow stacks too, but fixing those is more involved.

Fixes: b60ab42 ("x86/shstk: Activate Supervisor Shadow Stacks")
Link: QubesOS/qubes-issues#7283
Reported-by: Thiner Logoer
Reported-by: Marek Marczykowski-Górecki
Signed-off-by: Andrew Cooper
Tested-by: Thiner Logoer
Tested-by: Marek Marczykowski-Górecki
Reviewed-by: Jan Beulich
(cherry picked from commit 7d95892)
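The busy-token failure described in this commit message can be illustrated with a toy model. This is a sketch in Python and not Xen code: the `ShadowStack` class, its `token_busy` flag, and the function names are hypothetical, and a Python exception stands in for the #CP fault that, on a real AP this early in bring-up, escalates to a triple fault.

```python
class ControlProtectionFault(Exception):
    """Stands in for #CP; with no IDT set up yet it becomes a triple fault."""


class ShadowStack:
    """Toy model of a supervisor shadow stack with a single busy token."""

    def __init__(self):
        self.token_busy = False

    def setssbsy(self):
        # Models SETSSBSY: it faults if the token is already marked busy.
        if self.token_busy:
            raise ControlProtectionFault("supervisor token already busy")
        self.token_busy = True

    def recreate_token(self):
        # Models the fix: (re)create the token so it is non-busy before use.
        self.token_busy = False


def bring_up_ap(stack, recreate_token_first):
    """Return the message the BSP would end up printing for this AP."""
    if recreate_token_first:
        stack.recreate_token()
    try:
        stack.setssbsy()
    except ControlProtectionFault:
        return "Stuck ??"
    return "CPU has booted."


def suspend_resume_cycle(recreate_token_first):
    stack = ShadowStack()
    first_boot = bring_up_ap(stack, recreate_token_first)
    # S3 suspend kills the idle/play_dead loop mid-run, so nothing clears
    # the token: the stack object is reused as-is on resume.
    after_resume = bring_up_ap(stack, recreate_token_first)
    return first_boot, after_resume
```

In this model the first boot always succeeds, and only the resume path differs: without recreating the token the AP ends up "Stuck ??", and with it the AP boots again.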
In VMX operation, the handling of INIT IPIs is changed. Instead of the CPU resetting, the next VMEntry fails with EXIT_REASON_INIT. From the TXT spec, the intent of this behaviour is so that an entity which cares can scrub secrets from RAM before participating in an orderly shutdown.

Right now, Xen's behaviour is that when an INIT arrives, the HVM VM which schedules next is killed (citing an unknown VMExit), *and* we ignore the INIT and continue blindly onwards anyway.

This patch addresses only the first of these two problems by ignoring the INIT and continuing without crashing the VM in question.

The second wants addressing too, just as soon as we've figured out something better to do...

Discovered as collateral damage from when an AP triple faults on S3 resume on Intel TigerLake platforms.

Link: QubesOS/qubes-issues#7283
Signed-off-by: Andrew Cooper
Reviewed-by: Kevin Tian
master commit: b1f1127
master date: 2023-03-24 22:49:58 +0000
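The behaviour change described in this commit message can be sketched as a toy VMExit dispatcher. This is an illustration in Python, not the actual Xen handler: `handle_vmexit`, the `known_reasons` set, and the returned strings are hypothetical. The value 3 for `EXIT_REASON_INIT` follows the "Unexpected vmexit: reason 3" message quoted in this thread.

```python
EXIT_REASON_INIT = 3  # "Unexpected vmexit: reason 3" in the quoted log


def handle_vmexit(reason, known_reasons, ignore_init):
    """Decide what happens to the vCPU that happened to be scheduled.

    ignore_init=False models the old behaviour, ignore_init=True the
    patched one. known_reasons is a placeholder for every exit reason
    the hypervisor has a handler for.
    """
    if reason == EXIT_REASON_INIT and ignore_init:
        # Patched: the INIT has nothing to do with this vCPU, so carry on.
        return "resume guest"
    if reason in known_reasons:
        return "handled"
    # Old behaviour for INIT: treated as an unknown exit, domain is killed.
    return "domain_crash"
```

Under these assumptions an INIT exit crashes the domain when `ignore_init` is false and resumes the guest when it is true; in both cases the INIT itself is still ignored, which is the second problem the commit message leaves open.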
I installed R4.1 on my new ThinkPad L15 Gen 2 and figured out how to enable Linux S3 sleep. Linux S3 sleep works on Windows, but on Qubes OS it ran into many problems (actually, I was surprised to see that suspend and wake-up partially work).
Basically, suspend (1) immediately panics my sys-net because of a NULL pointer dereference in the MT7921 driver in kernel-latest (5.15.14-1.fc32), (2) randomly kills some of my running AppVMs, and (3) causes a long lag when waking up (~15 s from pressing the power button to the xscreensaver screen, with the power button light blinking as if it were still suspending).
I would like to try handling (1) myself, and (3) does not seem easy to figure out, so this issue focuses on (2).
It might be related to #6411 since my CPU is i5-1135G7.
Qubes OS release
R4.1
Brief summary
Suspend causes Xen to randomly shut down running VMs:
Steps to reproduce
Suspend, then wake up.
Expected behavior
After waking up, all VMs that were running before suspension are still running.
Actual behavior
Some random VMs are shut down. It seems that the VMs are shut down on wake-up, but I cannot rule out that the crash happens during suspend.
Log files for reference
https://pastebin.mozilla.org/84hYCS5v
Edit
I tried disabling Hyper-Threading; it only cuts the lag in (3) in half, and VMs still exit randomly.
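For anyone checking whether their machine shows the same signature, the hypervisor messages quoted in this thread can be searched for mechanically. The following is a minimal sketch in Python, assuming the hypervisor log has already been captured as text (for example with `xl dmesg` in dom0); the function name `scan_hypervisor_log` and the pattern labels are hypothetical, and the patterns cover only the messages quoted here.

```python
import re

# Patterns for the (XEN) messages quoted in this thread; labels are made up.
PATTERNS = {
    "ap_stuck": re.compile(r"\(XEN\) Stuck \?\?"),
    "ap_bringup_error": re.compile(
        r"\(XEN\) Error bringing CPU(\d+) up: (-?\d+)"
    ),
    "unexpected_vmexit": re.compile(
        r"\(XEN\) (d\d+v\d+) Unexpected vmexit: reason (\d+)"
    ),
}


def scan_hypervisor_log(text):
    """Return (label, captured_groups) for every matching log line, in order."""
    findings = []
    for line in text.splitlines():
        for label, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                findings.append((label, match.groups()))
    return findings
```

An AP bring-up failure followed by an unexpected vmexit with reason 3 would match the sequence described in this issue; any other combination is outside what this sketch covers.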