Occasionally unable to wake up from suspension to RAM #7340
Comments
I meant rare at datacenter scale, not "can be reliably repro'd on a single system". Also, the way it would go wrong would be a crash some point after resume, so there would be logging from Xen
If waiting 12h really does make the difference between it working and not, then this is almost certainly a Lenovo firmware bug. |
I recently reproduced the issue after 4-5 suspensions within one hour (while I was testing something else), so this bug does not seem to depend on a long suspend time. When I reproduced it, I later found that the CPU/fan was unreasonably hot; I am not sure whether that has anything to do with the issue. After inspecting the log files more carefully, I found something different.
For example, each suspend-resume loop shows up in the Xen log as lines 1-8. However, when the fatal suspension (the one that never wakes up) happens, Xen does not even append line (1) to its log file, let alone the later lines. The dom0 log has something related to that sleep:
I have observed similar logging patterns in the other fatal suspension cases. Are there any instructions for further investigation? Or any clue as to why Xen logs nothing about the fatal suspension? |
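The comparison described above (dom0 records a suspend attempt that Xen never logs) can be sketched as a small counting script. This is only a minimal sketch: the two marker strings are assumptions, since the actual log excerpts are not reproduced in this thread, and should be replaced with whatever your own Xen and dom0 logs print at the start of a suspend.

```python
from typing import Iterable

# ASSUMPTION: both marker strings are placeholders, not taken from this thread.
# Replace them with the first line your logs print for a suspend attempt.
XEN_SUSPEND_MARKER = "Preparing system for ACPI S3 state"
DOM0_SUSPEND_MARKER = "PM: suspend entry"


def count_marker(lines: Iterable[str], marker: str) -> int:
    """Count the log lines that contain the given marker substring."""
    return sum(1 for line in lines if marker in line)


def unlogged_suspends(xen_lines: Iterable[str],
                      dom0_lines: Iterable[str],
                      xen_marker: str = XEN_SUSPEND_MARKER,
                      dom0_marker: str = DOM0_SUSPEND_MARKER) -> int:
    """Return how many dom0 suspend attempts have no matching Xen log entry."""
    started_in_dom0 = count_marker(dom0_lines, dom0_marker)
    reached_xen = count_marker(xen_lines, xen_marker)
    # Never report a negative number if Xen logged more entries than dom0.
    return max(started_in_dom0 - reached_xen, 0)
```

A non-zero result only says that some suspend attempt never reached Xen's logging; it does not say which component is at fault.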
I can reproduce it on my TGL system, with totally different firmware. I have a serial console, and there are not many more details there. Xen stops responding to triple ctrl-a when it hangs. |
May I ask whether you have made any progress investigating the issue? @marmarek |
Update: I have updated with Since the bug appears only occasionally, I will check later whether the new update really fixes the issue. If the issue does not show up within a week or two, I will consider it solved. |
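Would you be willing to try kernel and/or firmware packages from `qubes-dom0-current-testing`? They are fully security supported, so you should not be making your system less secure by doing so. |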
[quote]
Would you be willing to try kernel and/or firmware packages from `qubes-dom0-current-testing`? They are fully security supported, so you should not be making your system less secure by doing so.
[/quote]
Is this true?
I don't know what "security supported" means, but surely the
point in having current-testing is to identify any issues (including
security issues) before packages hit current.
It's at least possible that packages in current-testing have introduced
security issues, no?
|
According to a past email from Marek a security vulnerability in current-testing would result in a QSB being issued, even if the vulnerable package has not hit stable yet. |
Yes. |
Quick update: this issue now occurs rarely, and the symptom has changed. Once recently, when I clicked "Suspend", the power light kept blinking forever. |
Latest experiment: I have tried S3 sleep on my machine in both Windows 10 and an Ubuntu 22.04 live ISO, and both work without any problem and are fairly stable. (Note: their suspension does not involve a blinking power light.) Qubes OS (where S3 sleep is supported in Xen) still occasionally fails (once in 10 or 20 times), either to suspend or to resume. Therefore it should be a Xen problem rather than a firmware problem.
I have even tried suspending right before logging in, at tty2. I keep Death can occur when suspending or when resuming. (1) When the machine dies while suspending, the power light keeps blinking forever, numlock does not work, and the machine does not respond to anything. This instability of S3 sleep in Qubes OS R4.1 has caused great pain through frequent LVM corruption, and has caused some data loss from unsaved work over the past half year. I sincerely hope that the developers can investigate further to solve the problem. @marmarek may I ask about the progress of the investigation of this bug? Can you reproduce this bug on other machines, with or without an Intel Tiger Lake CPU, by repeatedly suspending and resuming? Is it related to the new Xen version? Is there anything I can do to help with the bug? |
Please file a separate bug for this. LVM volumes should not be corrupted even if the power is yanked in the middle of an operation. |
I do not have a reproducible way of triggering the exact same behavior, nor do I want to try reproducing it on my working machine, considering I do not have a zoo of machines. However, similar bugs might need to be considered when thinking about the stability of Qubes OS under various hardware failures. I will try filing such "bugs". Update: the separate bug (if it is one) is #7800 |
This bug has bothered me for half a year. I can reliably trigger it every day, but I do not see any later discussion of the bug itself. May I ask what the current status of the bug is?
|
I don't have this issue on any other system; the TGL system I have (Framework) doesn't have any issues with S3 anymore. |
I have tried Xen 4.14.5-11 by manually downloading the xen-hypervisor rpm and putting its xen-4.14.5.gz onto No additional log is generated when the suspend problem happens, so I believe it is likely that the computer is trapped in a ring -1 self-loop or something like that. I wonder whether there are suggested ways of finding out what happened when the computer freezes, for example a watchdog residing in Xen that can generate a core dump for further analysis. Do you have any suggestions? @marmarek @andyhhp @DemiMarie |
This is either a Xen bug or a Linux kernel bug, but I am not familiar enough with low-level matters to go further. |
First, you probably want a version of Xen with the S3/timer issue fixed. @marmarek Do you have a build to hand? I'm not sure that will make a difference in this case, but let's rule things out one at a time. I expect this is a Linux bug. One thing does look curious. In the case that everything is wedged, Xen does (initially) respond to debugkeys. When using the 'd' key, we get
which in principle is fine - that looks like it hit a page boundary, and the adjacent page can indeed be unmapped. But moments later when using '0', we get
and at this point, you say Xen ceases responding to anything? It's certainly suspicious that it's the same vCPU that hit the page boundary. I know it's tangential to the issue you're trying to debug, but can you set up a watchdog (simply |
A second trace log for a hung resume, recovered by Xen's "print all diagnostics"
|
Yes, the xhci console does not respond to any keypress, but the console itself is on (not disconnected from the debuggee side).
Glad to learn about the "watchdog" parameter. I will give it a try. As the suspend issue can be very complex and composed of several separate issues, I will not be surprised if one of them is related to a locked-up Xen (actually this is what I have been suspecting). |
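For anyone following along, one way to confirm that the "watchdog" parameter actually reached the hypervisor command line is to inspect the `xen_commandline` field printed by `xl info` in dom0. The sketch below parses that output; the boolean handling is a simplified assumption (it treats a bare `watchdog` or any value other than a few "off" spellings as enabled) and may not match Xen's own parser in every case.

```python
# Values treated as "disabled"; a simplified assumption, not Xen's exact parser.
OFF_VALUES = {"0", "no", "off", "false", "disable"}


def xen_cmdline_from_xl_info(xl_info_text: str) -> str:
    """Extract the xen_commandline value from the text printed by `xl info`."""
    for line in xl_info_text.splitlines():
        key, separator, value = line.partition(":")
        if separator and key.strip() == "xen_commandline":
            return value.strip()
    return ""


def watchdog_enabled(cmdline: str) -> bool:
    """Return True if the last watchdog-related token enables the watchdog."""
    enabled = False
    for token in cmdline.split():
        name, _, value = token.partition("=")
        if name == "watchdog":
            enabled = value.lower() not in OFF_VALUES
        elif name == "no-watchdog":
            enabled = False
    return enabled
```

Usage would be along the lines of passing the captured output of `xl info` to `xen_cmdline_from_xl_info` and the result to `watchdog_enabled`; capturing the output (for example with `subprocess.run`) is left out so the sketch stays testable without a Xen host.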
@marmarek @andyhhp xen+dom0 log for a hanged dom0
|
TEST CONDITION: same with one core log
Then I suspend here, and the computer lags for a while, and then: after a while this appears
After a while I attempted triple ctrl-a, and it worked. Surprisingly, triple ctrl-a followed by '*' (print all diagnostics) resolved the kernel freeze and let the computer sleep. (Xen's "print all diagnostics" has since resolved the dom0 hang many times; I suspect this could be a hint towards a workaround or towards the cause of the soft lockup.) xen print all diagnostics solves the hang!?
Followups: When the computer resumes, the screen is on, but the computer does not respond to the keyboard. The xhci side can be switched between Xen and the kernel, and 'h' works, but on '*' Xen gets stuck, and the watchdog does not help either (the watchdog is always on, by the way). The HDD does not seem to be spinning: when I hard reset the machine, there is no sudden HDD retract sound (when Xen resets the machine, the HDD retracts suddenly).
Followup2: A similar issue happens on suspend/resume in the next boot. A dom0 kernel xen_safe_halt CPU soft lockup stack trace appears in dmesg (UPDATE: this trace appeared again, many times, with nearly the same stack trace in a later experiment): log
Followup3: After a resume-suspend, the machine locks up and is unable to perform the next resume; Xen fails to dump the culprit dom0 core either. log
|
By the way, here is a kernel trace that always appears on the first suspend of the machine. It may be worth fixing in order to eliminate the noise (this error always taints the kernel, but it does not seem to be a big issue). long existing kernel warning on first suspend&resume
|
When I use picocom, it sometimes automatically sends something through the console at the start (usually the first ~120 characters that picocom receives in that session), and the terminal becomes a mess. It is a disaster when the initial receiver of the console is Xen rather than dom0, since the data sent back is parsed as commands: Xen goes crazy and prints a lot of output, and finally, when it happens to parse an 'R', the computer halts. |
TEST CONDITION: same with one core python3 -c "while 1:pass" log, this time xen hangs while dom0 kernel is responsive
After a while (I executed log, continued, notice the timestamp
I wonder how the Xen command '0' works: it seems that sometimes, when dom0 hangs, '0' can resolve the hang. Followup: I played with the command '0' and it seemed to resolve the hang; I played with the command '0' again while dom0 was responsive, and then Xen hung with a real backtrace. death logs
And the computer does not reboot, even though Xen says it will reboot in 5 seconds. |
I tried for the whole day on kernel 6.0.2-2 but still failed to trigger the resume failure case (I only triggered the suspend failure). I switched to 5.15.74 and after a few (around ten?) tries I finally triggered the exact bug (I believe). Fortunately (is it?) the bug behaves similarly to the older cases: after a triple ctrl-a, Xen responds to key presses; however, when I press '0', Xen gets stuck on the dead dom0 core. On resume, Xen writes to the serial port, but dom0 is not woken up and produces no log. Detail: resume death, and I have tried xen com
and then the Xen serial port is stuck forever (it is not disconnected though; Xen only disconnects the xhci serial port when the machine suspends; when it is stuck, the port stays connected). The watchdog DOES NOT fire this time, even though I have checked in the log that I added it to the Xen command line parameters. Anyway, I am 90% confident that this problem is exactly what I am facing and what has been giving me a headache for the past half year. I will try once more on 5.15.74 to confirm this. Followup: I tried a second time and can confirm that the bug is reproducible and can be characterized. Later I will describe the bug concisely. Here is a better log; this time the Xen watchdog is working and the computer reboots. when using xen command '*' xen lock up on reading dom0 registers
|
Summary of the bug. The bug is complex and can be triggered in various ways.
TRIGGER: on my Thinkpad L15 Gen2, the bug can be triggered on 6.0.2 and 5.15.74, in various ways.
On 6.0.2, when I repeatedly suspend and resume, there is usually one time when the suspend fails and one Xen CPU (CPU i) runs as some dom0 vcpu (d0 vj) forever. When I type '*' in the Xen console, Xen tries to read the registers of the j-th dom0 vcpu and hangs there; this hang can usually be caught by the watchdog. Dom0 usually does not respond to anything. (When I do not run anything, this bug is hard to trigger, though not impossible. However, in real life dom0 usually runs various things, including xorg, guid, the audio daemon and storage drivers, and any of them could make the suspend problem more likely to appear.)
On 5.15.74, when I repeatedly suspend and resume, there is usually one time when the resume fails, with the same behavior (one Xen CPU runs as one dom0 vcpu forever). Dom0 usually does not respond to anything. This is the original issue I reported, and it should apply from a very early 5.15 kernel version up to this version.
On 6.0.2, when I run There are a great number of log files today along with tons of @-mentions; I am sorry if this bothers you. Let me know what more I can do. Later I will look at the stuck RIP of d0vj and try to find the call stack.
Update: I give up on this. The Qubes OS kernel does not ship a vmlinux, and even though kaslr is disabled globally, it is hard to get a readable kernel backtrace from a list of integers. I can barely make out that in both versions, the function that hogs the CPU is function backtraces of dom0 vcpu in xenkernel 6.0.2-2
kernel 5.15.74-2:
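On the problem of turning a list of raw return addresses into function names without a vmlinux: one common approach, when KASLR is disabled, is a nearest-preceding-symbol lookup in a symbol table of the `address type name` form used by `System.map` and `/proc/kallsyms`. The sketch below assumes such a table matching the exact running kernel is available; whether Qubes OS ships one for these kernel builds has not been verified here, and the addresses in the usage comment are made up.

```python
import bisect


def load_symbol_table(lines):
    """Parse 'address type name' lines into a list sorted by address."""
    symbols = []
    for line in lines:
        parts = line.split()
        if len(parts) >= 3:
            symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def symbolize(symbols, address):
    """Map an address to 'name+0xoffset' using the nearest preceding symbol."""
    addresses = [start for start, _ in symbols]
    index = bisect.bisect_right(addresses, address) - 1
    if index < 0:
        # The address lies below every known symbol; return it unchanged.
        return hex(address)
    start, name = symbols[index]
    return f"{name}+{address - start:#x}"


# Hypothetical usage (the path is an assumption):
#   with open("/boot/System.map-<version>") as handle:
#       table = load_symbol_table(handle)
#   print(symbolize(table, 0xffffffff81000010))
```

The lookup cannot tell whether an address falls past the end of the preceding function, so the results are hints rather than a verified backtrace.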
Update2: I happened upon another way to determine the backtrace. The lockup bug is easier to trigger on 5.15.74, so I turned to this version and ran suspend-resume until the lockup happened; then I used triple ctrl-a to get into Xen and tried various keys (virtually all keys except for '*', 'R', and '0'). One interesting key is 'N': it triggers an NMI. After pressing all the keys, I pressed '0', and was surprised to find that Xen did not lock up; instead the whole kernel log appeared. Basically, the xen cpu xen view of the cpu
the kernel dump kernel view of the cpu
This may indicate that whenever there is a lockup, I can first press 'N' (trigger an NMI) and then '0' (dump dom0 registers) to recover from it. (Why is '0' effective? It should be a plain function that prints information, so why can it unlock the CPUs?) Update3: I tried once more, and it shows that I can use '%' 'd' 'N' 'N' '0' in the Xen console to recover from this lockup (the lockup after resume; this does not apply to the other lockup that happens before sleeping). kernel view of the cpu
Update4: I tried once more and found that when the machine hangs on resuming, '0' by itself is enough. one '0' unlocks the kernel
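The key sequences tried above can be written down as a small helper that builds the bytes to send over the serial console: three Ctrl-A characters (0x01) to switch input to Xen, followed by the debug keys. This is a sketch only. It refuses 'R' because, as noted earlier in this thread, that key halts the machine, and the serial device path in the usage comment is hypothetical.

```python
CTRL_A = b"\x01"          # Ctrl-A; sent three times to switch console input to Xen
REFUSED_KEYS = {"R"}      # 'R' halted the machine earlier in this thread


def xen_key_sequence(keys, switch_input=True):
    """Build the byte string for a list of single-character Xen debug keys."""
    payload = bytearray()
    if switch_input:
        payload += CTRL_A * 3
    for key in keys:
        if len(key) != 1:
            raise ValueError(f"expected a single character, got {key!r}")
        if key in REFUSED_KEYS:
            raise ValueError(f"refusing to send {key!r}")
        payload += key.encode("ascii")
    return bytes(payload)


# Hypothetical usage; the device path is an assumption:
#   with open("/dev/ttyUSB0", "wb", buffering=0) as port:
#       port.write(xen_key_sequence(["%", "d", "N", "N", "0"]))
```

Building the bytes separately from writing them makes it easy to check what would be sent before touching a wedged machine.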
|
One more trace suspension failure log
|
hmm
says that the CPU is waiting for the sibling to call in. In which case this is perhaps more likely related to the Xen S3/timer issue recently debugged and fixed upstream. |
I would be glad to test the updated Xen blob if it arrives in time. However, it seems that the Xen watchdog did not only catch the backtrace you mentioned. For example (taken from the logs I posted earlier today):
or
|
I am experimenting with Xen 4.14.5-12. On kernel 5.15.74 the same problem happens: 5.15.74 resume lockup, solved with key '0'
and then the laptop's built-in keyboard does not respond. I tried running poweroff over the serial port, and then another function locked up in a similar way, this time during the shutdown process
kernel 6.0.2 locks up on resuming; when typing '0' the CPU gets unlocked and the computer becomes responsive
|
@andyhhp @marmarek I have tested it on 5.15.74 and 6.0.2, with or without the This patch seems very promising. I will test more in the coming days. |
How to file a helpful issue
Additional bug that triggers once about every week.
The original fix says that there are rare cases that it still does not work; however does that case match the problem I am facing?
Qubes OS release
R4.1
Brief summary
Suspension works most of the time on the Thinkpad L15 Gen 2 using the temporary patch in #7283, but it occasionally breaks (originally once a week, later about once in 10 times) by being unable to wake up. The power light may blink or stay lit, the screen is off, the fan is working, but the HDD is NOT spinning.
Later testing shows that it is unrelated to any appVM, but only to dom0 and Xen.
The same bug does not exist on an Ubuntu 22.04 LiveCD or on Windows.
old summary
Forcefully powering off the computer is possible, but the computer then does not boot when powered on for several minutes, after which it recovers and boots up.
As far as I can tell, ABSOLUTELY no logs are available in Xen or any VM.
The problem is similar to what happened when I attempted to disable Intel PTT (that time both Windows and Qubes OS did not wake up and were unable to boot for several minutes), except that it appears only occasionally and I cannot trigger it in Windows.
I have contacted Lenovo customer service and they say that it is related to either the graphics card driver or the power supply driver. I have installed the latest drivers under Windows, which may be the reason that Windows does not crash.
Update: each problem occurred when waking up more than 12 hours after suspending; I am not sure whether very frequent suspend and wake-up will cause the problem.
Steps to reproduce
old steps to reproduce
1. On Thinkpad L15 Gen 2, enable "Linux S3". In R4.1, apply the patch in #7283.
2. Suspend (S3) and wake up several times.
sudo systemctl suspend
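Since the failure shows up roughly once in 10 cycles, repeating the step above by hand is tedious. A minimal sketch of an automated loop follows. It arms an RTC wake alarm with `rtcwake -m no -s N` and then calls `systemctl suspend`; whether the RTC alarm actually wakes this machine under Xen is an assumption that has not been tested in this thread, and the default timings are arbitrary.

```python
import subprocess
import time


def cycle_commands(sleep_seconds):
    """Return the commands for one cycle: arm the RTC alarm, then suspend."""
    return [
        ["sudo", "rtcwake", "-m", "no", "-s", str(sleep_seconds)],
        ["sudo", "systemctl", "suspend"],
    ]


def run_cycles(count, sleep_seconds=60, settle_seconds=30,
               runner=subprocess.run, pause=time.sleep):
    """Run `count` suspend/resume cycles and return how many were started."""
    started = 0
    for _ in range(count):
        for command in cycle_commands(sleep_seconds):
            runner(command, check=True)
        started += 1
        # Wait long enough for the suspend and the alarm-triggered resume
        # to finish before arming the next cycle.
        pause(sleep_seconds + settle_seconds)
    return started
```

If the machine never comes back, the loop simply stops making progress, so it is worth printing or logging the cycle number before each suspend when running it for real.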
Expected behavior
Each time computer wakes up.
Actual behavior
old actual behavior
The computer wakes up many times, and then it does not wake up and needs to be powered off and on a number of times.
After an average of 10 successful resumes, the computer finally ends up in either:
both of which require me to hard shut down the computer by long-pressing the power button.
The freeze does not generate any meaningful log info. The content of the various log files is exactly the state before the machine suspends, without any suspicious failure events.
This behavior is reproduced on xen-hypervisor from 4.14.4-2 to 4.14.5-7 .
@andyhhp @marmarek Can you reproduce the issue?