Screen does not wake up after resume (AMD Ryzen 7 Pro 4750U) #6923
Comments
The Xen processor (-19) from ACPI errors go away if I boot the kernel with nosmt, obviously. In the console with lightdm never started it can survive at least 5-6 suspend-resume-cycles now. Now compiling the kernel with |
There is a problem with installing xorg-x11-drv-amdgpu: X won't start, with errors about missing unwind information. I tried installing kernel-devel to make the amdgpu driver happy, but it did not work out. |
Compiling the kernel without the flags mentioned above, I managed to suspend/resume a lot longer. With X running it still dies on 'failed to terminate hdcp ta', though. I still can't get the Xorg amdgpu driver to work, even when booting older kernels. |
For those wondering how to build xen, here is my builder.conf.
|
amdgpu xorg driver now works with xorg-x11-drv-amdgpu-21.0.0-1 (https://fedora.pkgs.org/33/fedora-updates-x86_64/xorg-x11-drv-amdgpu-21.0.0-1.fc33.x86_64.rpm.html), not stable during suspend/resume or removing jitter after resume. |
Thanks for the help isodude.
|
with smt on:
|
@johnnyboy-3 do you have xorg-x11-drv-amdgpu installed? |
xorg-x11-drv-amdgpu v19.1.0-3 installed. Also tried Linux Kernel 5.14.9-1 with the same bug.
|
@johnnyboy-3 the correct kernel parameter should be nosmt btw. It's odd that xen_acpi_processor tries to send updates to XEN on thread number 2 on each processor, even though the kernel is booted with nosmt. It even says |
I think the kernel doesn't have full knowledge of which thread is running where; only Xen has direct access to that info. In fact, vCPU 2 of dom0 doesn't necessarily run on physical core/thread 2. This also means the "nosmt" kernel option is not an effective mitigation against speculative execution bugs when running under Xen. |
cool, so like a normal VM then. So like xen_acpi_processor trying to send up information about 16 cores can just be ignored. Trying to understand and pin down exactly what makes the amdgpu drivers flip the switch and die on me when resuming, not sure which avenues are best to visit any longer in the debugging hunt. |
I just tried out kernel 5.15-rc5 and the behavior is still the same; however, I had the laptop asleep for the whole night and it woke up fine. There are still occasional artifacts around text when it is written to the screen. I did make one change, though: I moved ati_drv.so out of /usr/lib64/xorg/modules/drivers, and I feel that Xorg behaves much better now, even though I can't see any direct differences in Xorg.0.log. I managed to suspend/resume a solid three times before the amdgpu driver gave up on the SETUP_TMR command (which is now written out in the log thanks to the newer kernel). Just a note: one thing I'm concerned about is that I need to revert (PCI/MSI: Use new mask/unmask functions). Somewhere between 5.15-rc1 and 5.15-rc2 it was fixed, but between rc2 and rc4 it broke again. I do have to bisect this. Since amdgpu dies hard on this, maybe it's a bug in their driver that just surfaces with the new mask/unmask functions. The error that the kernel dies on this time is
I'm not sure whether this is the culprit, or whether amdgpu just sometimes fails to load firmware on resume; I've seen HDCP fail as well. I've tried to unload TMR (Trusted Memory Region) by setting CONFIG_AMD_MEM_ENCRYPT=n. TBH I don't know what Xen's standpoint is on those features, maybe @marmarek knows? But in general the kernel dies on HDCP and TMR. |
Yay, latest kernel-ark with
booting with kernel options Now it actually suspends/resumes correctly. Attached is lspci -vv with Enable+ selected. |
Tried dom0 linux kernel 5.10.61 recompilation with mentioned kernel & boot options on R4.1 - no luck. |
@johnnyboy-3 I guess you need to be past the new MSI mask/unmask patches (somewhere between 5.14 and 5.15). I tried 5.12.14 and it was a no-go there. I did manage to get a crash, around the 10th to 15th resume, pretty much when the USB ports were reset. It feels like the problem may be in how USB is handled. I try to ignore 02:00.4 (the USB ports in the expansion port), but I lack the expertise to tell Xen to just ignore them. Soon I'll rip out ehci from the kernel :) Looking at my lspci log, it seems that xhci and ehci got MSI disabled, but not the other AMD PCI devices. |
15h sleep with 0.277Wh, that's pretty solid for S3!
These were set but I don't think they make any difference.
Text jitter is almost completely gone compared to before. I am going to compile 5.14.9 and see how well that fares with CONFIG_AMD_PMC=y CONFIG_HSA_AMD=n, because there's no need for disabling MSI. Then I'm going to bisect the problems with MSI in 5.15. |
That's some good news!
Thanks for your offer but I don't think that's necessary for now. |
5.14.9 doesn't work that well out of the box. With pci=nomsi it's quirky (external screen dies sometimes, internal screen dies sometimes), but I've suspended/resumed at least 10 times now without a reboot. Not sure how well it works in 5.15 with pci=nomsi, though. This is 5.14.9 (latest qubes-linux-kernel) with
Will try to get tip booted without pci=nomsi now, that should be fun! |
I'm pessimistic! There are a lot of changes between those kernels and the new ones. |
With some patches in msi drivers I got kernel 5.15 working. X is restarting once in a while, but that's fine since X running inside VMs survive :) I guess that relates to my hacked up X amdgpu drivers. |
Progress, yeah! ^^
Are you running a clean R4.1 RC1, or did you add/change anything besides the modified kernel 5.15 and MSI drivers? The kernel is self-compiled with CONFIG_AMD_PMC=y and CONFIG_HSA_AMD=n, right? Which MSI patches? Anything else? I tried RC1 out of the box and with kernel-latest 5.14.10 (testing), but got the same issue as before, just to be sure ^^ |
builder.conf:
I don't get how I should make
unpack it, rename the folder to linux-5.15-rc5, pack it again as .tar. I'm compiling the kernel now to see if it really works with what I committed. It's quirky right now, but I haven't had to reboot the system yet. |
@mcku nice find, I'll try out those patches for sure! Seems like I already had them :) |
After updating dom0 to the current stable, after casual use and suspend/resume, the screen was blank, and the device was not responsive. These are the logs that might be of interest: This one began to appear after upgrading on Feb 10:
and multiple error lines like this:
|
Today I was able to test with the tb3 dock. Unfortunately, external displays do not work with kernel version 6.1.7.
Downgraded to 6.0.12. |
Upgraded the kernel to 6.1.12 from qubes testing repo. External screens through the thunderbolt dock are functional now, but the graphics acceleration got disabled. |
On Mon, Feb 20, 2023 at 03:10:22AM -0800, Mustafa Kuscu wrote:
Upgraded the kernel to 6.1.12 from qubes testing repo. External screens through the thunderbolt dock are functional now, but the graphics acceleration got disabled.
I had problems with 6.1.12 with the external displays going away after
screenblanking due to inactivity.
|
On Mon, Feb 20, 2023 at 12:42:56PM +0000, Holger Levsen wrote:
I had problems with 6.1.12 with the external displays going away after
screenblanking due to inactivity.
Apologies, I meant 6.1.11; I haven't tried 6.1.12 yet.
|
Suspend works properly with xen-4.14.5-19 + kernel-latest-6.1.12-1 + linux-firmware-20230123-135 (haven't tested earlier). I'm running on an external screen (32" 4K) and it works. Would be glad if someone could replicate. If this works right now we could at least mark this issue as solved, and then figure out which commit is the good one. Please upgrade one thing at the time if you try to replicate. I will not do any further research so we have one working system at least :) Here's my journal from suspend (journalctl -b -k -p warning -o cat)
|
On Tue, Mar 14, 2023 at 10:57:01PM -0700, Josef Johansson wrote:
Suspend works properly with xen-4.14.5-19 + kernel-latest-6.1.12-1 + linux-firmware-20230123-135 (haven't tested earlier).
I only have linux-firmware-20230117-146, where did you get the newer firmwares from?
I ran glxgears in the background without a hitch.
nice.
|
Just download them from the git repo
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/
or build the package from qubes; the package basically takes upstream files and splits them into separate packages. These renoir files are the interesting ones:
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/log/amdgpu/renoir_ta.bin
|
I have been using the firmware tarball dated 20230310. Made sure it went through the initramfs. Without external monitors, everything is mostly fine, without any apparent lockups. With external displays connected (through TB dock) things are similar to my earlier experience, with the feeling that the failures are less frequent. Specifically,
Also I've managed to observe some unresponsive state a few times when playing with the various conditions, such as disconnecting TB cord when the laptop is sleeping and then resuming. |
I've realized that a firmware update was available for the dock, after applying it, the external button issue went away. |
Fresh install on a P14s gen 2, suspend worked properly. A colleague tried on a P14s gen 1, and it worked as well. Will do some more testing, but it's looking good. |
I agree. For me it's okay to close this issue as you have suggested earlier. |
On Wed, 29 Mar 2023 at 22:35, Mustafa Kuscu ***@***.***> wrote:
Will do some more testing, but it's looking good.
I agree. For me it's okay to close this issue as you have suggested
earlier.
I think we should find out if it's possible to backport though. But I must
say that we could very well close the issue and handle the backport in
another issue.
|
Hi, I don't know what stopped me from achieving the same results, but I'm running on xen 4.14.5-20, kernel 6.2.6 and self-built linux-firmware 20230310, and amd-gpu-firmware 20230310. I also did run
I'm not getting
but I'm getting occasional
and
What could I have done wrong? Thanks! Update: With kernel 6.1.12, I'm getting slightly different error:
|
Hi
I was receiving this exact one on Mar 27 and earlier. Not happened since then. (I've upgraded linux firmware on 21st and the TB dock firmware yesterday. )
I am still getting these above, latest is today, ended with a lockup. But this only happens if I fiddle with the thunderbolt connection or dock's button during sleep.
I don't receive these errors.
I don't receive this either. |
Maybe it's a low probability incident. Low enough that might not happen for days. Sometimes I need 30 or 40 suspends to get that error. Sometimes it happens on the first suspend after reboot.
It's weird that I never see this error. Maybe that's why I get that TMR failed error. |
I found a failed suspend test with kernel 6.2.6 on an AMD system at https://openqa.qubes-os.org/tests/69638. I'm wondering if the TMR failure has actually been resolved by newer firmware? |
Increasing the timeout from 20000 to 200000 seems to provide stable suspend/resume for me. Now I think that this issue can be marked as resolved, too. If the timeout issue is specific to low-end AMD chips that cannot resume psp in 20000 us, I feel confident that I can handle that myself. |
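To illustrate why a tenfold longer timeout can help: code that polls a device for readiness gives up once its time budget is spent, even if the device would have answered shortly afterwards. The following is a minimal sketch of that pattern, not the amdgpu driver code; the function names and the simulated 50000 us resume time are assumptions made up for the example.

```python
def wait_until_ready(is_ready, timeout_us, step_us=1000):
    """Poll is_ready(elapsed_us) every step_us.

    Returns the elapsed time in microseconds once the device is ready,
    or None if the timeout budget runs out first.
    """
    elapsed = 0
    while elapsed <= timeout_us:
        if is_ready(elapsed):
            return elapsed
        elapsed += step_us
    return None


def slow_device(ready_after_us):
    """Hypothetical device that reports ready once ready_after_us has passed."""
    return lambda elapsed_us: elapsed_us >= ready_after_us


# A device that needs 50000 us (an assumed figure) misses a 20000 us budget
# but fits comfortably inside a 200000 us one.
device = slow_device(50000)
short_budget = wait_until_ready(device, 20000)
long_budget = wait_until_ready(device, 200000)
```

With these numbers, `short_budget` is `None` while `long_budget` holds the elapsed time, which matches the observation that only the longer timeout survives a slow resume.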
I have issues with PVM VMs hanging on resume, I'll check if increasing timeout fixes anything |
|
It does indeed solve it. |
I'm going to do something crazy now. With the latest kernel, suspend/resume now works on 2 laptops here, including the original laptop which started this issue. The glitching issues are gone in the terminal as well. I'm writing this with an external 4K monitor connected. As time drags on, issues like this tend to stay unsolved, but as they stay unsolved a certain familiarity starts to arise: my laptop cannot suspend. I have to learn a new way of living, poweroff each time. Which over time becomes ingrained in my way of life. The new me, the poweroff guy. I had low watermarks where I tried to switch from Qubes to something that just works, but it felt off. Nothing's like the feeling of disp-vms all over the place and having to copy paste twice all the time. Debugging new kernels. Committing fixes to the kernel. Reaching out to unknown people within the Xen and Kernel community. Learning about the awesome people at AMD who dedicate their time to contributions to the Kernel and to dealing with all the gamers who try out AMD on Linux. This suspend/resume issue actually got me into kernel hacking, and I think it makes quite a story. My SO has shaken her head so many times about me throwing so much time at it. It's fun, I say. I learn! I do not know what lies ahead, but I must say that I appreciate the work of the Qubes people keeping track of, fixing and maintaining all the pipelines. Especially a big shout out to @marmarek with his infinite wisdom :) And the joy of seeing separate people just solving odd things (like the SecureBoot/TPM). In some ways the world is a pretty dark place, but these kinds of things make me smile. @SurFlurer thanks for replicating the fixes! See you in the next issue.
|
This update has apparently solved all of the issues I was experiencing. Even the power button on the external dock can resume the laptop while the laptop's lid is closed (as expected). Wow! I would like to thank those who have fixed this once more. |
Solved as of
linux-firmware-20230123-135.fc32.noarch
xen-4.14.5-20.fc32.x86_64
kernel-latest-6.2.10-1.qubes.fc32.x86_64
Qubes OS release
R4.1,
kernel 5.14.7-1 (fedora 5.14) (same behavior in lower kernels.)
XEN 4.14.3 (build from @marmarek branch)
Brief summary
Laptop does not resume after the third sleep/resume cycle.
The problem seems to be with
It feels like there's a hung process in the amdgpu drivers for some reason.
Not sure how to debug this properly, XEN is not giving me much info at all.
The problem is visible with X started as well obviously but I try to make the bug surface smaller.
Steps to reproduce
Boot laptop with X disabled, no VMs started.
Run systemctl suspend three times, resuming after each.
Run reboot to restore the system.
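The steps above can be scripted so that a long run of cycles needs no manual wake-up. This is a minimal sketch, assuming `rtcwake` from util-linux is available in dom0 and that an RTC alarm can wake the machine under Xen; the original report resumes by hand, and the function names, cycle count and delays here are illustrative only.

```python
import subprocess
import time
from typing import List


def cycle_commands(sleep_seconds: int) -> List[List[str]]:
    """Commands for one cycle: arm an RTC wake alarm, then suspend."""
    return [
        # Assumption: rtcwake exists in dom0; "-m no" only sets the alarm.
        ["rtcwake", "-m", "no", "-s", str(sleep_seconds)],
        # The command used in the steps to reproduce.
        ["systemctl", "suspend"],
    ]


def run_cycles(cycles, sleep_seconds=30, settle_seconds=60.0,
               runner=subprocess.run, pause=time.sleep):
    """Run suspend/resume cycles; return how many ran without a command error."""
    completed = 0
    for _ in range(cycles):
        for command in cycle_commands(sleep_seconds):
            if runner(command).returncode != 0:
                return completed
        # Let the machine finish resuming before starting the next cycle.
        pause(settle_seconds)
        completed += 1
    return completed
```

Calling `run_cycles(5)` as root in dom0 would attempt five cycles. A hang on resume, as described in this report, will not show up as a return value: the script simply stops making progress, so the count of completed cycles has to be read from the last output or log entry before the hang.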
Expected behavior
Possible to suspend and resume without limit.
Actual behavior
Screen does not wake up on third resume. It's possible to write reboot and restart.
Notes
Works well with kernel booted without XEN.
crash.filtered.log
crash.filtered.xen.log
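When going through logs like the ones attached, it can help to pull out only the lines that carry the failure messages quoted in this thread ('failed to terminate hdcp ta' and SETUP_TMR). A minimal sketch, assuming plain-text log lines as input; the function name is made up for illustration, and real kernel messages will carry extra prefixes around these substrings.

```python
from typing import Dict, Iterable, List

# Substrings quoted in this thread; matching is case-insensitive.
SIGNATURES = ("failed to terminate hdcp ta", "SETUP_TMR")


def find_signatures(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group log lines by the failure signature they contain."""
    hits = {signature: [] for signature in SIGNATURES}
    for line in lines:
        lowered = line.lower()
        for signature in SIGNATURES:
            if signature.lower() in lowered:
                hits[signature].append(line.rstrip("\n"))
    return hits
```

The input can be any iterable of lines, for example an open file handle on crash.filtered.log or the captured output of journalctl -b -k -p warning -o cat.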
Workarounds
A bit more testing is needed but I do have sort of stable suspend/resume now. It even survives when everything goes south.
There's a bit of tearing, but I'd rather have suspend than tearing.
Compile xorg-x11-drv-amdgpu from https://github.com/freedesktop/xorg-xf86-video-amdgpu
Run make install and install amdgpu_drv.so in /usr/lib64/xorg/modules/drivers on dom0.
For more stability run with kernel cmdline preempt=none
Do note that e.g. 4k external screen will be royally sluggish.
Sometimes the screen turns up black, type in the password anyhow and switch to tty2 and back again / suspend-resume again and it will most likely come to life again. Suspend/resume too fast could lead to instant reboot.
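Since several boot options come up in this issue (preempt=none, pci=nomsi, nosmt), it is worth confirming which ones the running dom0 kernel actually received. A minimal sketch that parses /proc/cmdline; the function names are illustrative, and the naive whitespace split is an assumption that ignores quoted values containing spaces.

```python
from typing import Dict, Optional


def parse_cmdline(text: str) -> Dict[str, Optional[str]]:
    """Split a kernel command line into {option: value}.

    Bare flags such as nosmt map to None. Quoted values with spaces
    are not handled by this simple split.
    """
    options = {}
    for token in text.split():
        key, separator, value = token.partition("=")
        options[key] = value if separator else None
    return options


def read_cmdline(path: str = "/proc/cmdline") -> Dict[str, Optional[str]]:
    """Parse the command line of the currently running kernel."""
    with open(path) as handle:
        return parse_cmdline(handle.read())
```

For example, `read_cmdline().get("preempt")` returns `"none"` when the workaround above is active, and `"nosmt" in read_cmdline()` tells whether that flag was passed.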