Screen does not wake up after resume (AMD Ryzen 7 Pro 4750U) #6923

Closed
isodude opened this issue Sep 30, 2021 · 233 comments
Labels
affects-4.1 This issue affects Qubes OS 4.1. C: power management C: Xen diagnosed Technical diagnosis has been performed (see issue comments). hardware support P: major Priority: major. Between "default" and "critical" in severity. T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists.

Comments

@isodude

isodude commented Sep 30, 2021

Solved as of

linux-firmware-20230123-135.fc32.noarch
xen-4.14.5-20.fc32.x86_64
kernel-latest-6.2.10-1.qubes.fc32.x86_64

Qubes OS release

R4.1,
kernel 5.14.7-1 (fedora 5.14) (same behavior in lower kernels.)
XEN 4.14.3 (build from @marmarek branch)

Brief summary

The laptop does not resume after the third sleep/resume cycle.
The problem seems to be with

[drm] psp command (0x7) failed and response status is (0xFFFF0007)
[drm:psp_hw_start [amdgpu]] *ERROR* PSP load tmr failed!

It feels like there's a hung process in the amdgpu drivers for some reason.

Not sure how to debug this properly, XEN is not giving me much info at all.
The problem also occurs with X started, but I am trying to keep the reproduction surface as small as possible.
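
To make failures like the one quoted above easier to spot across several suspend cycles, here is a minimal sketch that pulls the PSP command ID and response status out of a saved kernel log. It is only an illustration: the function name find_psp_failures is hypothetical, and the pattern assumes the two message shapes that appear in this thread (with and without a command name such as SETUP_TMR).

```python
import re

# Matches "[drm] psp command (0x7) failed ..." as well as the newer
# "[drm] psp gfx command SETUP_TMR(0x5) failed ..." form quoted in this thread.
PSP_FAILURE = re.compile(
    r"psp (?:gfx )?command (?P<name>[A-Z_]+)?\((?P<cmd>0x[0-9A-Fa-f]+)\) failed "
    r"and response status is \((?P<status>0x[0-9A-Fa-f]+)\)"
)


def find_psp_failures(log_text):
    """Return (command_name_or_None, command_id, status) for each PSP failure line."""
    failures = []
    for line in log_text.splitlines():
        match = PSP_FAILURE.search(line)
        if match:
            failures.append(
                (
                    match.group("name"),
                    int(match.group("cmd"), 16),
                    int(match.group("status"), 16),
                )
            )
    return failures
```

Feeding it the output of journalctl -k (saved to a file) after each resume should show whether the same command fails every time.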

Steps to reproduce

Boot laptop with X disabled, no VMs started.
Run systemctl suspend three times, resuming after each suspend.
run reboot to restore system

Expected behavior

It should be possible to suspend and resume an unlimited number of times.

Actual behavior

The screen does not wake up on the third resume. It is still possible to type reboot and restart the machine.

Notes

Works well with kernel booted without XEN.
crash.filtered.log
crash.filtered.xen.log

Workarounds

A bit more testing is needed but I do have sort of stable suspend/resume now. It even survives when everything goes south.
There's a bit of tearing, but I'd rather have suspend than tearing.

cat << EOF > /etc/X11/xorg.conf.d/50-video.conf
Section "Device"
	Identifier "card0"
	Driver "amdgpu"
	Option "AccelMethod" "none"
EndSection
EOF

Compile xorg-x11-drv-amdgpu from https://github.com/freedesktop/xorg-xf86-video-amdgpu
Run make install and install amdgpu_drv.so in /usr/lib64/xorg/modules/drivers on dom0.

For more stability run with kernel cmdline preempt=none

Do note that e.g. 4k external screen will be royally sluggish.

Sometimes the screen comes up black. Type in the password anyway, then switch to tty2 and back, or suspend and resume again, and it will most likely come back to life. Suspending and resuming too quickly can lead to an instant reboot.

@isodude isodude added P: default Priority: default. Default priority for new issues, to be replaced given sufficient information. T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists. labels Sep 30, 2021
@isodude
Author

isodude commented Sep 30, 2021

The xen_acpi_processor "Hypervisor error (-19)" ACPI messages go away if I boot the kernel with nosmt, as expected.

In the console with lightdm never started it can survive at least 5-6 suspend-resume-cycles now.

Now compiling the kernel with
CONFIG_DRM_AMD_DC_HDCP=n
CONFIG_HSA_AMD_SVM=n
CONFIG_AMD_MEM_ENCRYPT=n
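
When rebuilding kernels repeatedly it is easy to lose track of which options actually ended up in the build. A small sketch for checking a kernel .config follows; it assumes the usual format where enabled options read CONFIG_X=y or CONFIG_X=m and disabled ones appear as "# CONFIG_X is not set". The helper name enabled_options is hypothetical, and the second option is assumed to be CONFIG_HSA_AMD_SVM, as spelled later in this thread.

```python
import re

# Options discussed in this comment (the HSA spelling is an assumption, see above).
OPTIONS_TO_DISABLE = [
    "CONFIG_DRM_AMD_DC_HDCP",
    "CONFIG_HSA_AMD_SVM",
    "CONFIG_AMD_MEM_ENCRYPT",
]

CONFIG_LINE = re.compile(r"^(CONFIG_[A-Za-z0-9_]+)=(.*)$")


def enabled_options(config_text, options=OPTIONS_TO_DISABLE):
    """Return the entries of `options` that are still set to y or m in a .config."""
    enabled = set()
    for line in config_text.splitlines():
        match = CONFIG_LINE.match(line.strip())
        if match and match.group(2) in ("y", "m"):
            enabled.add(match.group(1))
    # Preserve the caller's ordering so the report is stable.
    return [opt for opt in options if opt in enabled]
```

An empty result means none of the listed options are enabled in that config. On Fedora-based systems the running kernel's config is typically found under /boot as config-<version>, but that location is an assumption for dom0.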

@isodude
Author

isodude commented Sep 30, 2021

There is a problem with installing xorg-x11-drv-amdgpu: X won't start, with errors about missing unwind information. I tried installing kernel-devel to make the amdgpu driver happy, but it did not help.

@andrewdavidwong andrewdavidwong added C: other hardware support needs diagnosis Requires technical diagnosis from developer. Replace with "diagnosed" or remove if otherwise closed. labels Oct 1, 2021
@andrewdavidwong andrewdavidwong added this to the Release 4.1 updates milestone Oct 1, 2021
@isodude
Author

isodude commented Oct 1, 2021

With the kernel compiled without the flags mentioned above, suspend/resume keeps working for many more cycles.

When X is running it still dies on 'failed to terminate hdcp ta', though.

I still cannot get the Xorg amdgpu driver to work, even when booting older kernels.

@isodude
Author

isodude commented Oct 1, 2021

For those wondering how to build xen, here is my builder.conf.

# Since it's a very upstream branch
INSECURE_SKIP_CHECKING = vmm-xen
GIT_URL_vmm_xen = https://github.com/marmarek/qubes-vmm-xen
BRANCH_vmm_xen = update-4.14.3
COMPONENTS = \
builder \
builder-rpm \
vmm-xen

BUILDER_PLUGINS += builder-rpm

@isodude
Author

isodude commented Oct 1, 2021

The amdgpu Xorg driver now works with xorg-x11-drv-amdgpu-21.0.0-1 (https://fedora.pkgs.org/33/fedora-updates-x86_64/xorg-x11-drv-amdgpu-21.0.0-1.fc33.x86_64.rpm.html), but it neither makes suspend/resume stable nor removes the jitter after resume.

@johnnyboy-3

Thanks for the help isodude.
Tried xen 4.14.3 and kernel 5.13.13 and resuming from suspend is still broken (Ryzen 2400G).
smt is off.

dom0 kernel: ------------[ cut here ]------------
dom0 kernel: WARNING: CPU: 1 PID: 0 at arch/x86/mm/tlb.c:462 switch_mm_irqs_off+0x381/0x3a0
dom0 kernel: Modules linked in: loop nf_tables nfnetlink rt2800usb rt2x00usb rt2800lib rt2x00lib mac80211 snd_hda_codec_realtek cfg80211 snd_hda_codec_gener>
dom0 kernel: CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.13.13-1.fc32.qubes.x86_64 #1
dom0 kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450M Pro4, BIOS P3.60 07/31/2019
dom0 kernel: RIP: e030:switch_mm_irqs_off+0x381/0x3a0
dom0 kernel: Code: 00 00 65 48 89 05 e7 8f fa 7e e9 77 fd ff ff b9 49 00 00 00 b8 01 00 00 00 31 d2 0f 30 e9 57 fd ff ff 41 89 f6 e9 9d fe ff ff <0f> 0b e8 >
dom0 kernel: RSP: e02b:ffffc900400afeb8 EFLAGS: 00010006
dom0 kernel: RAX: 000000000ea3c000 RBX: ffff8881002c4f00 RCX: 0000000000000040
dom0 kernel: RDX: ffff8881002c4f00 RSI: 0000000000000000 RDI: ffff88808ea3c000
dom0 kernel: RBP: ffffffff829d84e0 R08: 0000000000000000 R09: 0000000000000000
dom0 kernel: R10: 0000000000000004 R11: 0000000000000000 R12: ffff888100236a40
dom0 kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
dom0 kernel: FS: 0000000000000000(0000) GS:ffff888127240000(0000) knlGS:0000000000000000
dom0 kernel: CS: 10000e030 DS: 002b ES: 002b CR0: 0000000080050033
dom0 kernel: CR2: 00005bde388bd0e8 CR3: 0000000002810000 CR4: 0000000000050660
dom0 kernel: Call Trace:
dom0 kernel: switch_mm+0x1c/0x30
dom0 kernel: play_dead_common+0xa/0x20
dom0 kernel: xen_pv_play_dead+0xa/0x60
dom0 kernel: do_idle+0xd1/0xe0
dom0 kernel: cpu_startup_entry+0x19/0x20
dom0 kernel: asm_cpu_bringup_and_idle+0x5/0x1000
dom0 kernel: ---[ end trace 75177836fdaa3aca ]---
...
dom0 kernel: xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU1
dom0 kernel: xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU3
dom0 kernel: xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU5
dom0 kernel: xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU7
dom0 kernel: cpu 1 spinlock event irq 67
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
...
dom0 kernel: cpu 2 spinlock event irq 73
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
...
dom0 kernel: cpu 3 spinlock event irq 79
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
...
dom0 kernel: [drm] psp command (0x5) failed and response status is (0x0)
dom0 kernel: [drm:psp_hw_start [amdgpu]] ERROR PSP load tmr failed!
dom0 kernel: [drm:psp_resume [amdgpu]] ERROR PSP resume failed
dom0 kernel: [drm:amdgpu_device_fw_loading [amdgpu]] ERROR resume of IP block failed -22
dom0 kernel: amdgpu 0000:06:00.0: amdgpu: amdgpu_device_ip_resume failed (-22).
dom0 kernel: PM: dpm_run_callback(): pci_pm_resume+0x0/0xe0 returns -22
dom0 kernel: amdgpu 0000:06:00.0: PM: failed to resume async: error -22
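
The negative numbers in log lines like these are negated Linux errno values, so decoding them gives a quick hint about what each failure means. Below is a minimal sketch; the function name decode_errno is hypothetical, and the table only covers codes that actually occur in logs posted in this thread.

```python
import re

# Linux errno numbers for the codes that show up in this thread's logs.
KERNEL_ERRNO = {
    2: "ENOENT",
    6: "ENXIO",
    12: "ENOMEM",
    19: "ENODEV",
    22: "EINVAL",
    110: "ETIMEDOUT",
}

# A negative code following "error", "failed" or "returns", optionally in parentheses.
CODE = re.compile(r"(?:error|failed|returns)\s+\(?-(\d+)\)?")


def decode_errno(line):
    """Return (code, name) for the first negative error code in a log line, else None."""
    match = CODE.search(line)
    if match is None:
        return None
    number = int(match.group(1))
    return (-number, KERNEL_ERRNO.get(number, "unknown"))
```

Read this way, the (-19) from xen_acpi_processor is ENODEV, the -22 from the amdgpu resume path is EINVAL, and the -110 ring test failures seen elsewhere in the thread are timeouts.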

@johnnyboy-3

with smt on:

dom0 kernel: ------------[ cut here ]------------
dom0 kernel: WARNING: CPU: 1 PID: 0 at arch/x86/mm/tlb.c:462 switch_mm_irqs_off+0x381/0x3a0
dom0 kernel: Modules linked in: nf_tables nfnetlink rt2800usb rt2x00usb rt2800lib rt2x00lib mac80211 snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_>
dom0 kernel: CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.13.13-1.fc32.qubes.x86_64 #1
dom0 kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450M Pro4, BIOS P3.60 07/31/2019
dom0 kernel: RIP: e030:switch_mm_irqs_off+0x381/0x3a0
dom0 kernel: Code: 00 00 65 48 89 05 e7 8f fa 7e e9 77 fd ff ff b9 49 00 00 00 b8 01 00 00 00 31 d2 0f 30 e9 57 fd ff ff 41 89 f6 e9 9d fe ff ff <0f> 0b e8 >
dom0 kernel: RSP: e02b:ffffc900400afeb8 EFLAGS: 00010006
dom0 kernel: RAX: 00000001023e0000 RBX: ffff8881002c8000 RCX: 0000000000000040
dom0 kernel: RDX: ffff8881002c8000 RSI: 0000000000000000 RDI: ffff8881823e0000
dom0 kernel: RBP: ffffffff829d84e0 R08: 0000000000000000 R09: 0000000000000000
dom0 kernel: R10: 0000000000000008 R11: 0000000000000000 R12: ffff88810a523300
dom0 kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
dom0 kernel: FS: 0000000000000000(0000) GS:ffff888127240000(0000) knlGS:0000000000000000
dom0 kernel: CS: 10000e030 DS: 002b ES: 002b CR0: 0000000080050033
dom0 kernel: CR2: 00007202ec011726 CR3: 0000000002810000 CR4: 0000000000050660
dom0 kernel: Call Trace:
dom0 kernel: switch_mm+0x1c/0x30
dom0 kernel: play_dead_common+0xa/0x20
dom0 kernel: xen_pv_play_dead+0xa/0x60
dom0 kernel: do_idle+0xd1/0xe0
dom0 kernel: cpu_startup_entry+0x19/0x20
dom0 kernel: asm_cpu_bringup_and_idle+0x5/0x1000
dom0 kernel: ---[ end trace 38fb75148761bdb4 ]---
...
dom0 kernel: cpu 1 spinlock event irq 67
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
dom0 kernel: cpu 2 spinlock event irq 73
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
dom0 kernel: cpu 3 spinlock event irq 79
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
dom0 kernel: cpu 4 spinlock event irq 85
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
dom0 kernel: cpu 5 spinlock event irq 91
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
dom0 kernel: cpu 6 spinlock event irq 97
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
dom0 kernel: cpu 7 spinlock event irq 103
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
...
dom0 kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR ring gfx timeout, signaled seq=8448, emitted seq=8450
dom0 kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR Process information: process Xorg pid 3765 thread X:cs0 pid 3839

...
dom0 kernel: amdgpu 0000:06:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] ERROR ring gfx test failed (-110)
dom0 kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] ERROR resume of IP block <gfx_v9_0> failed -110
...
dom0 kernel: kfd kfd: amdgpu: error getting iommu info. is the iommu enabled?
dom0 kernel: kfd kfd: amdgpu: Error initializing iommuv2
dom0 kernel: kfd kfd: amdgpu: device 1002:15dd NOT added due to errors
...
dom0 kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR ring gfx timeout, but soft recovered

@isodude
Author

isodude commented Oct 1, 2021

@johnnyboy-3 do you have xorg-x11-drv-amdgpu installed?

@johnnyboy-3

johnnyboy-3 commented Oct 1, 2021

xorg-x11-drv-amdgpu v19.1.0-3 installed.

Also tried Linux Kernel 5.14.9-1 with the same bug.
This time with new errors in journalctl on resume:

dom0 kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR ring gfx timeout, signaled seq=9917, emitted seq=9919
dom0 kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR Process information: process Xorg pid 3038 thread X:cs0 pid 3819
dom0 kernel: amdgpu 0000:06:00.0: amdgpu: GPU reset begin!
dom0 kernel: amdgpu 0000:06:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] ERROR ring kiq_2.1.0 test failed (-110)
dom0 kernel: [drm] free PSP TMR buffer
dom0 kernel: [drm] psp command (0x7) failed and response status is (0x0)
dom0 kernel: [drm:psp_suspend [amdgpu]] ERROR Failed to terminate tmr
dom0 kernel: [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] ERROR suspend of IP block failed -22
dom0 kernel: ------------[ cut here ]------------
dom0 kernel: WARNING: CPU: 3 PID: 4326 at include/drm/ttm/ttm_bo_api.h:580 amdgpu_bo_unpin+0x5a/0xa0 [amdgpu]
dom0 kernel: Modules linked in: nf_tables nfnetlink rt2800usb rt2x00usb rt2800lib rt2x00lib mac80211 cfg80211 rfkill libarc4 snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec intel_rapl_msr intel_rapl_common snd_hda_core snd_hwdep joydev snd_seq snd_seq_device snd_pcm snd_timer snd soundcore wmi_bmof r8169 pcspkr sp5100_tco i2c_piix4 k10temp gpio_amdpt gpio_generic wmi video xenfs fuse ip_tables dm_thin_pool dm_persistent_data dm_bio_prison dm_crypt trusted asn1_encoder amdgpu crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel drm_ttm_helper ttm iommu_v2 ccp gpu_sched i2c_algo_bit drm_kms_helper cec drm xhci_pci xhci_pci_renesas xhci_hcd xen_acpi_processor xen_privcmd xen_pciback xen_blkback xen_gntalloc xen_gntdev xen_evtchn uinput
dom0 kernel: CPU: 3 PID: 4326 Comm: kworker/3:4 Tainted: G W 5.14.9-1.fc32.qubes.x86_64 #1
dom0 kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450M Pro4, BIOS P3.60 07/31/2019
dom0 kernel: Workqueue: events drm_sched_job_timedout [gpu_sched]
dom0 kernel: RIP: e030:amdgpu_bo_unpin+0x5a/0xa0 [amdgpu]
dom0 kernel: Code: 75 25 48 8b bd 48 01 00 00 48 85 ff 74 05 e8 3d e2 5e c1 48 8b 85 c0 01 00 00 8b 40 10 83 f8 02 74 24 83 f8 01 74 0d 5b 5d c3 <0f> 0b 8b 85 04 02 00 00 eb ca 48 8b 85 30 01 00 00 f0 48 29 83 50
dom0 kernel: RSP: e02b:ffffc9004242fcb0 EFLAGS: 00010246
dom0 kernel: RAX: 0000000000000000 RBX: ffff88810d385288 RCX: 0000000000000000
dom0 kernel: RDX: ffff888013cc8000 RSI: 0000000000000000 RDI: ffff88810a717800
dom0 kernel: RBP: ffff88810a717800 R08: 0000000000000003 R09: 000000000036d488
dom0 kernel: R10: ffffc9004242fad8 R11: ffffffff82947168 R12: ffff88810d385288
dom0 kernel: R13: ffff88810a717800 R14: ffff888107321c00 R15: 0000000000000000
dom0 kernel: FS: 0000000000000000(0000) GS:ffff8881272c0000(0000) knlGS:0000000000000000
dom0 kernel: CS: e030 DS: 0000 ES: 0000 CR0: 0000000080050033
dom0 kernel: CR2: 000074eab8891fb8 CR3: 0000000107554000 CR4: 0000000000050660
dom0 kernel: Call Trace:
dom0 kernel: amdgpu_gart_table_vram_unpin+0x54/0xc0 [amdgpu]
dom0 kernel: gmc_v9_0_hw_fini+0x5f/0x80 [amdgpu]
dom0 kernel: amdgpu_device_ip_suspend_phase2+0xc5/0x150 [amdgpu]
dom0 kernel: amdgpu_device_ip_suspend+0x32/0x60 [amdgpu]
dom0 kernel: amdgpu_device_pre_asic_reset+0xa8/0x250 [amdgpu]
dom0 kernel: amdgpu_device_gpu_recover.cold+0x53d/0x78e [amdgpu]
dom0 kernel: amdgpu_job_timedout+0x17a/0x1a0 [amdgpu]
dom0 kernel: drm_sched_job_timedout+0x74/0x110 [gpu_sched]
dom0 kernel: process_one_work+0x1ec/0x390
dom0 kernel: worker_thread+0x4a/0x320
dom0 kernel: ? process_one_work+0x390/0x390
dom0 kernel: kthread+0x10f/0x130
dom0 kernel: ? set_kthread_struct+0x40/0x40
dom0 kernel: ret_from_fork+0x22/0x30
dom0 kernel: ---[ end trace d480e2c68621aa89 ]---
dom0 kernel: amdgpu 0000:06:00.0: amdgpu: GPU reset succeeded, trying to resume
dom0 kernel: kfd kfd: amdgpu: Failed to resume IOMMU for device 1002:15dd
dom0 kernel: amdgpu 0000:06:00.0: amdgpu: GPU reset(2) failed
dom0 kernel: kfd kfd: amdgpu: Allocated 3969056 bytes on gart
dom0 kernel: kfd kfd: amdgpu: error getting iommu info. is the iommu enabled?
dom0 kernel: kfd kfd: amdgpu: Error initializing iommuv2
dom0 kernel: kfd kfd: amdgpu: device 1002:15dd NOT added due to errors
dom0 kernel: amdgpu 0000:06:00.0: amdgpu: GPU reset end with ret = -6
dom0 kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR ring gfx timeout, but soft recovered

@isodude
Author

isodude commented Oct 5, 2021

@johnnyboy-3 the correct kernel parameter should be nosmt btw. It's odd that xen_acpi_processor tries to send updates to XEN on thread number 2 on each processor, even though the kernel is booted with nosmt. It even says SMT: Disabled in boot.

@marmarek
Member

marmarek commented Oct 5, 2021

I think kernel doesn't have full knowledge which thread is running where, only Xen has direct access to that info. And in fact vcpu 2 of dom0 doesn't necessarily run on physical core/thread 2. This also means "nosmt" kernel option is not an effective mitigation against speculative execution bugs, when running under Xen.

@isodude
Author

isodude commented Oct 5, 2021

cool, so like a normal VM then. So like xen_acpi_processor trying to send up information about 16 cores can just be ignored.

Trying to understand and pin down exactly what makes the amdgpu drivers flip the switch and die on me when resuming, not sure which avenues are best to visit any longer in the debugging hunt.

@isodude
Author

isodude commented Oct 6, 2021

I just tried out kernel 5.15-rc5 and it's still the same behavior, however I had the laptop in sleep for the whole night and it woke up fine. Still this thing with artifacts around text sometimes when text is written to the screen.

I made one change, though: I moved ati_drv.so out of /usr/lib64/xorg/modules/drivers, and Xorg seems to behave much better now, even though I can't see any direct differences in Xorg.0.log. I managed to suspend/resume a solid three times before the amdgpu driver gave up on the SETUP_TMR command (which is now named in the log thanks to the newer kernel).

Just a note: one thing I'm concerned about is that I need to revert the commit "PCI/MSI: Use new mask/unmask functions". Somewhere between 5.15-rc1 and 5.15-rc2 the problem was fixed, but between rc2 and rc4 it broke again, so I have to bisect this. Since amdgpu dies hard on it, maybe it's a bug in their driver that only surfaces with the new mask/unmask functions.

The error that the kernel dies on this time is

[drm] psp gfx command SETUP_TMR(0x5) failed and response status is (0x0)
[drm:psp_hw_start [amdgpu]] *ERROR* PSP load tmr failed!
[drm:psp_resume [amdgpu]] *ERROR* PSP resume failed

I'm not sure whether this is the culprit, or whether amdgpu simply fails to load firmware on resume sometimes; I've seen HDCP fail as well. I've tried to disable TMR (Trusted Memory Region) by setting CONFIG_AMD_MEM_ENCRYPT=n. TBH I don't know what Xen's standpoint is on those features, maybe @marmarek knows? But in general the kernel dies on HDCP and TMR.

@isodude
Author

isodude commented Oct 6, 2021

Yay, latest kernel-ark with

CONFIG_SND_SOC_AMD_RENOIR=n
CONFIG_DRM_AMD_DC_HDCP=n
CONFIG_DRM_AMD_SECURE_DISPLAY=n
CONFIG_HSA_AMD_SVM=n
CONFIG_AMD_MEM_ENCRYPT=n

booting with kernel options pci=nomsi

Now it actually suspends/resumes correctly.

Attached is lspci -vv with Enable+ selected.
lscpi-msi.log
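
For anyone comparing their own machine against the attached log, the sketch below lists which PCI devices report MSI or MSI-X as enabled. It assumes the usual lspci -vv layout, where each device starts with an unindented address line and its capabilities follow on indented lines of the form "Capabilities: [a0] MSI: Enable+ ...". The function name msi_state is hypothetical.

```python
import re

# Capability lines look like "Capabilities: [a0] MSI: Enable+ Count=1/8 ..."
# or "Capabilities: [c0] MSI-X: Enable- Count=8 Masked-".
MSI_CAP = re.compile(r"Capabilities: \[[0-9a-f]+\] (MSI(?:-X)?): Enable([+-])")


def msi_state(lspci_text):
    """Map each PCI address to {"MSI": bool, "MSI-X": bool} for the capabilities it lists."""
    state = {}
    current = None
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            # Unindented line: start of a new device, first token is its address.
            current = line.split()[0]
            state[current] = {}
        elif current is not None:
            match = MSI_CAP.search(line)
            if match:
                state[current][match.group(1)] = match.group(2) == "+"
    return state
```

Devices whose dictionary still contains a True value after booting with pci=nomsi are the ones worth a closer look.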

@johnnyboy-3

Tried dom0 linux kernel 5.10.61 recompilation with mentioned kernel & boot options on R4.1 - no luck.

@isodude
Author

isodude commented Oct 6, 2021

@johnnyboy-3 I guess you need to be past the new MSI mask/unmask patches (somewhere between 5.14 and 5.15). I tried 5.12.14 and it was no go there.
I can update my linux-kernel-tree if you'd like.

I did manage to get a crash, around the 10th to 15th resume, roughly when the USB ports were reset. It feels like the problem may be in how USB is handled. I am trying to ignore 02:00.4 (the USB ports in the expansion port), but I lack the expertise to tell Xen to just ignore them. Soon I'll rip ehci out of the kernel :)

Looking at my lspci log it seems that xhci and ehci got MSI disabled, but not the other AMD PCI devices.

@isodude
Author

isodude commented Oct 8, 2021

15h sleep with 0.277Wh, that's pretty solid for S3!
5.15 Worked with these kernel configs and pci=nomsi.

CONFIG_AMD_PMC=y
CONFIG_HSA_AMD=n

These were set, but I don't think they make any difference.

CONFIG_DRM_AMD_DC_HDCP=n
CONFIG_DRM_AMD_SECURE_DISPLAY=n

Text-jitter is almost gone completely compared to before.

I am going to compile 5.14.9 and see how well it fares with CONFIG_AMD_PMC=y and CONFIG_HSA_AMD=n, because there is no need to disable MSI there. Then I'm going to bisect the MSI problems in 5.15.

@johnnyboy-3

That's some good news!

I can update my linux-kernel-tree if you'd like.

Thanks for your offer but I don't think that's necessary for now.
I wonder if this problem can be fixed on older kernels in Qubes R4.0 too.

@isodude
Author

isodude commented Oct 9, 2021

5.14.9 doesn't work that well out of the box. With pci=nomsi it's quirky (the external screen dies sometimes, the internal screen dies sometimes), but I've suspended/resumed at least 10 times now without a reboot. It does not work as well as 5.15 with pci=nomsi, though.

This is 5.14.9 (latest qubes-linux-kernel) with

CONFIG_AMD_PMC=y
CONFIG_HSA_AMD=n

Will try to get tip booted without pci=nomsi now, that should be fun!

@isodude
Author

isodude commented Oct 9, 2021

Thanks for your offer but I don't think that's necessary for now. I wonder if this problem can be fixed on older kernels in Qubes R4.0 too.

I'm pessimistic! There are a lot of changes between those kernels and the new ones.

@isodude
Author

isodude commented Oct 12, 2021

With some patches in msi drivers I got kernel 5.15 working.

X is restarting once in a while, but that's fine since X running inside VMs survive :) I guess that relates to my hacked up X amdgpu drivers.

@bigdx

bigdx commented Oct 13, 2021

Progress, yeah! ^^

With some patches in msi drivers I got kernel 5.15 working.

X is restarting once in a while, but that's fine since X running inside VMs survive :) I guess that relates to my hacked up X amdgpu drivers.

Are you running a clean R4.1 RC1, or did you add or change anything besides the modified 5.15 kernel and the MSI drivers? The kernel is self-compiled with CONFIG_AMD_PMC=y and CONFIG_HSA_AMD=n, right? Which MSI patches? Anything else?

I tried RC1 out of the box and with kernel-latest 5.14.10 (testing) but same issue as before, just to be sure ^^

@isodude
Author

isodude commented Oct 14, 2021

builder.conf:

GIT_URL_linux_kernel = https://github.com/isodude/qubes-linux-kernel
BRANCH_linux_kernel = devel-5.15

I haven't figured out how to make make get-sources work properly, so I download the source manually instead.

wget https://gitlab.com/cki-project/kernel-ark/-/archive/v5.15-rc5/kernel-ark-v5.15-rc5.tar.bz2

Unpack it, rename the folder to linux-5.15-rc5, and pack it again as .tar.

I'm compiling the kernel now to see if it really works with what I committed. It's quirky right now, but I haven't had to reboot the system yet.

@isodude
Author

isodude commented Jan 31, 2023

@mcku nice find, I'll try out those patches for sure! Seems like I already had them :)

@mcku

mcku commented Feb 11, 2023

After updating dom0 to the current stable, after casual use and suspend/resume, the screen was blank, and the device was not responsive. These are the logs that might be of interest:

This one began to appear after upgrading on Feb 10:

[drm:amdgpu_register_gpu_instance [amdgpu]] *ERROR* Cannot register more gpu instance

and multiple error lines like this:

amdgpu 0000:07:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on comp_1.1.0 (-110).

@mcku

mcku commented Feb 15, 2023

Today I was able to test with the tb3 dock. Unfortunately, external displays do not work with kernel version 6.1.7.

[drm:dc_link_allocate_mst_payload [amdgpu]] *ERROR* Failure: pbn_per_slot==0 not allowed. Cannot continue, returning DC_UNSUPPORTED_VALUE.

Downgraded to 6.0.12.

@mcku

mcku commented Feb 20, 2023

Upgraded the kernel to 6.1.12 from qubes testing repo. External screens through the thunderbolt dock are functional now, but the graphics acceleration got disabled.

@h01ger

h01ger commented Feb 20, 2023 via email

@h01ger

h01ger commented Feb 20, 2023 via email

@isodude
Author

isodude commented Mar 15, 2023

Suspend works properly with xen-4.14.5-19 + kernel-latest-6.1.12-1 + linux-firmware-20230123-135 (haven't tested earlier).
I ran glxgears in the background without a hitch.
The garbled text in the console still appears, but it disappears just as quickly. (Is it the case that after suspend one of the cores on the GPU is never spun up to proper speed?)

I'm running on an external screen (32" 4K) and it works.

I would be glad if someone could replicate this. If it works right now, we could at least mark this issue as solved and then figure out which commit is the good one. Please upgrade one thing at a time if you try to replicate. I will not do any further research myself, so that we have at least one working system :)

Here's my journal from suspend (journalctl -b -k -p warning -o cat)

ACPI BIOS Warning (bug): Incorrect checksum in table [BGRT] - 0x65, should be 0x9D (20220331/tbprint-174)
cpu 0 spinlock event irq 57
cpu 1 spinlock event irq 67
cpu 2 spinlock event irq 73
cpu 3 spinlock event irq 79
cpu 4 spinlock event irq 85
cpu 5 spinlock event irq 91
cpu 6 spinlock event irq 97
cpu 7 spinlock event irq 103
Grant table initialized
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
device-mapper: core: CONFIG_IMA_DISABLE_HTABLE is disabled. Duplicate IMA measurements will not be recorded in the IMA log.
sysfb: VRAM smaller than advertised
ccp 0000:04:00.2: tee: ring init command failed (0x00000005)
ccp 0000:04:00.2: tee: failed to init ring buffer
nvme nvme0: missing or invalid SUBNQN field.
amdgpu 0000:04:00.0: amdgpu: PSP runtime database doesn't exist
amdgpu 0000:04:00.0: amdgpu: PSP runtime database doesn't exist
[drm] psp gfx command LOAD_TA(0x1) failed and response status is (0x7)
[drm] psp gfx command INVOKE_CMD(0x3) failed and response status is (0x4)
amdgpu 0000:04:00.0: amdgpu: Secure display: Generic Failure.
amdgpu 0000:04:00.0: amdgpu: SECUREDISPLAY: query securedisplay TA failed. ret 0x0
[drm] DP Alt mode state on HPD: 1
amdgpu: SRAT table not found
Serial bus multi instantiate pseudo device driver INT3515:00: error -ENXIO: IRQ index 1 not found
Serial bus multi instantiate pseudo device driver INT3515:00: error -ENXIO: Error requesting irq at index 1
ioremap error for 0xfed80000-0xfed81000, requested 0x2, got 0x0
piix4_smbus 0000:00:14.0: SMBus base address mapping failed.
piix4_smbus: probe of 0000:00:14.0 failed with error -12
ioremap error for 0xfed80000-0xfed81000, requested 0x2, got 0x0
sp5100-tco sp5100-tco: Address mapping failed
sp5100-tco: probe of sp5100-tco failed with error -12
platform regulatory.0: Direct firmware load for regulatory.db failed with error -2
[drm] psp gfx command UNLOAD_TA(0x2) failed and response status is (0x117)
------------[ cut here ]------------
WARNING: CPU: 1 PID: 0 at arch/x86/mm/tlb.c:523 switch_mm_irqs_off+0x230/0x4a0
Modules linked in: snd_seq_dummy snd_hrtimer nf_tables nfnetlink vfat fat snd_soc_dmic snd_acp3x_rn snd_acp3x_pdm_dma snd_sof_amd_rembrandt snd_sof_amd_renoir snd_sof_amd_acp snd_sof_pci sn>
 crct10dif_pclmul crc32_pclmul drm_ttm_helper crc32c_intel polyval_clmulni ttm polyval_generic ucsi_acpi ghash_clmulni_intel sha512_ssse3 typec_ucsi serio_raw iommu_v2 ccp nvme thinkpad_acp>
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.1.12-1.qubes.fc32.x86_64 #1
Hardware name: LENOVO 20Y1S02400/20Y1S02400, BIOS R1BET72W(1.41 ) 06/27/2022
RIP: e030:switch_mm_irqs_off+0x230/0x4a0
Code: 48 01 ca 0f 82 84 02 00 00 48 c7 c1 00 00 00 80 48 2b 0d 53 18 99 01 48 01 ca 48 0b 15 41 08 bc 01 48 39 c2 0f 84 84 fe ff ff <0f> 0b e8 f9 fa ff ff e9 78 fe ff ff 0f b7 cb 0f b7 c3 4>
RSP: e02b:ffffc90040107ec0 EFLAGS: 00010006
RAX: 0000000002c10000 RBX: ffff888140840000 RCX: 0000777f80000000
RDX: 00000001023c2000 RSI: ffffffff82964cf1 RDI: ffffffff8291d7c0
RBP: ffffffff82dd0340 R08: 0000000000000008 R09: fffffffffffffffe
R10: 0000000000000000 R11: 0000000000000000 R12: ffff88810d4aa200
R13: ffff88810036d100 R14: 0000000000000001 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffff888140840000(0000) knlGS:0000000000000000
CS:  10000e030 DS: 002b ES: 002b CR0: 0000000080050033
CR2: 00007cfa1ffc0408 CR3: 0000000002c10000 CR4: 0000000000050660
Call Trace:
 <TASK>
 switch_mm+0x1a/0x30
 play_dead_common+0xa/0x20
 xen_pv_play_dead+0xa/0x50
 do_idle+0xd2/0xe0
 cpu_startup_entry+0x19/0x20
 cpu_bringup_and_idle+0x14/0x20
 asm_cpu_bringup_and_idle+0x5/0x10
 </TASK>
---[ end trace 0000000000000000 ]---
xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU1
xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU3
xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU5
xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU7
xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU9
xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU11
xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU13
xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU15
cpu 1 spinlock event irq 67
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
cpu 2 spinlock event irq 73
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
cpu 3 spinlock event irq 79
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
cpu 4 spinlock event irq 85
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
cpu 5 spinlock event irq 91
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
cpu 6 spinlock event irq 97
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
cpu 7 spinlock event irq 103
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[drm] psp gfx command LOAD_TA(0x1) failed and response status is (0x7)
[drm] psp gfx command INVOKE_CMD(0x3) failed and response status is (0x4)
amdgpu 0000:04:00.0: amdgpu: Secure display: Generic Failure.
amdgpu 0000:04:00.0: amdgpu: SECUREDISPLAY: query securedisplay TA failed. ret 0x0
[drm] DP Alt mode state on HPD: 1
[drm] psp gfx command UNLOAD_TA(0x2) failed and response status is (0x117)
xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU1
xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU3
xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU5
xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU7
xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU9
xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU11
xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU13
xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU15
cpu 1 spinlock event irq 67
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
cpu 2 spinlock event irq 73
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
cpu 3 spinlock event irq 79
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
cpu 4 spinlock event irq 85
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
cpu 5 spinlock event irq 91
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
cpu 6 spinlock event irq 97
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
cpu 7 spinlock event irq 103
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[drm] psp gfx command LOAD_TA(0x1) failed and response status is (0x7)
[drm] psp gfx command INVOKE_CMD(0x3) failed and response status is (0x4)
amdgpu 0000:04:00.0: amdgpu: Secure display: Generic Failure.
amdgpu 0000:04:00.0: amdgpu: SECUREDISPLAY: query securedisplay TA failed. ret 0x0
[drm] DP Alt mode state on HPD: 1

@h01ger

h01ger commented Mar 15, 2023 via email

@isodude
Author

isodude commented Mar 15, 2023 via email

@mcku

mcku commented Mar 27, 2023

I have been using the firmware tarball dated 20230310. Made sure it went through the initramfs. Without external monitors, everything is mostly fine, without any apparent lockups.

With external displays connected (through TB dock) things are similar to my earlier experience, with the feeling that the failures are less frequent. Specifically,

  • if I use the external dock's power button to un-suspend, the laptop gets reset immediately and the Lenovo logo displays on the screen, but the device can't boot thereafter and stays stuck there.
  • if I open the lid to un-suspend, then the device resumes fine, sometimes.

Also I've managed to observe some unresponsive state a few times when playing with the various conditions, such as disconnecting TB cord when the laptop is sleeping and then resuming.

@mcku

mcku commented Mar 29, 2023

if I use the external dock's power button to un-suspend, the laptop gets reset immediately and the Lenovo logo displays on the screen, but the device can't boot thereafter and stays stuck there.

I've realized that a firmware update was available for the dock. After applying it, the external button issue went away.

@isodude
Author

isodude commented Mar 29, 2023

Fresh install on a P14s gen 2, suspend worked properly.

A colleague tried on a P14s gen 1, and it worked as well.

Will do some more testing, but it's looking good.

@mcku

mcku commented Mar 29, 2023

Will do some more testing, but it's looking good.

I agree. For me it's okay to close this issue as you have suggested earlier.

@isodude
Author

isodude commented Mar 29, 2023 via email

@SurFlurer

SurFlurer commented Mar 30, 2023

Hi, I don't know what stopped me from achieving the same results, but I'm running on xen 4.14.5-20, kernel 6.2.6 and self-built linux-firmware 20230310, and amd-gpu-firmware 20230310. I also did run mkinitrd.
I still get psp resume errors.

Mar 30 17:39:17 dom0 kernel: amdgpu 0000:06:00.0: PM: failed to resume async: error -22
Mar 30 17:39:17 dom0 kernel: amdgpu 0000:06:00.0: PM: dpm_run_callback(): pci_pm_resume+0x0/0xe0 returns -22
Mar 30 17:39:17 dom0 kernel: amdgpu 0000:06:00.0: amdgpu: amdgpu_device_ip_resume failed (-22).
Mar 30 17:39:17 dom0 kernel: [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -22
Mar 30 17:39:17 dom0 kernel: [drm:psp_resume [amdgpu]] *ERROR* PSP resume failed
Mar 30 17:39:17 dom0 kernel: [drm:psp_hw_start [amdgpu]] *ERROR* PSP load tmr failed!
Mar 30 17:39:17 dom0 kernel: [drm] psp gfx command SETUP_TMR(0x5) failed and response status is (0x0)

I'm not getting

[drm] psp gfx command LOAD_TA(0x1) failed and response status is (0x7)
[drm] psp gfx command INVOKE_CMD(0x3) failed and response status is (0x4)
amdgpu 0000:04:00.0: amdgpu: Secure display: Generic Failure.
amdgpu 0000:04:00.0: amdgpu: SECUREDISPLAY: query securedisplay TA failed. ret 0x0
[drm] DP Alt mode state on HPD: 1

but I'm getting occasional

Mar 30 17:37:24 dom0 kernel: [drm] psp gfx command LOAD_IP_FW(0x6) failed and response status is (0x0)
Mar 30 17:37:24 dom0 kernel: [drm] failed to load ucode VCN0_RAM(0x3A)

and

Mar 30 17:35:37 dom0 kernel: [drm] psp gfx command INVOKE_CMD(0x3) failed and response status is (0x0)

What could I have done wrong? Thanks!

Update: With kernel 6.1.12, I'm getting a slightly different error:

Mar 30 18:17:28 dom0 kernel: amdgpu 0000:06:00.0: PM: failed to resume async: error -22
Mar 30 18:17:28 dom0 kernel: amdgpu 0000:06:00.0: PM: dpm_run_callback(): pci_pm_resume+0x0/0xe0 returns -22
Mar 30 18:17:28 dom0 kernel: amdgpu 0000:06:00.0: amdgpu: amdgpu_device_ip_resume failed (-22).
Mar 30 18:17:28 dom0 kernel: [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -22
Mar 30 18:17:28 dom0 kernel: [drm:psp_resume [amdgpu]] *ERROR* PSP resume failed
Mar 30 18:17:28 dom0 kernel: [drm] psp gfx command LOAD_IP_FW(0x6) failed and response status is (0x0)
Mar 30 18:17:28 dom0 kernel: [drm] failed to load ucode SDMA0(0x1) 
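
To compare failure patterns between kernels or machines, it can help to tally which PSP commands fail and with which status. The following is a minimal sketch, assuming the log lines follow the `psp gfx command NAME(0xID) failed and response status is (0xSTATUS)` shape quoted above. The function name `tally_psp_failures` and the sample text are illustrative only, not part of any existing tool.

```python
import re
from collections import Counter

# Matches lines such as:
#   [drm] psp gfx command LOAD_IP_FW(0x6) failed and response status is (0x0)
# The pattern is an assumption based on the log excerpts in this thread.
PSP_FAIL = re.compile(
    r"psp gfx command (?P<cmd>\w+)\((?P<id>0x[0-9A-Fa-f]+)\) failed "
    r"and response status is \((?P<status>0x[0-9A-Fa-f]+)\)"
)


def tally_psp_failures(log_text: str) -> Counter:
    """Count (command, status) pairs for failed PSP gfx commands."""
    counts = Counter()
    for line in log_text.splitlines():
        match = PSP_FAIL.search(line)
        if match:
            counts[(match.group("cmd"), match.group("status"))] += 1
    return counts


# Sample lines copied from the excerpts in this thread.
sample = """\
Mar 30 17:39:17 dom0 kernel: [drm] psp gfx command SETUP_TMR(0x5) failed and response status is (0x0)
Mar 30 17:37:24 dom0 kernel: [drm] psp gfx command LOAD_IP_FW(0x6) failed and response status is (0x0)
Mar 30 18:17:28 dom0 kernel: [drm] psp gfx command LOAD_IP_FW(0x6) failed and response status is (0x0)
[drm] psp gfx command LOAD_TA(0x1) failed and response status is (0x7)
"""

for (cmd, status), n in sorted(tally_psp_failures(sample).items()):
    print(f"{cmd} status={status}: {n}")
```

In practice the text could come from something like `journalctl -k -b -o cat` saved to a file and read in with `open(...).read()`.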

@mcku

mcku commented Mar 30, 2023

Hi

Mar 30 17:39:17 dom0 kernel: amdgpu 0000:06:00.0: PM: failed to resume async: error -22
Mar 30 17:39:17 dom0 kernel: amdgpu 0000:06:00.0: PM: dpm_run_callback(): pci_pm_resume+0x0/0xe0 returns -22
Mar 30 17:39:17 dom0 kernel: amdgpu 0000:06:00.0: amdgpu: amdgpu_device_ip_resume failed (-22).
Mar 30 17:39:17 dom0 kernel: [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -22
Mar 30 17:39:17 dom0 kernel: [drm:psp_resume [amdgpu]] *ERROR* PSP resume failed
Mar 30 17:39:17 dom0 kernel: [drm:psp_hw_start [amdgpu]] *ERROR* PSP load tmr failed!
Mar 30 17:39:17 dom0 kernel: [drm] psp gfx command SETUP_TMR(0x5) failed and response status is (0x0)

I was receiving this exact one on Mar 27 and earlier. It has not happened since then. (I upgraded the linux firmware on the 21st and the TB dock firmware yesterday.)

I'm not getting

[drm] psp gfx command LOAD_TA(0x1) failed and response status is (0x7)
[drm] psp gfx command INVOKE_CMD(0x3) failed and response status is (0x4)
amdgpu 0000:04:00.0: amdgpu: Secure display: Generic Failure.
amdgpu 0000:04:00.0: amdgpu: SECUREDISPLAY: query securedisplay TA failed. ret 0x0
[drm] DP Alt mode state on HPD: 1

I am still getting these above, latest is today, ended with a lockup. But this only happens if I fiddle with the thunderbolt connection or dock's button during sleep.

but I'm getting occasional

Mar 30 17:37:24 dom0 kernel: [drm] psp gfx command LOAD_IP_FW(0x6) failed and response status is (0x0)
Mar 30 17:37:24 dom0 kernel: [drm] failed to load ucode VCN0_RAM(0x3A)

I don't receive these errors.

and

Mar 30 17:35:37 dom0 kernel: [drm] psp gfx command INVOKE_CMD(0x3) failed and response status is (0x0)

I don't receive this either.

@SurFlurer

I was receiving this exact one on Mar 27 and earlier. It has not happened since then.

Maybe it's a low-probability incident, low enough that it might not happen for days. Sometimes I need 30 or 40 suspends to get that error. Sometimes it happens on the first suspend after reboot.

I am still getting these above, latest is today, ended with a lockup.

It's weird that I never see this error. Maybe that's why I get that TMR failed error.
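
One way to reason about the "low probability" point: if each suspend/resume cycle fails independently with some small probability `p` (an assumption, not something measured in this thread), the chance of a clean run of `n` cycles is `(1 - p) ** n`. The sketch below uses `p = 1/35` purely as an illustrative guess based on the "30 or 40 suspends" remark. The function names are made up for this example.

```python
def prob_no_failure(p_fail: float, cycles: int) -> float:
    """Probability that `cycles` independent suspend/resume cycles all succeed."""
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    if cycles < 0:
        raise ValueError("cycles must be non-negative")
    return (1.0 - p_fail) ** cycles


def expected_cycles_to_first_failure(p_fail: float) -> float:
    """Mean of the geometric distribution: average cycles until the first failure."""
    if not 0.0 < p_fail <= 1.0:
        raise ValueError("p_fail must be in (0, 1]")
    return 1.0 / p_fail


# Assumed per-cycle failure rate; an illustrative guess, not a measurement.
p = 1 / 35

for n in (1, 10, 35, 100):
    print(f"{n:>3} cycles without a failure: {prob_no_failure(p, n):.1%}")
print(f"expected cycles until first failure: {expected_cycles_to_first_failure(p):.0f}")
```

Under that model, several days without an error and an error on the first suspend after reboot are both unsurprising, so a short clean streak says little about whether the bug is gone.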

@SurFlurer

I found a failed suspend test with kernel 6.2.6 on an AMD system at https://openqa.qubes-os.org/tests/69638.

I'm wondering if the TMR failure has actually been resolved by newer firmware?

@SurFlurer

Since this failure seems totally random (it keeps happening across this issue and was also raised at #8142), I doubt this timeout for psp resume is really sufficient for our scenario. However, I'm currently having difficulty compiling a kernel to test.

@SurFlurer

Increasing the timeout from 20000 to 200000 seems to provide stable suspend/resume for me.

Now I think that this issue can be marked as resolved, too.

If the timeout issue is specific to low-end AMD chips that cannot resume psp in 20000 us, I feel confident that I can handle that myself.
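
To illustrate why a tenfold larger budget can turn a failed resume into a successful one, here is a minimal sketch of a bounded polling loop. It is not the amdgpu driver code: `wait_for_fence` and `make_slow_device` are hypothetical names, the "device" is simulated, and the sketch assumes the 20000 and 200000 values act as a poll-count budget.

```python
import time
from typing import Callable


def wait_for_fence(
    is_signalled: Callable[[], bool],
    max_polls: int,
    poll_delay_s: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until is_signalled() is true or max_polls attempts are used up."""
    for _ in range(max_polls):
        if is_signalled():
            return True
        sleep(poll_delay_s)
    return False  # budget exhausted: the caller treats this as a timeout


def make_slow_device(ready_after_polls: int) -> Callable[[], bool]:
    """Hypothetical device that only reports ready on its Nth poll."""
    calls = 0

    def is_signalled() -> bool:
        nonlocal calls
        calls += 1
        return calls >= ready_after_polls

    return is_signalled


def no_sleep(_seconds: float) -> None:
    """Stand-in for sleeping so the demo finishes instantly."""


# A simulated device that needs 50,000 polls before it responds.
print(wait_for_fence(make_slow_device(50_000), max_polls=20_000, sleep=no_sleep))
print(wait_for_fence(make_slow_device(50_000), max_polls=200_000, sleep=no_sleep))
```

With the smaller budget the loop gives up before the simulated device responds, while the larger budget lets it finish. The trade-off is that a genuinely hung device takes ten times longer to be reported as failed.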

@isodude
Author

isodude commented Apr 15, 2023

I have issues with PVM VMs hanging on resume, I'll check if increasing timeout fixes anything

@marmarek
Member

I have issues with PVM VMs hanging on resume, I'll check if increasing timeout fixes anything

#8139 (comment) ?

@isodude
Author

isodude commented Apr 15, 2023

I have issues with PVM VMs hanging on resume, I'll check if increasing timeout fixes anything

#8139 (comment) ?

It does indeed solve it.

@isodude
Author

isodude commented Apr 16, 2023

I'm going to do something crazy now. With the latest kernel, suspend/resume now works on two laptops here, including the original laptop that started this issue. The glitching issues in the terminal are gone as well. I'm writing this with an external 4K monitor connected.

As time drags on, issues like this tend to stay unsolved, but as they stay unsolved a certain familiarity starts to arise: my laptop cannot suspend. I have to learn a new way of living, powering off each time, which over time becomes ingrained in my way of life. The new me, the poweroff guy. I had low points where I tried to switch from Qubes to something that just works, but it felt off. Nothing is like the feeling of disp-vms all over the place and having to copy-paste twice all the time. Debugging new kernels. Committing fixes to the kernel. Reaching out to unknown people within the Xen and kernel communities. Learning about the awesome people at AMD who dedicate their time to contributing to the kernel and dealing with all the gamers who try out AMD on Linux.

This suspend/resume issue actually got me into kernel hacking, and I think it makes quite a story. My SO has shaken her head so many times about me throwing so much time at it. It's fun, I say. I learn!

I do not know what lies ahead, but I must say that I appreciate the work of the Qubes people keeping track of, fixing and maintaining all the pipelines. An especially big shout-out to @marmarek with his infinite wisdom :) And the joy of seeing separate people just solving odd things (like the SecureBoot/TPM). In some ways the world is a pretty dark place, but these kinds of things make me smile. @SurFlurer thanks for replicating the fixes!

See you in the next issue.

$ journalctl -e -p warning -k -b -o cat
Serial bus multi instantiate pseudo device driver INT3515:00: error -ENXIO: IRQ index 1 not found
Serial bus multi instantiate pseudo device driver INT3515:00: error -ENXIO: Error requesting irq at index 1
platform regulatory.0: Direct firmware load for regulatory.db failed with error -2
ipmi_si: Unable to find any System Interface(s)
pciback 0000:06:00.3: xen_pciback: enabling permissive mode configuration space accesses!
pciback 0000:06:00.3: xen_pciback: permissive mode is potentially unsafe!
pciback 0000:05:00.0: xen_pciback: enabling permissive mode configuration space accesses!
pciback 0000:05:00.0: xen_pciback: permissive mode is potentially unsafe!
pciback 0000:06:00.4: xen_pciback: enabling permissive mode configuration space accesses!
pciback 0000:06:00.4: xen_pciback: permissive mode is potentially unsafe!
pciback 0000:03:00.0: xen_pciback: enabling permissive mode configuration space accesses!
pciback 0000:03:00.0: xen_pciback: permissive mode is potentially unsafe!
pciback 0000:04:00.0: xen_pciback: enabling permissive mode configuration space accesses!
pciback 0000:04:00.0: xen_pciback: permissive mode is potentially unsafe!
xen:grant_table: maptrack limit reached, can't map all guest pages
xen:grant_table: maptrack limit reached, can't map all guest pages
xen:grant_table: maptrack limit reached, can't map all guest pages
xen:grant_table: maptrack limit reached, can't map all guest pages
xen:grant_table: maptrack limit reached, can't map all guest pages
xen:grant_table: maptrack limit reached, can't map all guest pages
xen:grant_table: maptrack limit reached, can't map all guest pages
xen:grant_table: maptrack limit reached, can't map all guest pages
xen:grant_table: maptrack limit reached, can't map all guest pages
xen:grant_table: maptrack limit reached, can't map all guest pages
gnttab_map_refs: 272737 callbacks suppressed
xen:grant_table: maptrack limit reached, can't map all guest pages
xen:grant_table: maptrack limit reached, can't map all guest pages
xen:grant_table: maptrack limit reached, can't map all guest pages
xen:grant_table: maptrack limit reached, can't map all guest pages
xen:grant_table: maptrack limit reached, can't map all guest pages
xen:grant_table: maptrack limit reached, can't map all guest pages
xen:grant_table: maptrack limit reached, can't map all guest pages
xen:grant_table: maptrack limit reached, can't map all guest pages
xen:grant_table: maptrack limit reached, can't map all guest pages
xen:grant_table: maptrack limit reached, can't map all guest pages
gnttab_map_refs: 13469 callbacks suppressed
xen:grant_table: maptrack limit reached, can't map all guest pages
xen:grant_table: maptrack limit reached, can't map all guest pages
xen:grant_table: maptrack limit reached, can't map all guest pages
xen:grant_table: maptrack limit reached, can't map all guest pages
xen:grant_table: maptrack limit reached, can't map all guest pages
xen:grant_table: maptrack limit reached, can't map all guest pages
xen:grant_table: maptrack limit reached, can't map all guest pages
xen:grant_table: maptrack limit reached, can't map all guest pages
xen:grant_table: maptrack limit reached, can't map all guest pages
xen:grant_table: maptrack limit reached, can't map all guest pages
[drm] psp gfx command UNLOAD_TA(0x2) failed and response status is (0x117)
------------[ cut here ]------------
WARNING: CPU: 1 PID: 0 at arch/x86/mm/tlb.c:523 switch_mm_irqs_off+0x230/0x4a0
Modules linked in: loop snd_seq_dummy snd_hrtimer nf_tables nfnetlink vfat fat snd_acp3x_pdm_dma snd_soc_dmic snd_acp3x_rn snd_sof_amd_rembrandt snd_>
 dm_persistent_data dm_bio_prison dm_crypt hid_multitouch amdgpu drm_ttm_helper ttm crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni iommu_>
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.2.10-1.qubes.fc32.x86_64 #1
Hardware name: LENOVO 20Y1S02400/20Y1S02400, BIOS R1BET72W(1.41 ) 06/27/2022
RIP: e030:switch_mm_irqs_off+0x230/0x4a0
Code: 48 01 ca 0f 82 84 02 00 00 48 c7 c1 00 00 00 80 48 2b 0d 53 d8 90 01 48 01 ca 48 0b 15 a1 aa ba 01 48 39 c2 0f 84 84 fe ff ff <0f> 0b e8 a9 fa >
RSP: e02b:ffffc90040107ec0 EFLAGS: 00010006
RAX: 0000000002c10000 RBX: ffff888140840000 RCX: 0000777f80000000
RDX: 000000010c22c000 RSI: ffffffff828f46ac RDI: ffffffff828acf70
RBP: ffffffff82dcb080 R08: 0000000000000008 R09: fffffffffffffffe
R10: 0000000000000000 R11: 0000000000000000 R12: ffff88810c540980
R13: ffff888100368000 R14: 0000000000000001 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffff888140840000(0000) knlGS:0000000000000000
CS:  10000e030 DS: 002b ES: 002b CR0: 0000000080050033
CR2: 00007d8ea8368000 CR3: 0000000002c10000 CR4: 0000000000050660
Call Trace:
 <TASK>
 switch_mm+0x1a/0x30
 play_dead_common+0xa/0x20
 xen_pv_play_dead+0xa/0x50
 do_idle+0xd2/0xe0
 cpu_startup_entry+0x19/0x20
 cpu_bringup_and_idle+0x14/0x20
 asm_cpu_bringup_and_idle+0x5/0x10
 </TASK>
---[ end trace 0000000000000000 ]---
xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU1
xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU3
xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU5
xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU7
xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU9
xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU11
xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU13
xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU15
cpu 1 spinlock event irq 67
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
cpu 2 spinlock event irq 73
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
cpu 3 spinlock event irq 79
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
cpu 4 spinlock event irq 85
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
cpu 5 spinlock event irq 91
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
cpu 6 spinlock event irq 97
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
cpu 7 spinlock event irq 103
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[drm] psp gfx command LOAD_TA(0x1) failed and response status is (0x7)
[drm] psp gfx command INVOKE_CMD(0x3) failed and response status is (0x4)
amdgpu 0000:06:00.0: amdgpu: Secure display: Generic Failure.
amdgpu 0000:06:00.0: amdgpu: SECUREDISPLAY: query securedisplay TA failed. ret 0x0
[drm] DP Alt mode state on HPD: 1
$ rpm -q linux-firmware xen kernel-latest
linux-firmware-20230123-135.fc32.noarch
xen-4.14.5-20.fc32.x86_64
kernel-latest-6.2.10-1.qubes.fc32.x86_64
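
The warning log above is dominated by long runs of identical lines, which makes the few distinct messages hard to spot. A minimal sketch for collapsing consecutive duplicates into a line plus a count (similar in spirit to `uniq -c`) is shown below. The helper name `collapse_repeats` is made up for this example, and the sample lines are copied from the output above.

```python
from itertools import groupby
from typing import Iterable, Iterator, Tuple


def collapse_repeats(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (line, count) for each run of consecutive identical lines."""
    for line, run in groupby(lines):
        yield line, sum(1 for _ in run)


MAPTRACK = "xen:grant_table: maptrack limit reached, can't map all guest pages"

sample_lines = [
    MAPTRACK,
    MAPTRACK,
    MAPTRACK,
    "gnttab_map_refs: 13469 callbacks suppressed",
    MAPTRACK,
    MAPTRACK,
]

for line, count in collapse_repeats(sample_lines):
    suffix = f"  (x{count})" if count > 1 else ""
    print(f"{line}{suffix}")
```

Only adjacent duplicates are merged, so the order of events in the log is preserved.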

@isodude isodude closed this as completed Apr 16, 2023
@mcku

mcku commented Apr 17, 2023

This update has apparently solved all of the issues I was experiencing. Even the power button on the external dock can resume the laptop while the laptop's lid is closed (as expected).
Amazing :-)

Wow! I would like to thank those who have fixed this once more.
