System sometimes reboots on resume - AMDGPU issue? #8142

Closed
marmarek opened this issue Apr 15, 2023 · 5 comments · Fixed by QubesOS/qubes-linux-kernel#768
Labels
affects-4.2 This issue affects Qubes OS 4.2. C: kernel diagnosed Technical diagnosis has been performed (see issue comments). hardware support P: default Priority: default. Default priority for new issues, to be replaced given sufficient information. pr submitted A pull request has been submitted for this issue. T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists.

@marmarek
Member

Observation

The openQA test in scenario qubesos-4.2-pull-requests-x86_64-system_tests_suspend@hw1 fails in suspend:

[  419.980876] [drm] reserve 0x400000 from 0xf41f800000 for PSP TMR
[  422.259183] [drm] psp gfx command SETUP_TMR(0x5) failed and response status is (0x0)
[  422.259189] [drm:psp_hw_start [amdgpu]] *ERROR* PSP load tmr failed!
[  422.259818] [drm:psp_resume [amdgpu]] *ERROR* PSP resume failed
[  422.260388] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -22
[  422.260915] amdgpu 0000:05:00.0: amdgpu: amdgpu_device_ip_resume failed (-22).
[  422.260919] amdgpu 0000:05:00.0: PM: dpm_run_callback(): pci_pm_resume+0x0/0xe0 returns -22
[  422.260938] amdgpu 0000:05:00.0: PM: failed to resume async: error -22
[  422.271074] PM: resume devices took 2.910 seconds
[  422.272998] OOM killer enabled.
[  422.273007] Restarting tasks ... done.
[  422.276033] random: crng reseeded on system resumption
[  422.289145] PM: suspend exit

and later:

[  438.852562] amdgpu 0000:05:00.0: amdgpu: recover vram bo from shadow start
[  438.852574] amdgpu 0000:05:00.0: amdgpu: recover vram bo from shadow done
[  438.852596] amdgpu 0000:05:00.0: amdgpu: GPU reset(2) succeeded!
[  439.353495] [drm] Fence fallback timer expired on ring sdma0
[  439.353541] [drm] Fence fallback timer expired on ring gfx_low
[  439.353601] [drm] Skip scheduling IBs!
(XEN) AMD-Vi: IO_PAGE_FAULT: 0000:05:00.0 d0 addr 000000cfe2d72000 flags 0x10 PR
[  439.354182] amdgpu 0000:05:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:173 vmid:6 pasid:32770, for process Xorg pid 4417 thread X:cs0 pid 4497)
[  439.354204] amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x0000800104ac0000 from IH client 0x1b (UTCL2)
[  439.354219] amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x0064115B
[  439.354227] amdgpu 0000:05:00.0: amdgpu: 	 Faulty UTCL2 client ID: TCP (0x8)
[  439.354236] amdgpu 0000:05:00.0: amdgpu: 	 MORE_FAULTS: 0x1
[  439.354243] amdgpu 0000:05:00.0: amdgpu: 	 WALKER_ERROR: 0x5
[  439.354250] amdgpu 0000:05:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x5
[  439.354257] amdgpu 0000:05:00.0: amdgpu: 	 MAPPING_ERROR: 0x1
[  439.354264] amdgpu 0000:05:00.0: amdgpu: 	 RW: 0x1
[  439.354273] amdgpu 0000:05:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:173 vmid:6 pasid:32770, for process Xorg pid 4417 thread X:cs0 pid 4497)
[  439.354287] amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x0000800104ac0000 from IH client 0x1b (UTCL2)
(and so on)
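
For readers who want to cross-check the decoded fault fields, here is a minimal Python sketch that splits the `VM_L2_PROTECTION_FAULT_STATUS` value from the log into its fields. The bit layout in `FIELDS` is an assumption (it is not stated in this issue); it is only checked against the decoded values the driver printed above. The names `FIELDS` and `decode_fault_status` are made up for this illustration.

```python
# Assumed bit layout: (shift, width) per field. This is NOT taken from the
# issue; it is only validated against the values printed in the log above.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),  # "Faulty UTCL2 client ID" in the log
    "RW": (18, 1),
}


def decode_fault_status(status: int) -> dict:
    """Extract each field from the raw register value by shift and mask."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


# Register value copied from the log line at 439.354219.
decoded = decode_fault_status(0x0064115B)
for name, value in decoded.items():
    print(f"{name}: {value:#x}")
```

Under that assumed layout, the decoded fields agree with the log: `WALKER_ERROR` and `PERMISSION_FAULTS` are both 0x5, the client ID is 0x8, and the remaining single-bit fields are 0x1.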

I don't see an explicit panic, but the system eventually reboots.

This issue happens only sometimes.

Test suite description

Perform S3 on an AMD Renoir-based laptop.

Reproducible

Fails since (at least) Build 2023041413-4.2 (current job)

Expected result

Last good: 2023041315-4.2 (or more recent)

Further details

Always latest result in this scenario: latest

@marmarek marmarek added T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists. C: kernel labels Apr 15, 2023
@marmarek marmarek added this to the Release 4.2 milestone Apr 15, 2023
@isodude

isodude commented Apr 15, 2023

My new laptop with no extra RAM sticks does not seem to suffer from this.

@SurFlurer

SurFlurer commented Apr 15, 2023

I doubt this timeout for psp resume is really sufficient for our scenario.

I tested with that timeout increased aggressively by 10 times (to 200000 us, I guess), and all those "psp gfx command XXXX" failures seem to go away.

It DOES seem that suspending and resuming dom0 on Xen is slow compared to suspending and resuming bare-metal Linux on the same machine.
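
To illustrate the reasoning, here is a rough Python simulation of a poll-until-response loop with a fixed time budget. It is not the amdgpu driver's code. The 20000 us baseline is inferred only from the "10 times (to 200000 us)" figure above, and the 50000 us response latency, the poll interval, and the name `wait_for_response` are hypothetical values chosen for the example.

```python
def wait_for_response(ready_after_us: int, timeout_us: int,
                      poll_interval_us: int = 10) -> bool:
    """Poll a simulated device until it responds or the budget runs out.

    Time is simulated by a counter, so nothing actually sleeps.
    """
    elapsed_us = 0
    while elapsed_us < timeout_us:
        if elapsed_us >= ready_after_us:
            return True  # response arrived within the budget
        elapsed_us += poll_interval_us
    return False  # budget exhausted: the caller would report a failure


BASE_TIMEOUT_US = 20_000                # inferred: 200000 us / 10
RAISED_TIMEOUT_US = 10 * BASE_TIMEOUT_US

slow_response_us = 50_000               # hypothetical latency under Xen

base_ok = wait_for_response(slow_response_us, BASE_TIMEOUT_US)
raised_ok = wait_for_response(slow_response_us, RAISED_TIMEOUT_US)
print(base_ok, raised_ok)
```

In this toy model, a response that arrives later than the baseline budget is reported as a failure even though the device would have answered, which would match the observation that the errors disappear once the timeout is raised.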

@andrewdavidwong andrewdavidwong added needs diagnosis Requires technical diagnosis from developer. Replace with "diagnosed" or remove if otherwise closed. P: default Priority: default. Default priority for new issues, to be replaced given sufficient information. hardware support labels Apr 15, 2023
@Augsch123

2023041823-4.2 showed another non-fatal error message:

[  425.229052] [drm] failed to load ucode VCN0_RAM(0x3A) 
[  425.229057] [drm] psp gfx command LOAD_IP_FW(0x6) failed and response status is (0x0)

Also see this comment. Maybe the timeout is actually what makes a difference.
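
To compare which PSP commands fail across runs, a small Python sketch like the following could pull the failing commands out of a saved kernel log. The regular expression is written against the message format seen in this thread only; other kernel versions may word the message differently, and the names `PSP_FAILURE` and `find_psp_failures` are made up for this example.

```python
import re

# Matches lines such as:
# [  422.259183] [drm] psp gfx command SETUP_TMR(0x5) failed and response status is (0x0)
PSP_FAILURE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\[drm\] psp gfx command "
    r"(?P<cmd>\w+)\((?P<id>0x[0-9A-Fa-f]+)\) failed and response status is "
    r"\((?P<status>0x[0-9A-Fa-f]+)\)"
)


def find_psp_failures(log_text: str) -> list:
    """Return (timestamp, command name, command id, status) per failure."""
    return [
        (float(m["ts"]), m["cmd"], int(m["id"], 16), int(m["status"], 16))
        for m in PSP_FAILURE.finditer(log_text)
    ]


# Lines copied from the logs quoted in this issue.
SAMPLE = """\
[  422.259183] [drm] psp gfx command SETUP_TMR(0x5) failed and response status is (0x0)
[  422.259189] [drm:psp_hw_start [amdgpu]] *ERROR* PSP load tmr failed!
[  425.229052] [drm] failed to load ucode VCN0_RAM(0x3A)
[  425.229057] [drm] psp gfx command LOAD_IP_FW(0x6) failed and response status is (0x0)
"""

failures = find_psp_failures(SAMPLE)
for ts, cmd, cmd_id, status in failures:
    print(f"{ts:.6f} {cmd} id={cmd_id:#x} status={status:#x}")
```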

@Augsch123

Augsch123 commented May 6, 2023

2023042706-4.1 and 2023050602-4.2 repeated the above LOAD_IP_FW(0x6) error.

@Augsch123

A fatal error occurred on https://openqa.qubes-os.org/tests/73993. This is more explicitly related to timeout.

@andrewdavidwong andrewdavidwong added diagnosed Technical diagnosis has been performed (see issue comments). pr submitted A pull request has been submitted for this issue. and removed needs diagnosis Requires technical diagnosis from developer. Replace with "diagnosed" or remove if otherwise closed. labels Jun 1, 2023
@andrewdavidwong andrewdavidwong added the affects-4.2 This issue affects Qubes OS 4.2. label Aug 8, 2023

5 participants