[Vulkan] amdgpu crashes with e2e_matmul_vulkan_f16_large_rdna3_vulkan-spirv_vulkan #19465

Open
kuhar opened this issue Dec 11, 2024 · 0 comments
Labels
bug 🐞 Something isn't working codegen/spirv SPIR-V code generation compiler backend hal/vulkan Runtime Vulkan GPU HAL backend

kuhar commented Dec 11, 2024

What happened?

Running the e2e_matmul_vulkan_f16_large_rdna3_vulkan-spirv_vulkan test on the W7900 GPU crashes the amdgpu kernel driver. dmesg shows a general protection fault in amdvlk64.so, followed by amdgpu resume messages:

sudo dmesg
[497363.032294] traps: iree-e2e-matmul[2158766] general protection fault ip:74f917453420 sp:7fffb4a6d860 error:0 in amdvlk64.so[74f9167dc000+2951000]
[497398.585931] workqueue: pm_runtime_work hogged CPU for >10000us 8 times, consider switching to WQ_UNBOUND
[497534.408782] [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
[497534.408819] [drm] PSP is resuming...
[497534.564718] [drm] reserve 0x1300000 from 0x8b3c000000 for PSP TMR
[497534.687165] amdgpu 0000:03:00.0: amdgpu: GECC is enabled
[497534.702623] amdgpu 0000:03:00.0: amdgpu: RAP: optional rap ta ucode is not available
[497534.702627] amdgpu 0000:03:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[497534.702630] amdgpu 0000:03:00.0: amdgpu: SMU is resuming...
[497534.702634] amdgpu 0000:03:00.0: amdgpu: smu driver if version = 0x0000003d, smu fw if version = 0x0000003b, smu fw program = 0, smu fw version = 0x004e5500 (78.85.0)
[497534.702636] amdgpu 0000:03:00.0: amdgpu: SMU driver if version not matched
[497534.845431] amdgpu 0000:03:00.0: amdgpu: SMU is resumed successfully!
[497534.847728] [drm] DMUB hardware initialized: version=0x07002900
[497534.855408] [drm] kiq ring mec 3 pipe 1 q 0
[497534.860350] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[497534.860521] amdgpu 0000:03:00.0: [drm:jpeg_v4_0_hw_init [amdgpu]] JPEG decode initialized successfully.
[497534.861050] amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[497534.861053] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[497534.861054] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[497534.861055] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[497534.861056] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[497534.861057] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[497534.861058] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[497534.861059] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[497534.861060] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[497534.861062] amdgpu 0000:03:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[497534.861063] amdgpu 0000:03:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[497534.861064] amdgpu 0000:03:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[497534.861065] amdgpu 0000:03:00.0: amdgpu: ring vcn_unified_1 uses VM inv eng 1 on hub 8
[497534.861066] amdgpu 0000:03:00.0: amdgpu: ring jpeg_dec uses VM inv eng 4 on hub 8
[497534.861068] amdgpu 0000:03:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 14 on hub 0
[497534.863835] amdgpu 0000:03:00.0: [drm] Cannot find any crtc or sizes

The Vulkan validation layers print some errors, but they seem unrelated, since other tests pass.

This has been previously observed by @krzysz00.
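
To make the fault location easier to compare across runs, the `traps:` line in the log above can be reduced to a module-relative offset. The following is a minimal sketch, not part of IREE or AMDVLK tooling: the `parse_trap` helper and its regular expression are hypothetical and assume the `module[base+size]` line format shown in this log. The resulting offset is relative to the start of the mapping, which is not necessarily a file offset into amdvlk64.so.

```python
import re

# Hypothetical helper, assuming the "traps:" line format seen in this log:
#   traps: <comm>[<pid>] <description> ip:<hex> sp:<hex> error:<hex> in <module>[<base>+<size>]
TRAP_RE = re.compile(
    r"traps: (?P<comm>\S+)\[(?P<pid>\d+)\] (?P<desc>.+?) "
    r"ip:(?P<ip>[0-9a-f]+) sp:(?P<sp>[0-9a-f]+) error:(?P<error>[0-9a-f]+)"
    r" in (?P<module>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_trap(line):
    """Return details of a 'traps:' dmesg line, or None if the line does not match."""
    m = TRAP_RE.search(line)
    if m is None:
        return None
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    size = int(m["size"], 16)
    return {
        "comm": m["comm"],
        "pid": int(m["pid"]),
        "desc": m["desc"],
        "module": m["module"],
        # Offset of the faulting instruction from the start of the mapping.
        "offset": ip - base,
        # Sanity check: the instruction pointer should lie inside the mapping.
        "in_mapping": base <= ip < base + size,
    }


# The fault line copied from the dmesg output in this issue.
line = (
    "[497363.032294] traps: iree-e2e-matmul[2158766] general protection fault "
    "ip:74f917453420 sp:7fffb4a6d860 error:0 in amdvlk64.so[74f9167dc000+2951000]"
)
info = parse_trap(line)
print(info["module"], hex(info["offset"]))  # prints: amdvlk64.so 0xc77420
```

If the same offset shows up on repeated runs, that would suggest the fault is hit at a consistent point in the user-mode driver rather than at a random location.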

Steps to reproduce your issue

Run

ctest all -j8 --output-on-failure -R e2e_matmul_vulkan_f16_large_rdna3_vulkan

What component(s) does this issue relate to?

No response

Version information

IREE:

commit 6b7ca46399f00079033d3faaf3894d1a02bdcb6f (HEAD -> main, origin/main, origin/HEAD)
Merge: 274977cae0 e3b56d7758
Author: Scott Todd <[email protected]>
Date:   Wed Dec 11 10:02:28 2024 -0800

amdvlk: 2024.Q4.1 (LLPC)

Additional context

No response

@kuhar kuhar added bug 🐞 Something isn't working codegen/spirv SPIR-V code generation compiler backend hal/vulkan Runtime Vulkan GPU HAL backend labels Dec 11, 2024
kuhar added a commit to kuhar/iree that referenced this issue Dec 11, 2024
This test is known to crash the kernel driver. Removing it until we
investigate/fix.

Issue: iree-org#19465

Signed-off-by: Jakub Kuderski <[email protected]>
kuhar added a commit that referenced this issue Dec 11, 2024
This test is known to crash the kernel driver. Removing it until we
investigate/fix.

Issue: #19465