
Revert "[LLVMGPU][ROCm] Add MFMA_F32_16x16x4_F32 instruction" #17894

Merged
merged 1 commit into main from revert-17847-wmma_ab_f32_c_f32 on Jul 12, 2024

Conversation

ScottTodd
Member

Reverts #17847

This broke the SDXL ROCm pipeline tests on MI300, see #17847 (comment). The tests aren't showing error messages (`root:benchmark_sdxl_rocm.py:31 Command failed with error: b''`), so I can't easily tell what the issue is. nod-ai/SHARK-TestSuite#286 is filed to improve the situation there.

@ScottTodd
Member Author

Going to see if I can get logs locally while the CI sanity check runs on this.

@ScottTodd
Member Author

Hmm... seems like compilation succeeded but benchmarking failed. Possibly a runtime error from the driver if unsupported instructions were included or something. I don't have an easy way to check right now.

@ScottTodd ScottTodd merged commit 02c2000 into main Jul 12, 2024
55 checks passed
@ScottTodd ScottTodd deleted the revert-17847-wmma_ab_f32_c_f32 branch July 12, 2024 21:33
@pashu123
Contributor

@ScottTodd Do you have instructions to repro this?

@ScottTodd
Member Author

> @ScottTodd Do you have instructions to repro this?

The logs should (ideally) provide enough info: https://github.com/iree-org/iree/actions/runs/9911235296/job/27383808262#step:16:46

INFO root:benchmark_sdxl_rocm.py:25 ('Exec:', 'iree-compile /home/esaimana/actions-runner/_work/iree/iree/SHARK-TestSuite/iree_tests/benchmarks/sdxl/sdxl_pipeline_bench_f16.mlir --iree-hal-target-backends=rocm --iree-rocm-target-chip=gfx942 --iree-global-opt-propagate-transposes=true --iree-codegen-llvmgpu-use-vector-distribution --iree-codegen-gpu-native-math-precision=true --iree-rocm-waves-per-eu=2 --iree-opt-outer-dim-concat=true --iree-llvmgpu-enable-prefetch -o /home/esaimana/actions-runner/_work/iree/iree/SHARK-TestSuite/iree_tests/benchmarks/sdxl/sdxl_full_pipeline_fp16_rocm.vmfb')
INFO root:benchmark_sdxl_rocm.py:25 ('Exec:', 'iree-benchmark-module --device=hip://0 --device_allocator=caching --module=/home/esaimana/actions-runner/_work/iree/iree/SHARK-TestSuite/iree_tests/pytorch/models/sdxl-prompt-encoder-tank/model_gpu_rocm_real_weights.vmfb --parameters=model=/home/esaimana/actions-runner/_work/iree/iree/SHARK-TestSuite/iree_tests/pytorch/models/sdxl-prompt-encoder-tank/real_weights.irpa --module=/home/esaimana/actions-runner/_work/iree/iree/SHARK-TestSuite/iree_tests/pytorch/models/sdxl-scheduled-unet-3-tank/model_gpu_rocm_real_weights.vmfb --parameters=model=/home/esaimana/actions-runner/_work/iree/iree/SHARK-TestSuite/iree_tests/pytorch/models/sdxl-scheduled-unet-3-tank/real_weights.irpa --module=/home/esaimana/actions-runner/_work/iree/iree/SHARK-TestSuite/iree_tests/pytorch/models/sdxl-vae-decode-tank/model_gpu_rocm_real_weights.vmfb --parameters=model=/home/esaimana/actions-runner/_work/iree/iree/SHARK-TestSuite/iree_tests/pytorch/models/sdxl-vae-decode-tank/real_weights.irpa --module=/home/esaimana/actions-runner/_work/iree/iree/SHARK-TestSuite/iree_tests/benchmarks/sdxl/sdxl_full_pipeline_fp16_rocm.vmfb --function=tokens_to_image --input=1x4x128x128xf16 --input=1xf16 --input=1x64xi64 --input=1x64xi64 --input=1x64xi64 --input=1x64xi64 --benchmark_repetitions=10 --benchmark_min_warmup_time=3.0')

Source files might be split between https://github.com/nod-ai/SHARK-TestSuite/tree/4486c44a3d9e61dd20317fe1d23be71ff1610f32/iree_tests/benchmarks/sdxl and https://github.com/nod-ai/SHARK-TestSuite/tree/4486c44a3d9e61dd20317fe1d23be71ff1610f32/iree_tests/pytorch/models
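
If it helps, here is a minimal repro sketch based on those two commands, with the CI runner paths swapped for an assumed local SHARK-TestSuite checkout (the checkout location and the presence of the precompiled model .vmfb/.irpa files are assumptions, not part of the CI setup):

```python
import subprocess
from pathlib import Path

# Assumed local checkout of nod-ai/SHARK-TestSuite (hypothetical path; adjust as needed).
SUITE = Path.home() / "SHARK-TestSuite" / "iree_tests"
BENCH = SUITE / "benchmarks" / "sdxl"
MODELS = SUITE / "pytorch" / "models"

# Step 1: compile the full-pipeline MLIR for gfx942 with the same flags as the CI log.
subprocess.run(
    [
        "iree-compile",
        str(BENCH / "sdxl_pipeline_bench_f16.mlir"),
        "--iree-hal-target-backends=rocm",
        "--iree-rocm-target-chip=gfx942",
        "--iree-global-opt-propagate-transposes=true",
        "--iree-codegen-llvmgpu-use-vector-distribution",
        "--iree-codegen-gpu-native-math-precision=true",
        "--iree-rocm-waves-per-eu=2",
        "--iree-opt-outer-dim-concat=true",
        "--iree-llvmgpu-enable-prefetch",
        "-o", str(BENCH / "sdxl_full_pipeline_fp16_rocm.vmfb"),
    ],
    check=True,
)

# Step 2: benchmark the pipeline, loading the precompiled submodules and real weights
# (these .vmfb/.irpa files must already exist locally, as they do in the CI cache).
subprocess.run(
    [
        "iree-benchmark-module",
        "--device=hip://0",
        "--device_allocator=caching",
        f"--module={MODELS / 'sdxl-prompt-encoder-tank' / 'model_gpu_rocm_real_weights.vmfb'}",
        f"--parameters=model={MODELS / 'sdxl-prompt-encoder-tank' / 'real_weights.irpa'}",
        f"--module={MODELS / 'sdxl-scheduled-unet-3-tank' / 'model_gpu_rocm_real_weights.vmfb'}",
        f"--parameters=model={MODELS / 'sdxl-scheduled-unet-3-tank' / 'real_weights.irpa'}",
        f"--module={MODELS / 'sdxl-vae-decode-tank' / 'model_gpu_rocm_real_weights.vmfb'}",
        f"--parameters=model={MODELS / 'sdxl-vae-decode-tank' / 'real_weights.irpa'}",
        f"--module={BENCH / 'sdxl_full_pipeline_fp16_rocm.vmfb'}",
        "--function=tokens_to_image",
        "--input=1x4x128x128xf16",
        "--input=1xf16",
        "--input=1x64xi64",
        "--input=1x64xi64",
        "--input=1x64xi64",
        "--input=1x64xi64",
        "--benchmark_repetitions=10",
        "--benchmark_min_warmup_time=3.0",
    ],
    check=True,
)
```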

@ScottTodd
Member Author

I was also planning on fixing the script to print the right error logs and then getting a CI run with that PR again to check that logs appear.
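
Roughly the shape I have in mind, as a sketch only (hypothetical helper, not the actual change; the logger name and call sites are assumptions):

```python
import logging
import subprocess

logger = logging.getLogger(__name__)


def run_logged(command: list[str], **kwargs) -> str:
    """Runs a command, logging its output on failure instead of swallowing it."""
    process = subprocess.run(command, capture_output=True, text=True, **kwargs)
    if process.returncode != 0:
        # Log decoded stdout/stderr so the CI job shows the real failure, not just b''.
        logger.error("Command failed (exit %d): %s", process.returncode, " ".join(command))
        logger.error("stdout:\n%s", process.stdout)
        logger.error("stderr:\n%s", process.stderr)
        raise RuntimeError(f"Command failed: {' '.join(command)}")
    return process.stdout
```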

ScottTodd added a commit to ScottTodd/iree that referenced this pull request Jul 15, 2024
@ScottTodd
Member Author

Trying to get more info with #17907

@ScottTodd
Member Author

https://github.com/iree-org/iree/actions/runs/9944277339/job/27470515222?pr=17907#step:7:171

popenargs = (['iree-run-module', '--device=hip', '--module=/home/esaimana/iree_tests_cache/artifacts/sdxl_unet/model.rocm_gfx942.v...', '--module=/home/esaimana/iree_tests_cache/artifacts/sdxl_unet/sdxl_unet_pipeline_bench_f16.rocm_gfx942.vmfb', ...],)
kwargs = {'cwd': PosixPath('/home/esaimana/iree_tests_cache/artifacts/sdxl_unet'), 'stderr': -1, 'stdout': -1}
process = <Popen: returncode: 1 args: ['iree-run-module', '--device=hip', '--module=/h...>
stdout = b''
stderr = b"iree/runtime/src/iree/hal/drivers/hip/native_executable.c:186: INTERNAL; HIP driver error 'hipErrorNoBinaryForGpu' (...e/stable_diffusion_xl_base_1_0_PNDM_64_1024x1024_fp16_unet_3.mlir:1717:10; creating VM context; creating run context\n"

ScottTodd added a commit to ScottTodd/iree that referenced this pull request Jul 15, 2024
pashu123 added a commit to pashu123/iree that referenced this pull request Jul 16, 2024
pashu123 added a commit to pashu123/iree that referenced this pull request Jul 19, 2024
pashu123 added a commit that referenced this pull request Jul 24, 2024
LLITCHEV pushed a commit to LLITCHEV/iree that referenced this pull request Jul 30, 2024
Revert "[LLVMGPU][ROCm] Add MFMA_F32_16x16x4_F32 instruction" (iree-org#17894)

Reverts iree-org#17847

This broke SDXL rocm pipeline tests on mi300, see iree-org#17847 (comment). The tests aren't showing error messages (`root:benchmark_sdxl_rocm.py:31 Command failed with error: b''`) so I can't easily tell what the issue is, nod-ai/SHARK-TestSuite#286 is filed to improve the situation there.

Signed-off-by: Lubo Litchev <[email protected]>
LLITCHEV pushed a commit to LLITCHEV/iree that referenced this pull request Jul 30, 2024