
Revert "[LLVMGPU][ROCm] Add MFMA_F32_16x16x4_F32 instruction" #17894

Merged
merged 1 commit into main from revert-17847-wmma_ab_f32_c_f32 on Jul 12, 2024

Conversation

ScottTodd
Member

Reverts #17847

This broke the SDXL ROCm pipeline tests on MI300, see #17847 (comment). The tests aren't showing error messages (`root:benchmark_sdxl_rocm.py:31 Command failed with error: b''`), so I can't easily tell what the issue is. nod-ai/SHARK-TestSuite#286 is filed to improve the situation there.

@ScottTodd
Member Author

Going to see if I can get logs locally while the CI sanity check runs on this.

@ScottTodd
Member Author

Hmm... seems like compilation succeeded but benchmarking failed. Possibly a runtime error from the driver if unsupported instructions were included or something. I don't have an easy way to check right now.

@ScottTodd ScottTodd merged commit 02c2000 into main Jul 12, 2024
55 checks passed
@ScottTodd ScottTodd deleted the revert-17847-wmma_ab_f32_c_f32 branch July 12, 2024 21:33
@pashu123
Contributor

@ScottTodd Do you have instructions to repro this?

@ScottTodd
Member Author

> @ScottTodd Do you have instructions to repro this?

The logs should (ideally) provide enough info: https://github.com/iree-org/iree/actions/runs/9911235296/job/27383808262#step:16:46

INFO root:benchmark_sdxl_rocm.py:25 ('Exec:', 'iree-compile /home/esaimana/actions-runner/_work/iree/iree/SHARK-TestSuite/iree_tests/benchmarks/sdxl/sdxl_pipeline_bench_f16.mlir --iree-hal-target-backends=rocm --iree-rocm-target-chip=gfx942 --iree-global-opt-propagate-transposes=true --iree-codegen-llvmgpu-use-vector-distribution --iree-codegen-gpu-native-math-precision=true --iree-rocm-waves-per-eu=2 --iree-opt-outer-dim-concat=true --iree-llvmgpu-enable-prefetch -o /home/esaimana/actions-runner/_work/iree/iree/SHARK-TestSuite/iree_tests/benchmarks/sdxl/sdxl_full_pipeline_fp16_rocm.vmfb')
INFO root:benchmark_sdxl_rocm.py:25 ('Exec:', 'iree-benchmark-module --device=hip://0 --device_allocator=caching --module=/home/esaimana/actions-runner/_work/iree/iree/SHARK-TestSuite/iree_tests/pytorch/models/sdxl-prompt-encoder-tank/model_gpu_rocm_real_weights.vmfb --parameters=model=/home/esaimana/actions-runner/_work/iree/iree/SHARK-TestSuite/iree_tests/pytorch/models/sdxl-prompt-encoder-tank/real_weights.irpa --module=/home/esaimana/actions-runner/_work/iree/iree/SHARK-TestSuite/iree_tests/pytorch/models/sdxl-scheduled-unet-3-tank/model_gpu_rocm_real_weights.vmfb --parameters=model=/home/esaimana/actions-runner/_work/iree/iree/SHARK-TestSuite/iree_tests/pytorch/models/sdxl-scheduled-unet-3-tank/real_weights.irpa --module=/home/esaimana/actions-runner/_work/iree/iree/SHARK-TestSuite/iree_tests/pytorch/models/sdxl-vae-decode-tank/model_gpu_rocm_real_weights.vmfb --parameters=model=/home/esaimana/actions-runner/_work/iree/iree/SHARK-TestSuite/iree_tests/pytorch/models/sdxl-vae-decode-tank/real_weights.irpa --module=/home/esaimana/actions-runner/_work/iree/iree/SHARK-TestSuite/iree_tests/benchmarks/sdxl/sdxl_full_pipeline_fp16_rocm.vmfb --function=tokens_to_image --input=1x4x128x128xf16 --input=1xf16 --input=1x64xi64 --input=1x64xi64 --input=1x64xi64 --input=1x64xi64 --benchmark_repetitions=10 --benchmark_min_warmup_time=3.0')

Source files might be split between https://github.com/nod-ai/SHARK-TestSuite/tree/4486c44a3d9e61dd20317fe1d23be71ff1610f32/iree_tests/benchmarks/sdxl and https://github.com/nod-ai/SHARK-TestSuite/tree/4486c44a3d9e61dd20317fe1d23be71ff1610f32/iree_tests/pytorch/models
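
If it helps, here is a minimal repro sketch based on those two commands, with the CI runner paths swapped for an assumed local SHARK-TestSuite checkout (the checkout location and the presence of the precompiled model .vmfb/.irpa files are assumptions, not part of the CI setup):

```python
import subprocess
from pathlib import Path

# Assumed local checkout of nod-ai/SHARK-TestSuite (hypothetical path; adjust as needed).
SUITE = Path.home() / "SHARK-TestSuite" / "iree_tests"
BENCH = SUITE / "benchmarks" / "sdxl"
MODELS = SUITE / "pytorch" / "models"

# Step 1: compile the full-pipeline MLIR for gfx942 with the same flags as the CI log.
subprocess.run(
    [
        "iree-compile",
        str(BENCH / "sdxl_pipeline_bench_f16.mlir"),
        "--iree-hal-target-backends=rocm",
        "--iree-rocm-target-chip=gfx942",
        "--iree-global-opt-propagate-transposes=true",
        "--iree-codegen-llvmgpu-use-vector-distribution",
        "--iree-codegen-gpu-native-math-precision=true",
        "--iree-rocm-waves-per-eu=2",
        "--iree-opt-outer-dim-concat=true",
        "--iree-llvmgpu-enable-prefetch",
        "-o", str(BENCH / "sdxl_full_pipeline_fp16_rocm.vmfb"),
    ],
    check=True,
)

# Step 2: benchmark the pipeline, loading the precompiled submodules and real weights
# (these .vmfb/.irpa files must already exist locally, as they do in the CI cache).
subprocess.run(
    [
        "iree-benchmark-module",
        "--device=hip://0",
        "--device_allocator=caching",
        f"--module={MODELS / 'sdxl-prompt-encoder-tank' / 'model_gpu_rocm_real_weights.vmfb'}",
        f"--parameters=model={MODELS / 'sdxl-prompt-encoder-tank' / 'real_weights.irpa'}",
        f"--module={MODELS / 'sdxl-scheduled-unet-3-tank' / 'model_gpu_rocm_real_weights.vmfb'}",
        f"--parameters=model={MODELS / 'sdxl-scheduled-unet-3-tank' / 'real_weights.irpa'}",
        f"--module={MODELS / 'sdxl-vae-decode-tank' / 'model_gpu_rocm_real_weights.vmfb'}",
        f"--parameters=model={MODELS / 'sdxl-vae-decode-tank' / 'real_weights.irpa'}",
        f"--module={BENCH / 'sdxl_full_pipeline_fp16_rocm.vmfb'}",
        "--function=tokens_to_image",
        "--input=1x4x128x128xf16",
        "--input=1xf16",
        "--input=1x64xi64",
        "--input=1x64xi64",
        "--input=1x64xi64",
        "--input=1x64xi64",
        "--benchmark_repetitions=10",
        "--benchmark_min_warmup_time=3.0",
    ],
    check=True,
)
```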

@ScottTodd
Member Author

I was also planning on fixing the script to print the right error logs and then getting a CI run with that PR again to check that logs appear.
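
Roughly the shape I have in mind, as a sketch only (hypothetical helper, not the actual change; the logger name and call sites are assumptions):

```python
import logging
import subprocess

logger = logging.getLogger(__name__)


def run_logged(command: list[str], **kwargs) -> str:
    """Runs a command, logging its output on failure instead of swallowing it."""
    process = subprocess.run(command, capture_output=True, text=True, **kwargs)
    if process.returncode != 0:
        # Log decoded stdout/stderr so the CI job shows the real failure, not just b''.
        logger.error("Command failed (exit %d): %s", process.returncode, " ".join(command))
        logger.error("stdout:\n%s", process.stdout)
        logger.error("stderr:\n%s", process.stderr)
        raise RuntimeError(f"Command failed: {' '.join(command)}")
    return process.stdout
```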

ScottTodd added a commit to ScottTodd/iree that referenced this pull request Jul 15, 2024
@ScottTodd
Member Author

Trying to get more info with #17907

@ScottTodd
Member Author

https://github.com/iree-org/iree/actions/runs/9944277339/job/27470515222?pr=17907#step:7:171

popenargs = (['iree-run-module', '--device=hip', '--module=/home/esaimana/iree_tests_cache/artifacts/sdxl_unet/model.rocm_gfx942.v...', '--module=/home/esaimana/iree_tests_cache/artifacts/sdxl_unet/sdxl_unet_pipeline_bench_f16.rocm_gfx942.vmfb', ...],)
kwargs = {'cwd': PosixPath('/home/esaimana/iree_tests_cache/artifacts/sdxl_unet'), 'stderr': -1, 'stdout': -1}
process = <Popen: returncode: 1 args: ['iree-run-module', '--device=hip', '--module=/h...>
stdout = b''
stderr = b"iree/runtime/src/iree/hal/drivers/hip/native_executable.c:186: INTERNAL; HIP driver error 'hipErrorNoBinaryForGpu' (...e/stable_diffusion_xl_base_1_0_PNDM_64_1024x1024_fp16_unet_3.mlir:1717:10; creating VM context; creating run context\n"

ScottTodd added a commit to ScottTodd/iree that referenced this pull request Jul 15, 2024
pashu123 added a commit to pashu123/iree that referenced this pull request Jul 16, 2024
pashu123 added a commit to pashu123/iree that referenced this pull request Jul 19, 2024
pashu123 added a commit that referenced this pull request Jul 24, 2024
LLITCHEV pushed a commit to LLITCHEV/iree that referenced this pull request Jul 30, 2024
Revert "[LLVMGPU][ROCm] Add MFMA_F32_16x16x4_F32 instruction" (iree-org#17894)

Reverts iree-org#17847

This broke SDXL rocm pipeline tests on mi300, see iree-org#17847 (comment). The tests aren't showing error messages (`root:benchmark_sdxl_rocm.py:31 Command failed with error: b''`) so I can't easily tell what the issue is, nod-ai/SHARK-TestSuite#286 is filed to improve the situation there.

Signed-off-by: Lubo Litchev <[email protected]>
LLITCHEV pushed a commit to LLITCHEV/iree that referenced this pull request Jul 30, 2024