-
Notifications
You must be signed in to change notification settings - Fork 511
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ROCm] Fix launch dimension triplet for ROCm #19582
Conversation
071c2ba
and fix launch dimension tripletedf18ce
and fix launch dimension triplet
@olegshyshkov This patch tries to address some of the violations introduced in edf18ce |
I would prefer that we don't revert edf18ce. parallel_loop_emitter.cc is a very old part of the emitter that is used only for a handful special instruction, so I wouldn't use it as a ground of truth. That change was aimed for Nvidia GPU, but I understand that ROCm has different requirement. I think the solution here is to have a backend-specific logic for the check and to distribute blocks. |
@olegshyshkov I encountered a failure in jax maxtext model at parallel_loop_emitter that blocks.y > 1. The same would fail for nvidia as well. That was the reason for reverting. |
Interesting. Could you share an HLO snippet with the fusion that causes the failure? |
@olegshyshkov I am trying to get a minimal working hlo snippet to reproduce this error. I am facing problems with hlo_bisect as it is aborting for various reason. For now I am attaching the stack trace with run_hlo_module utility
Let me know if it helps, I will continue to get a working hlo snippet. |
@olegshyshkov Following snippet can cause the error
But I am not able to reproduce this on this branch. When I run run_hlo_module it gives llvm finger print error
But can be reproduced with |
Also On NVIDIA GPUs, you may have to increase the tensor size so as to exceed block_dim_limit().x in https://github.com/openxla/xla/blob/main/xla/service/gpu/launch_dimensions.cc#L45 |
@olegshyshkov were you able to reproduce the error? |
Sorry for the delayed reply!
This is not an LLVM finger print error. You're getting a I'm not sure how to reproduce that, because the output of the HLO reproduce that you have is 75GB. We allow about 80% of GPU memory to be used for the buffers, so on H100 with 80GB we have around 64GB. The way to go here would be to have backend-specific logic to distribute block_ids and threads along the dimensions. |
Launch dimension should be of the form ((block.x, 1, 1), (thread.x, thready, 1)) to accommodate checks in (parallel_loop_emitter.cc)[https://github.com/openxla/xla/blob/main/xla/service/gpu/parallel_loop_emitter.cc#L169-L171]
9b880c6
to
9a46402
Compare
edf18ce
and fix launch dimension triplet
@olegshyshkov I have updated this PR to accommodate platform specific changes for launch dims |
please have a check @xla-rotation thanks! |
return LaunchDimensions(se::BlockDim(num_blocks_x, num_blocks_y, 1), | ||
se::ThreadDim(threads_per_block, 1, 1)); | ||
return LaunchDimensions(se::BlockDim(num_blocks_x, num_blocks_y, 1), | ||
se::ThreadDim(threads_per_block, 1, 1));\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove the extra '\' at the end of the line.
Imported from GitHub PR #19582 Owing to checks in https://github.com/openxla/xla/blob/main/xla/service/gpu/parallel_loop_emitter.cc#L169-L171 launch dimension can be of the form ((block.x, 1, 1), (thread.x, thread.y, 1)). And in ROCm it is expected that (block.x * thread.x) <= 0xFFFFFFFF Copybara import of the project: -- 9a46402 by Harsha HS <[email protected]>: [ROCm] Fix kernel launch dimension Launch dimension should be of the form ((block.x, 1, 1), (thread.x, thready, 1)) to accommodate checks in (parallel_loop_emitter.cc)[https://github.com/openxla/xla/blob/main/xla/service/gpu/parallel_loop_emitter.cc#L169-L171] Merging this change closes #19582 FUTURE_COPYBARA_INTEGRATE_REVIEW=#19582 from ROCm:ci_fix_launch_dim_20241121 9a46402 PiperOrigin-RevId: 708559118
Imported from GitHub PR openxla/xla#19582 Owing to checks in https://github.com/openxla/xla/blob/main/xla/service/gpu/parallel_loop_emitter.cc#L169-L171 launch dimension can be of the form ((block.x, 1, 1), (thread.x, thread.y, 1)). And in ROCm it is expected that (block.x * thread.x) <= 0xFFFFFFFF Copybara import of the project: -- 9a46402b27bc8b32a2bc621ae2cab01e2c65f017 by Harsha HS <[email protected]>: [ROCm] Fix kernel launch dimension Launch dimension should be of the form ((block.x, 1, 1), (thread.x, thready, 1)) to accommodate checks in (parallel_loop_emitter.cc)[https://github.com/openxla/xla/blob/main/xla/service/gpu/parallel_loop_emitter.cc#L169-L171] Merging this change closes #19582 FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#19582 from ROCm:ci_fix_launch_dim_20241121 9a46402b27bc8b32a2bc621ae2cab01e2c65f017 PiperOrigin-RevId: 708559118
Imported from GitHub PR openxla/xla#19582 Owing to checks in https://github.com/openxla/xla/blob/main/xla/service/gpu/parallel_loop_emitter.cc#L169-L171 launch dimension can be of the form ((block.x, 1, 1), (thread.x, thread.y, 1)). And in ROCm it is expected that (block.x * thread.x) <= 0xFFFFFFFF Copybara import of the project: -- 9a46402b27bc8b32a2bc621ae2cab01e2c65f017 by Harsha HS <[email protected]>: [ROCm] Fix kernel launch dimension Launch dimension should be of the form ((block.x, 1, 1), (thread.x, thready, 1)) to accommodate checks in (parallel_loop_emitter.cc)[https://github.com/openxla/xla/blob/main/xla/service/gpu/parallel_loop_emitter.cc#L169-L171] Merging this change closes #19582 PiperOrigin-RevId: 709138523
FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#19582 from ROCm:ci_fix_launch_dim_20241121 9a46402b27bc8b32a2bc621ae2cab01e2c65f017 PiperOrigin-RevId: 709118438
Imported from GitHub PR openxla#19582 Owing to checks in https://github.com/openxla/xla/blob/main/xla/service/gpu/parallel_loop_emitter.cc#L169-L171 launch dimension can be of the form ((block.x, 1, 1), (thread.x, thread.y, 1)). And in ROCm it is expected that (block.x * thread.x) <= 0xFFFFFFFF Copybara import of the project: -- 9a46402 by Harsha HS <[email protected]>: [ROCm] Fix kernel launch dimension Launch dimension should be of the form ((block.x, 1, 1), (thread.x, thready, 1)) to accommodate checks in (parallel_loop_emitter.cc)[https://github.com/openxla/xla/blob/main/xla/service/gpu/parallel_loop_emitter.cc#L169-L171] Merging this change closes openxla#19582 COPYBARA_INTEGRATE_REVIEW=openxla#19582 from ROCm:ci_fix_launch_dim_20241121 9a46402 PiperOrigin-RevId: 709138523
Owing to checks in https://github.com/openxla/xla/blob/main/xla/service/gpu/parallel_loop_emitter.cc#L169-L171 launch dimension can be of the form ((block.x, 1, 1), (thread.x, thread.y, 1)). And in ROCm it is expected that (block.x * thread.x) <= 0xFFFFFFFF