[ROCm] Fix launch dimension triplet for ROCm #19582

Closed
hsharsha wants to merge 1 commit from ROCm:ci_fix_launch_dim_20241121

Conversation

hsharsha
Contributor

Owing to the checks in https://github.com/openxla/xla/blob/main/xla/service/gpu/parallel_loop_emitter.cc#L169-L171, the launch dimension can be of the form ((block.x, 1, 1), (thread.x, thread.y, 1)). On ROCm it is additionally expected that (block.x * thread.x) <= 0xFFFFFFFF.
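
A minimal standalone sketch of these two constraints (the Dim3 struct and LaunchDimsSupported are illustrative names, not XLA's actual API):

    // Hypothetical sketch, not XLA code: the launch-dimension shape accepted by
    // parallel_loop_emitter.cc plus the ROCm-specific 32-bit limit on block.x * thread.x.
    #include <cstdint>

    struct Dim3 { uint64_t x, y, z; };

    bool LaunchDimsSupported(const Dim3& block, const Dim3& thread, bool is_rocm) {
      // The emitter only spreads work over block.x, thread.x and thread.y,
      // so the remaining dimensions must stay 1.
      if (block.y != 1 || block.z != 1 || thread.z != 1) return false;
      // On ROCm the linearized x extent must also fit in 32 bits.
      if (is_rocm && block.x * thread.x > 0xFFFFFFFFull) return false;
      return true;
    }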

@hsharsha changed the title from "Revert 071c2ba and fix launch dimension triplet" to "Revert edf18ce and fix launch dimension triplet" on Nov 21, 2024
@hsharsha
Contributor Author

@olegshyshkov This patch tries to address some of the violations introduced in edf18ce

@olegshyshkov
Member

I would prefer that we don't revert edf18ce.

parallel_loop_emitter.cc is a very old part of the emitter that is used only for a handful of special instructions, so I wouldn't use it as the ground truth. That change was aimed at Nvidia GPUs, but I understand that ROCm has different requirements.

I think the solution here is to have backend-specific logic for the check and for distributing blocks.

@hsharsha
Contributor Author

hsharsha commented Nov 21, 2024

> I would prefer that we don't revert edf18ce.

@olegshyshkov I encountered a failure in the JAX MaxText model at parallel_loop_emitter where block.y > 1. The same would fail for NVIDIA as well. That was the reason for reverting.

@olegshyshkov
Member

Interesting. Could you share an HLO snippet with the fusion that causes the failure?

@hsharsha
Contributor Author

@olegshyshkov I am trying to get a minimal working HLO snippet to reproduce this error. I am facing problems with hlo_bisect as it is aborting for various reasons. For now I am attaching the stack trace from the run_hlo_module utility.

F0000 00:00:1732663728.638859 1261787 parallel_loop_emitter.cc:170] Check failed: launch_dimensions_.block_counts().y == 1 (4 vs. 1)
*** Check failure stack trace: ***
    @     0x557e7f5e437d  absl::lts_20230802::log_internal::LogMessageFatal::~LogMessageFatal()
    @     0x557e77cafc10  xla::gpu::ParallelLoopEmitter::EmitIndexAndSetExitBasicBlock()
    @     0x557e77cafea2  xla::gpu::ParallelLoopEmitter::EmitSerialLoop()
    @     0x557e77cb02f8  xla::gpu::ParallelLoopEmitter::EmitLoop()
    @     0x557e77cadd7d  xla::gpu::LoopFusion::EmitKernel()
    @     0x557e77e05023  xla::gpu::KernelFusionEmitterBase::Emit()::{lambda()#1}::operator()()
    @     0x557e77e05599  std::_Function_handler<>::_M_invoke()
    @     0x557e77e6f913  xla::gpu::KernelReuseCache::GetWithStatus()
    @     0x557e77e6ff0f  xla::gpu::KernelReuseCache::GetWithStatus()
    @     0x557e77e04361  xla::gpu::KernelFusionEmitterBase::Emit()
    @     0x557e77929e46  xla::gpu::IrEmitterUnnested::EmitFusion()
    @     0x557e7792a964  xla::gpu::IrEmitterUnnested::EmitHloInstruction()
    @     0x557e7792d21c  xla::gpu::IrEmitterUnnested::EmitHloComputation()
    @     0x557e7792de3b  xla::gpu::IrEmitterUnnested::EmitCommandBufferThunk()
    @     0x557e7792aa8a  xla::gpu::IrEmitterUnnested::EmitHloInstruction()
    @     0x557e7792d21c  xla::gpu::IrEmitterUnnested::EmitHloComputation()
    @     0x557e7769639b  xla::gpu::CompileModuleToLlvmIr()
    @     0x557e7768cdb4  xla::gpu::GpuCompiler::CompileToBackendResult()
    @     0x557e7768df40  xla::gpu::GpuCompiler::RunBackend()
    @     0x557e7b555e56  xla::LLVMCompiler::Compile()
    @     0x557e77615bae  xla::Compiler::Compile()
    @     0x557e77616595  xla::HloRunner::CreateExecutableWithBufferAssignment()
    @     0x557e7761ceae  xla::HloRunner::ExecuteWithMovedDeviceBuffersAndBufferAssignment()
    @     0x557e7761d202  xla::HloRunner::Execute()
    @     0x557e77620f82  xla::HloRunnerInterface::Execute()
    @     0x557e768fed58  xla::(anonymous namespace)::ExecuteWithRunner()
    @     0x557e768ff5d9  xla::(anonymous namespace)::RunAndCompareInternal()
    @     0x557e76902c2f  xla::RunAndCompare()
    @     0x557e7690335f  xla::RunAndCompare()
    @     0x557e767f3c5d  main

Let me know if this helps; I will continue trying to get a working HLO snippet.

@hsharsha
Contributor Author

@olegshyshkov The following snippet can cause the error:

HloModule jit_train_step, entry_computation_layout={(s32[])->bf16[80,7,8192,8192]}

fused_broadcast.3 {
  constant_135_2 = bf16[] constant(0)
  ROOT broadcast.599.1 = bf16[80,7,8192,8192]{3,2,1,0} broadcast(constant_135_2), dimensions={}
}


ENTRY command_buffer {
  p = s32[] parameter(0)
  ROOT loop_broadcast_fusion.3 = bf16[80,7,8192,8192]{3,2,1,0} fusion(), kind=kLoop, calls=fused_broadcast.3
} // command_buffer
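
(For scale, assuming a flat x-only launch: this broadcast produces 80 × 7 × 8192 × 8192 = 37,580,963,840 elements, about 75 GB in bf16, so block.x * thread.x would exceed the 0xFFFFFFFF limit by almost an order of magnitude.)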

But I am not able to reproduce this on this branch. When I run run_hlo_module, it gives an LLVM fingerprint error:

INFO: Running command line: bazel-bin/xla/tools/run_hlo_module '--platform=GPU' '--reference_platform=default' '--xla_disable_all_hlo_passes=true' /path/to/trim.txt

 ** Running /home/hahavanu/xla/trim.txt**
Running HLO module with runner ROCM...
2024-11-27 17:26:31.539523: I xla/service/llvm_ir/llvm_command_line_options.cc:50] XLA (re)initializing LLVM with options fingerprint: 6139028976987762984
Segmentation fault (core dumped)

But it can be reproduced with the run_hlo_module utility on this branch: https://github.com/ROCm/xla/tree/ci_rv_mt_fix_launch_dims

@hsharsha
Contributor Author

Also, on NVIDIA GPUs you may have to increase the tensor size so that the block count exceeds block_dim_limit().x in https://github.com/openxla/xla/blob/main/xla/service/gpu/launch_dimensions.cc#L45

@hsharsha
Contributor Author

hsharsha commented Dec 5, 2024

@olegshyshkov were you able to reproduce the error?

@olegshyshkov
Member

Sorry for the delayed reply!

> But I am not able to reproduce this on this branch. When I run run_hlo_module, it gives an LLVM fingerprint error

This is not an LLVM fingerprint error. You're getting a Segmentation fault (core dumped).

I'm not sure how to reproduce that, because the output of the HLO reproducer you have is 75 GB. We allow about 80% of GPU memory to be used for buffers, so on an H100 with 80 GB we have around 64 GB.

The way to go here would be to have backend-specific logic to distribute block_ids and threads along the dimensions.
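
As a rough illustration of what that could look like (a hypothetical sketch with made-up names, not the actual launch_dimensions.cc logic), the total block count could be split between block.x and block.y under a backend-specific cap:

    // Hypothetical sketch: split blocks so that block.x stays below a backend-specific
    // cap. On ROCm the cap keeps block.x * thread.x within 32 bits; on CUDA it is the
    // device's grid-x limit. A real implementation would also respect the grid-y limit.
    #include <cstdint>

    struct BlockSplit { uint64_t x; uint64_t y; };

    BlockSplit DistributeBlocks(uint64_t total_blocks, uint64_t threads_per_block,
                                bool is_rocm, uint64_t cuda_grid_x_limit) {
      const uint64_t max_blocks_x =
          is_rocm ? 0xFFFFFFFFull / threads_per_block : cuda_grid_x_limit;
      if (total_blocks <= max_blocks_x) return {total_blocks, 1};
      // Spill the excess into block.y, rounding up so every element is covered.
      const uint64_t blocks_y = (total_blocks + max_blocks_x - 1) / max_blocks_x;
      const uint64_t blocks_x = (total_blocks + blocks_y - 1) / blocks_y;
      return {blocks_x, blocks_y};
    }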

Launch dimension should be of the form
((block.x, 1, 1), (thread.x, thread.y, 1)) to accommodate checks in
[parallel_loop_emitter.cc](https://github.com/openxla/xla/blob/main/xla/service/gpu/parallel_loop_emitter.cc#L169-L171)
@hsharsha force-pushed the ci_fix_launch_dim_20241121 branch from 9b880c6 to 9a46402 on December 16, 2024 15:32
@hsharsha changed the title from "Revert edf18ce and fix launch dimension triplet" to "[ROCm] Fix launch dimension triplet for ROCm" on Dec 16, 2024
@hsharsha
Contributor Author

@olegshyshkov I have updated this PR to accommodate platform-specific changes for the launch dims.

@i-chaochen
Contributor

Please take a look, @xla-rotation.

thanks!

return LaunchDimensions(se::BlockDim(num_blocks_x, num_blocks_y, 1),
                        se::ThreadDim(threads_per_block, 1, 1));
return LaunchDimensions(se::BlockDim(num_blocks_x, num_blocks_y, 1),
                        se::ThreadDim(threads_per_block, 1, 1));\
Member

Please remove the extra '\' at the end of the line.

copybara-service bot pushed a commit that referenced this pull request Dec 23, 2024
Imported from GitHub PR #19582

Owing to the checks in https://github.com/openxla/xla/blob/main/xla/service/gpu/parallel_loop_emitter.cc#L169-L171, the launch dimension can be of the form ((block.x, 1, 1), (thread.x, thread.y, 1)). On ROCm it is additionally expected that (block.x * thread.x) <= 0xFFFFFFFF.
Copybara import of the project:

--
9a46402 by Harsha HS <[email protected]>:

[ROCm] Fix kernel launch dimension

Launch dimension should be of the form
((block.x, 1, 1), (thread.x, thread.y, 1)) to accommodate checks in
[parallel_loop_emitter.cc](https://github.com/openxla/xla/blob/main/xla/service/gpu/parallel_loop_emitter.cc#L169-L171)

Merging this change closes #19582

FUTURE_COPYBARA_INTEGRATE_REVIEW=#19582 from ROCm:ci_fix_launch_dim_20241121 9a46402
PiperOrigin-RevId: 708559118
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Dec 23, 2024 (PiperOrigin-RevId: 708559118)
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Dec 23, 2024 (PiperOrigin-RevId: 709138523)
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Dec 23, 2024 (PiperOrigin-RevId: 709118438)
hsharsha added a commit to ROCm/xla that referenced this pull request Jan 22, 2025
Imported from GitHub PR openxla#19582

Owing to the checks in https://github.com/openxla/xla/blob/main/xla/service/gpu/parallel_loop_emitter.cc#L169-L171, the launch dimension can be of the form ((block.x, 1, 1), (thread.x, thread.y, 1)). On ROCm it is additionally expected that (block.x * thread.x) <= 0xFFFFFFFF.
Copybara import of the project:

--
9a46402 by Harsha HS <[email protected]>:

[ROCm] Fix kernel launch dimension

Launch dimension should be of the form
((block.x, 1, 1), (thread.x, thread.y, 1)) to accommodate checks in
[parallel_loop_emitter.cc](https://github.com/openxla/xla/blob/main/xla/service/gpu/parallel_loop_emitter.cc#L169-L171)

Merging this change closes openxla#19582

COPYBARA_INTEGRATE_REVIEW=openxla#19582 from ROCm:ci_fix_launch_dim_20241121 9a46402
PiperOrigin-RevId: 709138523