[ROCm] Fix launch dimension triplet for ROCm #19582

Closed
hsharsha wants to merge 1 commit from ROCm:ci_fix_launch_dim_20241121

Conversation

hsharsha
Contributor

Owing to the checks in https://github.com/openxla/xla/blob/main/xla/service/gpu/parallel_loop_emitter.cc#L169-L171, the launch dimension can be of the form ((block.x, 1, 1), (thread.x, thread.y, 1)). On ROCm it is additionally expected that (block.x * thread.x) <= 0xFFFFFFFF.
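
A minimal standalone sketch of these two constraints (the Dim3 struct and LaunchDimsSupported are illustrative names, not XLA's actual API):

    // Hypothetical sketch, not XLA code: the launch-dimension shape accepted by
    // parallel_loop_emitter.cc plus the ROCm-specific 32-bit limit on block.x * thread.x.
    #include <cstdint>

    struct Dim3 { uint64_t x, y, z; };

    bool LaunchDimsSupported(const Dim3& block, const Dim3& thread, bool is_rocm) {
      // The emitter only spreads work over block.x, thread.x and thread.y,
      // so the remaining dimensions must stay 1.
      if (block.y != 1 || block.z != 1 || thread.z != 1) return false;
      // On ROCm the linearized x extent must also fit in 32 bits.
      if (is_rocm && block.x * thread.x > 0xFFFFFFFFull) return false;
      return true;
    }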

@hsharsha changed the title from "Revert 071c2ba and fix launch dimension triplet" to "Revert edf18ce and fix launch dimension triplet" on Nov 21, 2024
@hsharsha
Contributor Author

@olegshyshkov This patch tries to address some of the violations introduced in edf18ce

@olegshyshkov
Member

I would prefer that we don't revert edf18ce.

parallel_loop_emitter.cc is a very old part of the emitter that is used only for a handful of special instructions, so I wouldn't use it as the ground truth. That change was aimed at Nvidia GPUs, but I understand that ROCm has different requirements.

I think the solution here is to have backend-specific logic for the check and for distributing blocks.

@hsharsha
Contributor Author

hsharsha commented Nov 21, 2024

> I would prefer that we don't revert edf18ce.

@olegshyshkov I encountered a failure in the JAX MaxText model at parallel_loop_emitter where block.y > 1. The same would fail for NVIDIA as well. That was the reason for reverting.

@olegshyshkov
Member

Interesting. Could you share an HLO snippet with the fusion that causes the failure?

@hsharsha
Contributor Author

@olegshyshkov I am trying to get a minimal working HLO snippet to reproduce this error. I am facing problems with hlo_bisect as it is aborting for various reasons. For now I am attaching the stack trace from the run_hlo_module utility.

F0000 00:00:1732663728.638859 1261787 parallel_loop_emitter.cc:170] Check failed: launch_dimensions_.block_counts().y == 1 (4 vs. 1)
*** Check failure stack trace: ***
    @     0x557e7f5e437d  absl::lts_20230802::log_internal::LogMessageFatal::~LogMessageFatal()
    @     0x557e77cafc10  xla::gpu::ParallelLoopEmitter::EmitIndexAndSetExitBasicBlock()
    @     0x557e77cafea2  xla::gpu::ParallelLoopEmitter::EmitSerialLoop()
    @     0x557e77cb02f8  xla::gpu::ParallelLoopEmitter::EmitLoop()
    @     0x557e77cadd7d  xla::gpu::LoopFusion::EmitKernel()
    @     0x557e77e05023  xla::gpu::KernelFusionEmitterBase::Emit()::{lambda()#1}::operator()()
    @     0x557e77e05599  std::_Function_handler<>::_M_invoke()
    @     0x557e77e6f913  xla::gpu::KernelReuseCache::GetWithStatus()
    @     0x557e77e6ff0f  xla::gpu::KernelReuseCache::GetWithStatus()
    @     0x557e77e04361  xla::gpu::KernelFusionEmitterBase::Emit()
    @     0x557e77929e46  xla::gpu::IrEmitterUnnested::EmitFusion()
    @     0x557e7792a964  xla::gpu::IrEmitterUnnested::EmitHloInstruction()
    @     0x557e7792d21c  xla::gpu::IrEmitterUnnested::EmitHloComputation()
    @     0x557e7792de3b  xla::gpu::IrEmitterUnnested::EmitCommandBufferThunk()
    @     0x557e7792aa8a  xla::gpu::IrEmitterUnnested::EmitHloInstruction()
    @     0x557e7792d21c  xla::gpu::IrEmitterUnnested::EmitHloComputation()
    @     0x557e7769639b  xla::gpu::CompileModuleToLlvmIr()
    @     0x557e7768cdb4  xla::gpu::GpuCompiler::CompileToBackendResult()
    @     0x557e7768df40  xla::gpu::GpuCompiler::RunBackend()
    @     0x557e7b555e56  xla::LLVMCompiler::Compile()
    @     0x557e77615bae  xla::Compiler::Compile()
    @     0x557e77616595  xla::HloRunner::CreateExecutableWithBufferAssignment()
    @     0x557e7761ceae  xla::HloRunner::ExecuteWithMovedDeviceBuffersAndBufferAssignment()
    @     0x557e7761d202  xla::HloRunner::Execute()
    @     0x557e77620f82  xla::HloRunnerInterface::Execute()
    @     0x557e768fed58  xla::(anonymous namespace)::ExecuteWithRunner()
    @     0x557e768ff5d9  xla::(anonymous namespace)::RunAndCompareInternal()
    @     0x557e76902c2f  xla::RunAndCompare()
    @     0x557e7690335f  xla::RunAndCompare()
    @     0x557e767f3c5d  main

Let me know if this helps; I will continue trying to get a working HLO snippet.

@hsharsha
Contributor Author

@olegshyshkov The following snippet can cause the error:

HloModule jit_train_step, entry_computation_layout={(s32[])->bf16[80,7,8192,8192]}

fused_broadcast.3 {
  constant_135_2 = bf16[] constant(0)
  ROOT broadcast.599.1 = bf16[80,7,8192,8192]{3,2,1,0} broadcast(constant_135_2), dimensions={}
}


ENTRY command_buffer {
  p = s32[] parameter(0)
  ROOT loop_broadcast_fusion.3 = bf16[80,7,8192,8192]{3,2,1,0} fusion(), kind=kLoop, calls=fused_broadcast.3
} // command_buffer
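
(For scale, assuming a flat x-only launch: this broadcast produces 80 × 7 × 8192 × 8192 = 37,580,963,840 elements, about 75 GB in bf16, so block.x * thread.x would exceed the 0xFFFFFFFF limit by almost an order of magnitude.)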

But I am not able to reproduce this on this branch. When I run run_hlo_module, it gives an LLVM fingerprint error:

INFO: Running command line: bazel-bin/xla/tools/run_hlo_module '--platform=GPU' '--reference_platform=default' '--xla_disable_all_hlo_passes=true' /path/to/trim.txt

 ** Running /home/hahavanu/xla/trim.txt**
Running HLO module with runner ROCM...
2024-11-27 17:26:31.539523: I xla/service/llvm_ir/llvm_command_line_options.cc:50] XLA (re)initializing LLVM with options fingerprint: 6139028976987762984
Segmentation fault (core dumped)

But it can be reproduced with the run_hlo_module utility on this branch: https://github.com/ROCm/xla/tree/ci_rv_mt_fix_launch_dims

@hsharsha
Contributor Author

Also, on NVIDIA GPUs you may have to increase the tensor size so that the block count exceeds block_dim_limit().x in https://github.com/openxla/xla/blob/main/xla/service/gpu/launch_dimensions.cc#L45

@hsharsha
Contributor Author

hsharsha commented Dec 5, 2024

@olegshyshkov were you able to reproduce the error?

@olegshyshkov
Member

Sorry for the delayed reply!

> But I am not able to reproduce this on this branch. When I run run_hlo_module, it gives an LLVM fingerprint error

This is not an LLVM fingerprint error. You're getting a Segmentation fault (core dumped).

I'm not sure how to reproduce that, because the output of the HLO reproducer you have is 75 GB. We allow about 80% of GPU memory to be used for buffers, so on an H100 with 80 GB we have around 64 GB.

The way to go here would be to have backend-specific logic to distribute block_ids and threads along the dimensions.
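
As a rough illustration of what that could look like (a hypothetical sketch with made-up names, not the actual launch_dimensions.cc logic), the total block count could be split between block.x and block.y under a backend-specific cap:

    // Hypothetical sketch: split blocks so that block.x stays below a backend-specific
    // cap. On ROCm the cap keeps block.x * thread.x within 32 bits; on CUDA it is the
    // device's grid-x limit. A real implementation would also respect the grid-y limit.
    #include <cstdint>

    struct BlockSplit { uint64_t x; uint64_t y; };

    BlockSplit DistributeBlocks(uint64_t total_blocks, uint64_t threads_per_block,
                                bool is_rocm, uint64_t cuda_grid_x_limit) {
      const uint64_t max_blocks_x =
          is_rocm ? 0xFFFFFFFFull / threads_per_block : cuda_grid_x_limit;
      if (total_blocks <= max_blocks_x) return {total_blocks, 1};
      // Spill the excess into block.y, rounding up so every element is covered.
      const uint64_t blocks_y = (total_blocks + max_blocks_x - 1) / max_blocks_x;
      const uint64_t blocks_x = (total_blocks + blocks_y - 1) / blocks_y;
      return {blocks_x, blocks_y};
    }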

Launch dimension should be of the form
((block.x, 1, 1), (thread.x, thread.y, 1)) to accommodate checks in
[parallel_loop_emitter.cc](https://github.com/openxla/xla/blob/main/xla/service/gpu/parallel_loop_emitter.cc#L169-L171)
@hsharsha force-pushed the ci_fix_launch_dim_20241121 branch from 9b880c6 to 9a46402 on December 16, 2024 15:32
@hsharsha changed the title from "Revert edf18ce and fix launch dimension triplet" to "[ROCm] Fix launch dimension triplet for ROCm" on Dec 16, 2024
@hsharsha
Contributor Author

@olegshyshkov I have updated this PR to accommodate platform-specific changes for the launch dims.

@i-chaochen
Contributor

Please take a look, @xla-rotation.

thanks!

return LaunchDimensions(se::BlockDim(num_blocks_x, num_blocks_y, 1),
                        se::ThreadDim(threads_per_block, 1, 1));
return LaunchDimensions(se::BlockDim(num_blocks_x, num_blocks_y, 1),
                        se::ThreadDim(threads_per_block, 1, 1));\
Member

Please remove the extra '\' at the end of the line.

copybara-service bot pushed a commit that referenced this pull request Dec 23, 2024
Imported from GitHub PR #19582

Owing to the checks in https://github.com/openxla/xla/blob/main/xla/service/gpu/parallel_loop_emitter.cc#L169-L171, the launch dimension can be of the form ((block.x, 1, 1), (thread.x, thread.y, 1)). On ROCm it is additionally expected that (block.x * thread.x) <= 0xFFFFFFFF.
Copybara import of the project:

--
9a46402 by Harsha HS <[email protected]>:

[ROCm] Fix kernel launch dimension

Launch dimension should be of the form
((block.x, 1, 1), (thread.x, thread.y, 1)) to accommodate checks in
[parallel_loop_emitter.cc](https://github.com/openxla/xla/blob/main/xla/service/gpu/parallel_loop_emitter.cc#L169-L171)

Merging this change closes #19582

FUTURE_COPYBARA_INTEGRATE_REVIEW=#19582 from ROCm:ci_fix_launch_dim_20241121 9a46402
PiperOrigin-RevId: 708559118
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Dec 23, 2024 (PiperOrigin-RevId: 708559118)
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Dec 23, 2024 (PiperOrigin-RevId: 709138523)
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Dec 23, 2024 (PiperOrigin-RevId: 709118438)
hsharsha added a commit to ROCm/xla that referenced this pull request Jan 22, 2025
Imported from GitHub PR openxla#19582

Owing to the checks in https://github.com/openxla/xla/blob/main/xla/service/gpu/parallel_loop_emitter.cc#L169-L171, the launch dimension can be of the form ((block.x, 1, 1), (thread.x, thread.y, 1)). On ROCm it is additionally expected that (block.x * thread.x) <= 0xFFFFFFFF.
Copybara import of the project:

--
9a46402 by Harsha HS <[email protected]>:

[ROCm] Fix kernel launch dimension

Launch dimension should be of the form
((block.x, 1, 1), (thread.x, thread.y, 1)) to accommodate checks in
[parallel_loop_emitter.cc](https://github.com/openxla/xla/blob/main/xla/service/gpu/parallel_loop_emitter.cc#L169-L171)

Merging this change closes openxla#19582

COPYBARA_INTEGRATE_REVIEW=openxla#19582 from ROCm:ci_fix_launch_dim_20241121 9a46402
PiperOrigin-RevId: 709138523