
Revert edf18ce and fix launch dimension triplet #19582

Open
wants to merge 2 commits into main

Conversation

hsharsha
Contributor

Owing to the checks in https://github.com/openxla/xla/blob/main/xla/service/gpu/parallel_loop_emitter.cc#L169-L171, the launch dimensions can only be of the form ((block.x, 1, 1), (thread.x, thread.y, 1)). In addition, on ROCm it is expected that (block.x * thread.x) <= 0xFFFFFFFF.

…han the hardware limit per dim."

This reverts commit edf18ce.
Launch dimensions should be of the form
((block.x, 1, 1), (thread.x, thread.y, 1)) to accommodate the checks in
[parallel_loop_emitter.cc](https://github.com/openxla/xla/blob/main/xla/service/gpu/parallel_loop_emitter.cc#L169-L171)
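To make the constraint concrete, here is a minimal C++ sketch of the two conditions the launch dimensions have to satisfy; the struct and function names are illustrative only, not the XLA API:

```cpp
// Minimal sketch, not the XLA API: Dim3, LaunchDims and the two predicates
// below are hypothetical names used only for illustration.
#include <cstdint>

struct Dim3 { uint64_t x = 1, y = 1, z = 1; };
struct LaunchDims {
  Dim3 block_counts;
  Dim3 thread_counts;
};

// parallel_loop_emitter.cc only iterates over block.x, thread.x and thread.y,
// so the remaining dimensions must stay at 1.
bool SatisfiesLoopEmitterShape(const LaunchDims& d) {
  return d.block_counts.y == 1 && d.block_counts.z == 1 &&
         d.thread_counts.z == 1;
}

// ROCm expectation described above: block.x * thread.x must fit in 32 bits.
bool SatisfiesRocmLimit(const LaunchDims& d) {
  return d.block_counts.x * d.thread_counts.x <= 0xFFFFFFFFull;
}
```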
@hsharsha changed the title from "Revert 071c2ba and fix launch dimension triplet" to "Revert edf18ce and fix launch dimension triplet" on Nov 21, 2024
@hsharsha
Contributor Author

@olegshyshkov This patch tries to address some of the violations introduced in edf18ce

@olegshyshkov
Member

I would prefer that we don't revert edf18ce.

parallel_loop_emitter.cc is a very old part of the emitter that is used only for a handful of special instructions, so I wouldn't use it as the ground truth. That change was aimed at Nvidia GPUs, but I understand that ROCm has different requirements.

I think the solution here is to have backend-specific logic for the check and for distributing blocks.
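For illustration only, one way such backend-specific block distribution could be expressed; the enum, function name, and limit parameter are hypothetical, not a proposal for the actual XLA interface:

```cpp
// Rough sketch of backend-specific block distribution; GpuBackend,
// DistributeBlocks and block_dim_limit_x are hypothetical names.
#include <cstdint>

enum class GpuBackend { kCuda, kRocm };

struct Dim3 { uint64_t x = 1, y = 1, z = 1; };

// Split `num_blocks` according to the backend's constraints: CUDA may spill
// the excess over `block_dim_limit_x` into block.y, while ROCm keeps a flat
// block.x so that the loop emitter's block.y == 1 expectation and the
// 32-bit block.x * thread.x limit can both be enforced in one place.
Dim3 DistributeBlocks(GpuBackend backend, uint64_t num_blocks,
                      uint64_t block_dim_limit_x) {
  Dim3 blocks;
  if (backend == GpuBackend::kCuda && num_blocks > block_dim_limit_x) {
    blocks.x = block_dim_limit_x;
    blocks.y = (num_blocks + block_dim_limit_x - 1) / block_dim_limit_x;
  } else {
    blocks.x = num_blocks;
  }
  return blocks;
}
```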

@hsharsha
Contributor Author

hsharsha commented Nov 21, 2024

> I would prefer that we don't revert edf18ce.

@olegshyshkov I encountered a failure in the JAX MaxText model at parallel_loop_emitter where block.y > 1. The same would fail on NVIDIA as well. That was the reason for reverting.

@olegshyshkov
Member

Interesting. Could you share an HLO snippet with the fusion that causes the failure?

@hsharsha
Contributor Author

@olegshyshkov I am trying to get a minimal HLO snippet that reproduces this error. I am facing problems with hlo_bisect, as it aborts for various reasons. For now I am attaching the stack trace from the run_hlo_module utility.

F0000 00:00:1732663728.638859 1261787 parallel_loop_emitter.cc:170] Check failed: launch_dimensions_.block_counts().y == 1 (4 vs. 1)
*** Check failure stack trace: ***
    @     0x557e7f5e437d  absl::lts_20230802::log_internal::LogMessageFatal::~LogMessageFatal()
    @     0x557e77cafc10  xla::gpu::ParallelLoopEmitter::EmitIndexAndSetExitBasicBlock()
    @     0x557e77cafea2  xla::gpu::ParallelLoopEmitter::EmitSerialLoop()
    @     0x557e77cb02f8  xla::gpu::ParallelLoopEmitter::EmitLoop()
    @     0x557e77cadd7d  xla::gpu::LoopFusion::EmitKernel()
    @     0x557e77e05023  xla::gpu::KernelFusionEmitterBase::Emit()::{lambda()#1}::operator()()
    @     0x557e77e05599  std::_Function_handler<>::_M_invoke()
    @     0x557e77e6f913  xla::gpu::KernelReuseCache::GetWithStatus()
    @     0x557e77e6ff0f  xla::gpu::KernelReuseCache::GetWithStatus()
    @     0x557e77e04361  xla::gpu::KernelFusionEmitterBase::Emit()
    @     0x557e77929e46  xla::gpu::IrEmitterUnnested::EmitFusion()
    @     0x557e7792a964  xla::gpu::IrEmitterUnnested::EmitHloInstruction()
    @     0x557e7792d21c  xla::gpu::IrEmitterUnnested::EmitHloComputation()
    @     0x557e7792de3b  xla::gpu::IrEmitterUnnested::EmitCommandBufferThunk()
    @     0x557e7792aa8a  xla::gpu::IrEmitterUnnested::EmitHloInstruction()
    @     0x557e7792d21c  xla::gpu::IrEmitterUnnested::EmitHloComputation()
    @     0x557e7769639b  xla::gpu::CompileModuleToLlvmIr()
    @     0x557e7768cdb4  xla::gpu::GpuCompiler::CompileToBackendResult()
    @     0x557e7768df40  xla::gpu::GpuCompiler::RunBackend()
    @     0x557e7b555e56  xla::LLVMCompiler::Compile()
    @     0x557e77615bae  xla::Compiler::Compile()
    @     0x557e77616595  xla::HloRunner::CreateExecutableWithBufferAssignment()
    @     0x557e7761ceae  xla::HloRunner::ExecuteWithMovedDeviceBuffersAndBufferAssignment()
    @     0x557e7761d202  xla::HloRunner::Execute()
    @     0x557e77620f82  xla::HloRunnerInterface::Execute()
    @     0x557e768fed58  xla::(anonymous namespace)::ExecuteWithRunner()
    @     0x557e768ff5d9  xla::(anonymous namespace)::RunAndCompareInternal()
    @     0x557e76902c2f  xla::RunAndCompare()
    @     0x557e7690335f  xla::RunAndCompare()
    @     0x557e767f3c5d  main

Let me know if this helps; I will continue working on a minimal HLO snippet.

@hsharsha
Contributor Author

@olegshyshkov The following snippet can cause the error:

HloModule jit_train_step, entry_computation_layout={(s32[])->bf16[80,7,8192,8192]}

fused_broadcast.3 {
  constant_135_2 = bf16[] constant(0)
  ROOT broadcast.599.1 = bf16[80,7,8192,8192]{3,2,1,0} broadcast(constant_135_2), dimensions={}
}


ENTRY command_buffer {
  p = s32[] parameter(0)
  ROOT loop_broadcast_fusion.3 = bf16[80,7,8192,8192]{3,2,1,0} fusion(), kind=kLoop, calls=fused_broadcast.3
} // command_buffer

However, I am not able to reproduce this on this branch. When I run run_hlo_module, it logs the LLVM options fingerprint and then crashes with a segmentation fault:

INFO: Running command line: bazel-bin/xla/tools/run_hlo_module '--platform=GPU' '--reference_platform=default' '--xla_disable_all_hlo_passes=true' /path/to/trim.txt

 ** Running /home/hahavanu/xla/trim.txt**
Running HLO module with runner ROCM...
2024-11-27 17:26:31.539523: I xla/service/llvm_ir/llvm_command_line_options.cc:50] XLA (re)initializing LLVM with options fingerprint: 6139028976987762984
Segmentation fault (core dumped)

It can be reproduced with the run_hlo_module utility on this branch: https://github.com/ROCm/xla/tree/ci_rv_mt_fix_launch_dims

@hsharsha
Contributor Author

Also, on NVIDIA GPUs you may have to increase the tensor size so that the block count exceeds block_dim_limit().x in https://github.com/openxla/xla/blob/main/xla/service/gpu/launch_dimensions.cc#L45
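As a back-of-the-envelope illustration (the thread count and limit below are assumptions, not the values XLA actually picks): with the bf16[80,7,8192,8192] shape from the repro and 128 threads per block, the block count stays below a 2^31 - 1 per-dimension limit, which is why a larger tensor may be needed on NVIDIA to force the split into block.y.

```cpp
// Back-of-the-envelope check; threads_per_block and block_dim_limit_x are
// assumed values for illustration, not the ones used by XLA.
#include <cstdint>
#include <cstdio>

int main() {
  const uint64_t elements = 80ull * 7 * 8192 * 8192;    // bf16[80,7,8192,8192]
  const uint64_t threads_per_block = 128;               // assumed
  const uint64_t block_dim_limit_x = (1ull << 31) - 1;  // assumed NVIDIA limit

  const uint64_t blocks =
      (elements + threads_per_block - 1) / threads_per_block;
  // Prints "blocks: 293601280 limit: 2147483647 split needed: no", so the
  // tensor has to grow before block.y > 1 is required under these assumptions.
  std::printf("blocks: %llu limit: %llu split needed: %s\n",
              static_cast<unsigned long long>(blocks),
              static_cast<unsigned long long>(block_dim_limit_x),
              blocks > block_dim_limit_x ? "yes" : "no");
  return 0;
}
```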
