[RFC] Improve windowed einsum(collective matmul) for XLA:GPU #8865

Tixxx · 2024-01-26T20:26:51Z

Tixxx
Jan 26, 2024
Collaborator

Motivation

Intra-layer model parallelism or tensor parallelism has become an efficient sharding strategy for training LLMs. This strategy is generally implemented by distributing MLP layers to multiple devices, performing AllGather before MLP gemm and performing ReduceScatter after MLP gemm. However, with increasingly large features and hidden dims, this strategy can quickly be bottlenecked by the lack of overlap between collective and compute since only one can happen at a time. In order to break through the bottleneck, a collective matmul is proposed in this paper to further partition large gemms into multiple parts, inject collective_permutes between each partition for pipelined execution and update their corresponding partitions in the result matrix. This optimization has been implemented for TPU and shown performance improvements on some models. We think having this optimization available for XLA GPU pipeline will also be beneficial.

High-level Design

HLO Graph

The current HandleDot has EmitWindowedDotGeneral function in the SpmdPartitioningVisitor which provides logic to rewrite AllGather+gemm or gemm+reduceScatter into a while loop with trip count equal to number of partitions. For example, a full HLO Allgather+gemm pattern before pattern matching:

HloModule pjit__unnamed_wrapped_function_, entry_computation_layout={(bf16[1,1024,24576]{2,1,0}, bf16[24576,24576]{1,0})->bf16[1,1024,98304]{2,1,0}}, num_partitions=4

ENTRY main.7_spmd {
  param = bf16[1,1024,24576]{2,1,0} parameter(0), sharding={devices=[1,4,1]<=[4]}
  reshape.2 = bf16[1024,24576]{1,0} reshape(param)
  all-gather = bf16[4096,24576]{1,0} all-gather(reshape.2), channel_id=1, replica_groups={{0,1,2,3}}, dimensions={0}, use_global_device_ids=true
  param.1 = bf16[24576,24576]{1,0} parameter(1), sharding={devices=[1,4]<=[4]}
  dot.1 = bf16[4096,24576]{1,0} dot(all-gather, param.1), lhs_contracting_dims={1}, rhs_contracting_dims={0}
  reshape.5 = bf16[1,4,1024,24576]{3,2,1,0} reshape(dot.1)
  all-to-all = bf16[1,4,1024,24576]{3,2,1,0} all-to-all(reshape.5), channel_id=2, replica_groups={{0,1,2,3}}, dimensions={1}
  transpose = bf16[1,1024,4,24576]{3,1,2,0} transpose(all-to-all), dimensions={0,2,1,3}
  ROOT reshape.6 = bf16[1,1024,98304]{2,1,0} reshape(transpose)
}

The above all-gather+gemm pattern will be rewritten into a while loop with sharded dots and collective-permutes to send other shards to neighbors. At a high-level, the rewritten graph has this structure depicted using pseudo-HLO for simplicity (full HLO in Appendix 1):

all_gather_while_body (initial_matrix, sharded_lhs, rhs) {
dot = dot(sharded_lhs, rhs)
// bidirectional sendrecv sharded lhs to/from peer on collective stream
collective-permute = collective-permute(sharded_lhs)
// Update intermediate result
dynamic-update-slice = dynamic-update-slice(initial_matrix, dot)
ROOT (dynamic-update-slice, collective-permute, rhs)
}

ENTRY main {
  lhs = parameter(0), sharding={devices=[1,4,1]<=[4]}
  rhs = parameter(1), sharding={devices=[1,4]<=[4]}
  c0 = constant(0)
  init = broadcast(c0)
  input_tuple = tuple(param.2, init, param.3)
 ROOT while.1 = while(input_tuple), body=all_gather_while_body, backend_config={"known_trip_count":{"n":"4"}}
}

The above loop body implements the exact logic of collective matmul, but it has some drawbacks in terms of performance due to the fact that each while loop iteration contains a single gemm for 1 partition. One observation is that once a worker receives all data from its peer for the second partition of a matrix, the second partition’s gemm can start right away concurrently on another stream while the gemm of the first partition is running. The above logic can then be improved by unrolling the loop by a factor of 2 to allow multiple gemms running in the while loop body for a more efficient execution. An example of a 2-partition overlapping execution will look like:

ag_while_body (initial_matrix, sharded_lhs, rhs) {
  // bidirectional sendrecv sharded lhs to/from peer on collective stream
  collective-permute-start = collective-permute-start(sharded_lhs), operation_queue_id=3
  // Concurrently on another compute stream
  dot = dot(sharded_lhs, rhs), operation_queue_id=2
  // Await on main stream
  collective-permute-done = collective-permute-done(collective-permute-start), operation_queue_id
=0
  // Run another gemm on main stream when data is ready
  dot2 = dot(collective-permute-done, rhs), operation_queue_id=0
  // Update intermediate result on the main stream, await on operation_queue_id=2
  dynamic-update-slice = dynamic-update-slice(initial_matrix, dot), operation_queue_id=0, wait_on_operation_queues={2}
  dynamic-update-slice2 = dynamic-update-slice(dynamic-update-slice, dot2), operation_queue_id=0
  ROOT (dynamic-update-slice2, collective-permute-done, rhs)
}

The first dot in the above example will be run on a separate compute stream operation_queue_id=2. When its consumer, which is the first DUS, runs on the main compute stream, it will await the async event on the stream with operation_queue_id=2 due to data dependency. There are 2 alternatives to achieve this:

@jurahul suggested an idea to re-use while loop construct, to expand on that idea, we can achieve running multiple dots by unrolling the while loop by a factor of 2. Then dots from 2 partitions can be parallelized within one iteration. Benchmarks have shown that overlapping more than 2 gemms is not beneficial.
Another alternative is to manually construct the sharded dot and sendrecv sequences into a sub-computation region of a custom_call object. The instruction sequence will be very similar to the fully unrolled loop above. Since this is a custom call, other loop optimization passes won’t impact the custom call body so that frees us from adding special attributes. But we’d need to implement a separate thunk executor for this custom call.

Thunk Execution

Either of the alternatives above will require a multi-streamed execution. We will discuss how to achieve this using the while thunk alternative. The execution strategy will be the same for the other custom call alternative.

The current while thunk executes all compute thunk in a single stream. In order to do multi-stream execution, ExecuteParams will need to host multiple compute streams. The number can be controlled by a debug option for now. We will create the corresponding number of compute streams when executing the while thunk.

For optimal performance, we will need to add a operation_queue_id attribute to each instruction that will run on a non-default compute stream to instruct the runtime of which stream this kernel should be dispatched to. Non-attributed instructions are still dispatched in the default way: collectives to the collective stream; computes to the default compute stream. Note that the operation_queue_id is merely an opaque identifier, it doesn’t necessarily reflect the actual stream id on the hardware, thunks need to keep a mapping of the HLO operation_queue_id and the actual hardware stream id if needed.

Use AsyncStart and AsyncDone Ops with Synchronization thunk

Only adding stream attributes to existing instructions doesn’t change the liveness of buffers. This could have a drastic impact on buffer assignment for parallel gemms because the buffer assigner doesn’t know the gemms should consume separate buffers. We could change the logic of buffer assignment to only share buffers for kernels on the same stream. However, the buffer assignment is shared by all backends, having stream-specific logic is not reasonable without heavy refactoring of the code. @ezhulenev has suggested an approach here. The high-level idea is to add a pass to wrap compute kernels that don’t run on the main stream into asyncStart and asyncDone operations. There’s already infra set up to support liveness of buffers for async pairs so the buffer assignment should already be taken care of.
In order to be more explicit about parallel execution, we can introduce a synchronization thunk. The thunk will await on the streams of its operands and return when the data is available, an example to show the interface and definition of its ExecuteOnStream:

class AwaitThunk : public Thunk {
 public:
  AwaitThunk(int64_t source_stream_id, vector<int64_t> wait_on_streams):
    source_stream_id_(source_stream_id),
    wait_on_streams_(wait_on_streams) {}
 private:
  int64_t source_stream_id_;
  vector<int64_t> wait_on_streams_;
}

absl::Status AwaitThunk::ExecuteOnStream(const ExecuteParams& params) {
  se::Stream& source_stream = params.compute_streams[source_stream_id_];
  for(int64_t stream : wait_on_streams_) {
    source_stream.ThenWaitFor(params.compute_streams[stream])
  }
     return absl::OkStatus();
}

The ir emitter will emit this thunk right before:

asyncStart kernel that runs on a non-default compute stream
Consumer of the corresponding asyncDone op if it’s running on a different stream

Here’s a high-level flow of the lowering logic:

Other Considerations

Scheduling: For both alternatives, we will need to introduce a new scheduler resource type so LHS won’t try to overlap it with other collectives.
The number of shards are currently determined by a simple model in DotHandler. But we still need to determine the number of gemms to run in parallel. For the initial phase, we can assume to use 2 streams. The end goal is to use a cost model, possible GpuPerformanceModel, to determine concurrency. However we’d need to know what the dot will be lowered to triton or cublas, the current phase ordering of SMPD passes won’t suffice. We’d likely need to introduce another pass after all the gemm rewriters to use the cost model to assess whether we want gemms to actually execute concurrently or not.
Triggering condition of collective matmul, currently it’s controlled by a threshold value as an internal field which is disabled by default for GPU. We will keep this mechanism and expose the threshold in debug options so users can decide to trigger it based on their model size. The default threshold will need to be determined using heuristics once we conduct more experiments.

Appendix

HloModule pjit__unnamed_wrapped_function_, is_scheduled=true, entry_computation_layout={(bf16[1,1024,24576]{2,1,0}, bf16[24576,24576]{1,0})->bf16[1,1024,98304]{2,1,0}}, num_partitions=4

window_collective_permute {
  window.1 = bf16[1024,24576]{1,0} parameter(0)
  collective-permute-start = (bf16[1024,24576]{1,0}, bf16[1024,24576]{1,0}) collective-permute-start(window.1), channel_id=2, source_target_pairs={{0,3},{1,0},{2,1},{3,2}}, backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"collective_backend_config":{"is_sync":true,"no_parallel_custom_call":false}}
  collective-permute-done = bf16[1024,24576]{1,0} collective-permute-done(collective-permute-start)
  ROOT tuple.9 = (bf16[1024,24576]{1,0}) tuple(collective-permute-done)
}

fused_computation.6 {
  param_0.10 = bf16[1024,24576]{1,0} parameter(0)
  ROOT copy.21 = bf16[1024,24576]{1,0} copy(param_0.10)
}

last_iteration_noop {
  window = bf16[1024,24576]{1,0} parameter(0)
  wrapped_copy.4 = bf16[1024,24576]{1,0} fusion(window), kind=kLoop, calls=fused_computation.6
  ROOT tuple.10 = (bf16[1024,24576]{1,0}) tuple(wrapped_copy.4)
}

fused_computation {
  param_0 = bf16[4096,24576]{1,0} parameter(0)
  param_1 = bf16[1024,24576]{1,0} parameter(1)
  constant_38 = u32[] constant(0)
  param_2.5 = u32[] parameter(2)
  param_3.7 = u32[] parameter(3)
  add.6 = u32[] add(param_2.5, param_3.7)
  constant_37 = u32[] constant(3)
  and.3 = u32[] and(add.6, constant_37)
  clamp.3 = u32[] clamp(constant_38, and.3, constant_37)
  convert.3 = s32[] convert(clamp.3)
  constant_36 = s32[] constant(1024)
  multiply.6 = s32[] multiply(convert.3, constant_36)
  constant_35 = s32[] constant(0)
  ROOT dynamic-update-slice.3 = bf16[4096,24576]{1,0} dynamic-update-slice(param_0, param_1, multiply.6, constant_35)
} // fused_computation

fused_computation.2 {
  param_0.5 = u32[] parameter(0)
  constant_40 = u32[] constant(1)
  add.7 = u32[] add(param_0.5, constant_40)
  constant_39 = u32[] constant(4)
  ROOT compare.6 = pred[] compare(add.7, constant_39), direction=LT
}

fused_computation.7 {
  param_0.11 = bf16[1024,24576]{1,0} parameter(0)
  ROOT copy.22 = bf16[1024,24576]{1,0} copy(param_0.11)
}

fused_computation.8 {
  param_0.12 = u32[] parameter(0)
  param_1.4 = u32[] parameter(1)
  ROOT add.8 = u32[] add(param_0.12, param_1.4)
}

windowed_dot_general_body.clone {
  constant_29 = u32[] constant(1)
  param.4 = (bf16[1024,24576]{1,0}, bf16[24576,24576]{1,0}, bf16[4096,24576]{1,0}, u32[]) parameter(0)
  partition-id.2 = u32[] partition-id()
  get-tuple-element.35 = u32[] get-tuple-element(param.4), index=3
  fusion.2 = pred[] fusion(get-tuple-element.35), kind=kLoop, calls=fused_computation.2
  get-tuple-element.32 = bf16[1024,24576]{1,0} get-tuple-element(param.4), index=0
  get-tuple-element.37 = bf16[24576,24576]{1,0} get-tuple-element(param.4), index=1
  get-tuple-element.34 = bf16[4096,24576]{1,0} get-tuple-element(param.4), index=2
  custom-call.1 = (bf16[1024,24576]{1,0}, s8[4194304]{0}) custom-call(get-tuple-element.32, get-tuple-element.37), custom_call_target="__cublas$gemm", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"gemm_backend_config":{"alpha_real":1,"alpha_imag":0,"beta":0,"dot_dimension_numbers":{"lhs_contracting_dimensions":["1"],"rhs_contracting_dimensions":["0"],"lhs_batch_dimensions":[],"rhs_batch_dimensions":[]},"precision_config":{"operand_precision":["DEFAULT","DEFAULT"]},"epilogue":"DEFAULT","lhs_stride":"25165824","rhs_stride":"603979776","grad_x":false,"grad_y":false}}
  get-tuple-element.25 = bf16[1024,24576]{1,0} get-tuple-element(custom-call.1), index=0
  conditional.clone.1 = (bf16[1024,24576]{1,0}) conditional(fusion.2, get-tuple-element.32, get-tuple-element.32), true_computation=window_collective_permute, false_computation=last_iteration_noop, control-predecessors={custom-call.1}
  get-tuple-element.36 = bf16[1024,24576]{1,0} get-tuple-element(conditional.clone.1), index=0
  wrapped_copy.12 = bf16[1024,24576]{1,0} fusion(get-tuple-element.36), kind=kLoop, calls=fused_computation.7
  fusion = bf16[4096,24576]{1,0} fusion(get-tuple-element.34, get-tuple-element.25, get-tuple-element.35, partition-id.2), kind=kLoop, calls=fused_computation
  wrapped_add.4 = u32[] fusion(get-tuple-element.35, constant_29), kind=kLoop, calls=fused_computation.8, control-predecessors={fusion, fusion.2}
  ROOT tuple.13 = (bf16[1024,24576]{1,0}, bf16[24576,24576]{1,0}, bf16[4096,24576]{1,0}, u32[]) tuple(wrapped_copy.12, get-tuple-element.37, fusion, wrapped_add.4)
} // windowed_dot_general_body.clone

fused_computation.9 {
  param_0.13 = u32[] parameter(0)
  param_1.5 = u32[] parameter(1)
  ROOT compare.7 = pred[] compare(param_0.13, param_1.5), direction=LT
}

windowed_dot_general_cond.clone {
  constant_34 = u32[] constant(4)
  param.5 = (bf16[1024,24576]{1,0}, bf16[24576,24576]{1,0}, bf16[4096,24576]{1,0}, u32[]) parameter(0)
  get-tuple-element.19 = u32[] get-tuple-element(param.5), index=3
  ROOT wrapped_compare.5 = pred[] fusion(get-tuple-element.19, constant_34), kind=kLoop, calls=fused_computation.9
}

fused_computation.1 {
  param_0.2 = bf16[1,4,1024,24576]{3,2,0,1} parameter(0)
  bitcast.72 = bf16[4,1,1024,24576]{3,2,1,0} bitcast(param_0.2)
  ROOT transpose.2 = bf16[1,1024,4,24576]{3,2,1,0} transpose(bitcast.72), dimensions={1,2,0,3}
}

fused_computation.3 {
  param_0.7 = bf16[1,1024,24576]{2,1,0} parameter(0)
  ROOT copy.19 = bf16[1,1024,24576]{2,1,0} copy(param_0.7)
}

fused_computation.4 {
  param_0.8 = bf16[] parameter(0)
  ROOT broadcast.4 = bf16[4096,24576]{1,0} broadcast(param_0.8), dimensions={}
}

fused_computation.5 {
  param_0.9 = u32[] parameter(0)
  ROOT copy.20 = u32[] copy(param_0.9)
}

ENTRY main.7_spmd {
  param.2 = bf16[1,1024,24576]{2,1,0} parameter(0), sharding={devices=[1,4,1]<=[4]}
  constant_8 = u32[] constant(0)
  constant_6 = bf16[] constant(0)
  param.3 = bf16[24576,24576]{1,0} parameter(1), sharding={devices=[1,4]<=[4]}
  wrapped_copy.17 = bf16[1,1024,24576]{2,1,0} fusion(param.2), kind=kLoop, calls=fused_computation.3
  wrapped_copy.18 = u32[] fusion(constant_8), kind=kLoop, calls=fused_computation.5
  bitcast.51 = bf16[1024,24576]{1,0} bitcast(wrapped_copy.17)
  wrapped_broadcast.3 = bf16[4096,24576]{1,0} fusion(constant_6), kind=kLoop, calls=fused_computation.4
  tuple.11 = (bf16[1024,24576]{1,0}, bf16[24576,24576]{1,0}, bf16[4096,24576]{1,0}, u32[]) tuple(bitcast.51, param.3, wrapped_broadcast.3, wrapped_copy.18)
  while.1 = (bf16[1024,24576]{1,0}, bf16[24576,24576]{1,0}, bf16[4096,24576]{1,0}, u32[]) while(tuple.11), condition=windowed_dot_general_cond.clone, body=windowed_dot_general_body.clone, backend_config={"known_trip_count":{"n":"4"}}
  get-tuple-element.7 = bf16[4096,24576]{1,0} get-tuple-element(while.1), index=2
  bitcast.60 = bf16[1,4,1024,24576]{3,2,0,1} bitcast(get-tuple-element.7)
  all-to-all-start = ((bf16[1,4,1024,24576]{3,2,0,1}), bf16[1,4,1024,24576]{3,2,0,1}) all-to-all-start(bitcast.60), channel_id=3, replica_groups={{0,1,2,3}}, dimensions={1}, backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"collective_backend_config":{"is_sync":true,"no_parallel_custom_call":false}}
  all-to-all-done = bf16[1,4,1024,24576]{3,2,0,1} all-to-all-done(all-to-all-start)
  fusion.1 = bf16[1,1024,4,24576]{3,2,1,0} fusion(all-to-all-done), kind=kLoop, calls=fused_computation.1
  ROOT bitcast.3 = bf16[1,1024,98304]{2,1,0} bitcast(fusion.1), frontend_attributes={fingerprint_before_lhs="4f74a97cf542689bb2316608ce3bcd4e"}
} // main.7_spmd

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] Improve windowed einsum(collective matmul) for XLA:GPU #8865

{{title}}

Replies: 0 comments

Select a reply

[RFC] Improve windowed einsum(collective matmul) for XLA:GPU #8865

Tixxx Jan 26, 2024 Collaborator

Motivation

High-level Design

HLO Graph

Thunk Execution

Use AsyncStart and AsyncDone Ops with Synchronization thunk

Other Considerations

Appendix

Replies: 0 comments

Tixxx
Jan 26, 2024
Collaborator