
[MOE] Try to optimize moe align block size multiblocks cuda kernel #3137

Draft
wants to merge 3 commits into main

Conversation

@yiakwy-xpu-ml-framework-team (Contributor) commented Jan 26, 2025

Motivation

This is a follow-up to PR#2970, which enables efficient multi-block execution.

The code passes correctness tests in many simple configurations (seq_len=16384):

(Screenshot 2025-01-26 11:05:29: correctness test results)

This enables distributed cumsum on a single GPU (via cudaLaunchCooperativeKernel) for large-scale workloads.

To make this possible, note that Sum(F(a + b)) != Sum(F(a)) + Sum(F(b)) in general, where F(x) = floor((x + c - 1) / c); writing a = m*c + p and b = n*c + q, with integers p, q in [1, c), proves this. The per-expert counts therefore have to be accumulated in unaligned form before the alignment F is applied.
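
A minimal host-side C++ sketch of this property (the block size value, variable names, and the ceil_div helper are illustrative, not taken from the kernel):

#include <cstdio>

// F(x) = ceil(x / c), the block-size alignment function discussed above.
static int ceil_div(int x, int c) { return (x + c - 1) / c; }

int main() {
  const int c = 4;   // hypothetical block_size
  int a = 1, b = 1;  // partial token counts for one expert from two blocks
  // Aligning the partial counts and then summing over-counts the padding:
  std::printf("F(a) + F(b) = %d\n", ceil_div(a, c) + ceil_div(b, c));  // prints 2
  // Summing the raw counts first and then aligning gives the correct value:
  std::printf("F(a + b)    = %d\n", ceil_div(a + b, c));               // prints 1
  return 0;
}

This is why the kernel keeps the per-block cumsum unaligned until the counts have been combined globally.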

Algorithm Structure

The code in the kernel moe_align_block_size_multiblocks_kernel is organized as follows:

// stage 1: compute local shared_counts
...
// stage 2: compute local unaligned cumsum (because of the bad property of the 'F' function noted above) using 16x16 fragments across many warps, and cache the results in tokens_cnt
...
    __threadfence_system();
    grid.sync();
// stage 3: compute global unaligned cumsum using our newly introduced distributed cumsum algorithm in https://github.com/sgl-project/sglang/pull/2970
{
   ...
    __threadfence_system();
    grid.sync();
}
// stage 4: compute global aligned cumsum and store back to cumsum_ptr
...
// stage 5: compute expert_ids, sorted_ids
...
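
For context, a bare-bones sketch of the grid-synchronized staging pattern above (the kernel name, parameters, and stage bodies are placeholders, not the PR's actual code):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Sketch only: each stage stands in for the corresponding step of
// moe_align_block_size_multiblocks_kernel; real bodies are omitted.
__global__ void staged_kernel_sketch(int* scratch, int n) {
  cg::grid_group grid = cg::this_grid();

  // stages 1-2: per-block partial results written to global scratch memory
  // ...

  __threadfence_system();  // make per-block writes visible device-wide
  grid.sync();             // every block reaches here before stage 3 starts

  // stage 3: combine the per-block partials (distributed cumsum)
  // ...

  __threadfence_system();
  grid.sync();

  // stages 4-5: consume the globally combined results
  // ...
}

Note that grid.sync() is only valid when the kernel is launched cooperatively; see the launch sketch under Modifications below.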

The overall steps are very similar to those in the Triton version: PR#2913

Considering ROCm support, the strategy is to develop this multi-block code with Ampere-era techniques; based on my quick evaluation, no features from sm90 or newer architectures are needed.

Modifications

We introduce cooperative_groups control, available since the Volta architecture and a typical strategy on Ampere:

#ifdef USE_ROCM
#include <hip/hip_runtime.h>
#include <hip/hip_cooperative_groups.h>
#else
#include <cooperative_groups.h>
#endif // USE_ROCM
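
Grid-wide synchronization additionally requires a cooperative launch on the host side. A minimal sketch (it reuses the hypothetical staged_kernel_sketch from the sketch above; the PR's real launch code differs):

#include <cuda_runtime.h>

__global__ void staged_kernel_sketch(int* scratch, int n);  // defined in the sketch above

// Hypothetical cooperative launch; grid.sync() is only legal for kernels
// launched through cudaLaunchCooperativeKernel.
cudaError_t launch_cooperative_sketch(int* scratch, int n) {
  int dev = 0, sm_count = 0, blocks_per_sm = 0;
  cudaGetDevice(&dev);
  cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, dev);
  // The whole grid must be co-resident on the device or the launch fails.
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(
      &blocks_per_sm, staged_kernel_sketch, /*blockSize=*/256, /*dynSmemSize=*/0);
  dim3 grid(sm_count * blocks_per_sm), block(256);
  void* args[] = {&scratch, &n};
  return cudaLaunchCooperativeKernel((void*)staged_kernel_sketch, grid, block,
                                     args, /*sharedMem=*/0, /*stream=*/0);
}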

After careful study, I believe a 16x16 warp fragment is suitable for the distributed cumsum computation. The numbers will be benchmarked later.
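
As background only (this is a generic warp-level inclusive scan, not the PR's 16x16-fragment layout), the kind of primitive such a tiled cumsum builds on looks roughly like:

// Warp-level inclusive scan via shuffle intrinsics; assumes NVIDIA's 32-lane
// warps (ROCm wavefronts are 64 lanes and would need a wider loop).
__device__ __forceinline__ int warp_inclusive_scan(int value) {
  const unsigned full_mask = 0xffffffffu;
  #pragma unroll
  for (int offset = 1; offset < 32; offset <<= 1) {
    int neighbor = __shfl_up_sync(full_mask, value, offset);
    if ((threadIdx.x & 31) >= offset) value += neighbor;
  }
  return value;  // lane i now holds the sum of values from lanes 0..i
}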

Correctness

(Screenshot 2025-01-26 11:05:29: correctness test results)

Benchmark for large-scale data

(WIP)

Next Steps

  • Verify on MI300X @HaiShaw
  • Refactor the code to make it simpler and cleaner:
    • use flashinfer vec_t for 128-bit copies of int32_t (int16_t may also be possible); see the sketch after this list
    • consider cutlass as a replacement
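
For reference, and independent of flashinfer's vec_t API, a plain CUDA sketch of a 128-bit vectorized copy for int32_t data (the kernel name and layout assumptions are illustrative):

#include <cstdint>

// Copies n int32_t values with 128-bit (int4) loads/stores; sketch only.
// Assumes src and dst are 16-byte aligned and n is a multiple of 4.
__global__ void copy_int32_vec4(const int32_t* __restrict__ src,
                                int32_t* __restrict__ dst, int n) {
  int n_vec = n / 4;  // number of int4 (4 x int32_t) elements
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_vec;
       i += gridDim.x * blockDim.x) {
    reinterpret_cast<int4*>(dst)[i] = reinterpret_cast<const int4*>(src)[i];
  }
}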

Checklist

}

// NOTE (yiakwy) : step 2, loop tail
if (tid == active_threads - 1) {
Collaborator

Can you remove your balance threads optimization? The current version of the code is too complex, and premature optimization is the root of all evil. I suggest removing this part first to expose the core modifications. Additionally, the kernel for a single block should be retained and the current code should be enabled when the number of tokens is greater than or equal to 32768, unless this multi-block kernel outperforms in all cases.
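
A minimal sketch of the dispatch being suggested here (the threshold constant and function name are placeholders, not actual sglang code):

// Hypothetical host-side selection between the existing single-block kernel
// and the new multi-block cooperative kernel; names and threshold are illustrative.
constexpr int kMultiBlockTokenThreshold = 32768;

inline bool use_multiblock_moe_align(int num_tokens) {
  // Keep the proven single-block kernel for small workloads and switch to
  // the multi-block cooperative kernel only at large token counts.
  return num_tokens >= kMultiBlockTokenThreshold;
}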

Contributor Author

Yep, enabling multi-block execution on top of the existing kernel implementation is on the way.
