[MOE] try to optimize cu kernel single block execution - distribute cumsum workload from thread 0 to other threads #2970

Open
wants to merge 2 commits into main

Conversation

yiakwy-xpu-ml-framework-team
Contributor

@yiakwy-xpu-ml-framework-team commented Jan 19, 2025

Motivation

The current CUDA kernel runs best when the workload fits into a single block execution. (When the workload is large enough, the Triton kernel performs better, since the data is distributed more evenly across many blocks and there is less chance of two threads in a warp hitting the same shared-memory bank via expert_id.)

Before adapting the kernel to multi-block execution: the existing kernel computes the expert cumsum entirely in thread 0. Cumsum is actually very similar to a prefix scan:

[Diagram: cumsum optimization (drawio)]

For an expert count of 256, the previous single-block execution has thread 0 (warp 0) process all 256 elements, while in the load-balanced version the busiest thread (thread 15) processes only 16 + 16 = 32 elements (its own 16-element chunk plus the 16 per-chunk totals). Hence the wavefront reaches completion faster.
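A minimal sketch of this load-balanced cumsum, assuming 256 experts split across 16 scan threads; identifiers such as `expert_counts`, `cumsum`, and `kScanThreads` are illustrative rather than the exact names used in this PR:

```cuda
// Sketch only: load-balanced in-block cumsum over per-expert token counts.
// Assumes num_experts is a multiple of kScanThreads (e.g. 256 / 16 = 16).
constexpr int kScanThreads = 16;

__global__ void load_balanced_cumsum(const int* expert_counts,
                                     int* cumsum,
                                     int num_experts) {
  __shared__ int chunk_sums[kScanThreads];
  const int tid = threadIdx.x;
  const int chunk = num_experts / kScanThreads;

  if (tid < kScanThreads) {
    // Phase 1: each thread scans its own chunk sequentially (16 elements).
    int running = 0;
    for (int i = tid * chunk; i < (tid + 1) * chunk; ++i) {
      running += expert_counts[i];
      cumsum[i] = running;  // per-chunk inclusive prefix sum
    }
    chunk_sums[tid] = running;
  }
  __syncthreads();

  if (tid < kScanThreads) {
    // Phase 2: add the exclusive prefix of the chunk totals
    // (at most kScanThreads - 1 extra additions per thread).
    int offset = 0;
    for (int j = 0; j < tid; ++j) offset += chunk_sums[j];
    for (int i = tid * chunk; i < (tid + 1) * chunk; ++i) {
      cumsum[i] += offset;
    }
  }
}
```

The longest-running thread now touches roughly 16 local elements plus the 16 chunk totals, matching the 16 + 16 = 32 figure above, instead of the 256 elements thread 0 handled before.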

Modifications

  • Reduce direct cumsum accesses: not much positive effect; perhaps expert_num is too small for this to matter. [x]
  • Load-balance the cumsum computation (see the sketch above): ~25% boost for small batches; no effect for large batches (where the existing Triton implementation outperforms CUDA).

old moe align kernel

[Screenshot 2025-01-18 13:48:00]

new moe align kernel

[Screenshot 2025-01-19 08:28:41]

Test Env

  • A100 GPU
  • Latest SGLang CUDA Env

Checklist

Next Step

  • (algorithm) Enable efficient multi-block execution and adjust the thread/block configuration for large workloads
  • Enable cuRadix with CUB (supported by @BBuf); see the sketch below
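As a rough illustration of that CUB-based next step, a sketch that assumes the tokens are sorted by expert id with `cub::DeviceRadixSort::SortPairs` (buffer names here are hypothetical, not from this PR):

```cuda
// Sketch only: device-wide radix sort of (expert_id, token_id) pairs via CUB.
#include <cub/cub.cuh>
#include <cuda_runtime.h>

void sort_tokens_by_expert(const int* d_expert_ids_in, int* d_expert_ids_out,
                           const int* d_token_ids_in, int* d_token_ids_out,
                           int num_tokens, cudaStream_t stream) {
  void* d_temp_storage = nullptr;
  size_t temp_storage_bytes = 0;

  // First call only queries the required temporary-storage size.
  cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes,
                                  d_expert_ids_in, d_expert_ids_out,
                                  d_token_ids_in, d_token_ids_out,
                                  num_tokens, 0, sizeof(int) * 8, stream);
  cudaMallocAsync(&d_temp_storage, temp_storage_bytes, stream);

  // Second call performs the key/value sort (expert id -> token id).
  cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes,
                                  d_expert_ids_in, d_expert_ids_out,
                                  d_token_ids_in, d_token_ids_out,
                                  num_tokens, 0, sizeof(int) * 8, stream);
  cudaFreeAsync(d_temp_storage, stream);
}
```

The two-call pattern (size query, then sort) is standard CUB usage; the sorted token ids could then feed the existing block-size alignment step.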

@yiakwy-xpu-ml-framework-team changed the title from "optimize cu kernel single block execution - distribute cumsum workload from thread 0 to other threads" to "[MOE] try to optimize cu kernel single block execution - distribute cumsum workload from thread 0 to other threads" Jan 19, 2025
@BBuf
Collaborator

BBuf commented Jan 19, 2025

Why is there such a large difference in the Triton benchmark results in the two screenshots above? On the same GPU, Triton's kernel latency should be relatively consistent. @yiakwy-xpu-ml-framework-team

@BBuf
Collaborator

BBuf commented Jan 19, 2025

Based on the proportions in the screenshots above, the PR's optimization seems to be slower than the main branch? For example, at bs=1 and seq=64, the CUDA implementation on the main branch takes approximately 26/203 = 0.128 of the Triton time, but in your PR's test results it shows 20.1/129.79 = 0.154. I'm not sure whether you're running the same script on the same GPU. Additionally, the newly added code significantly reduces the readability of the original code. I believe that for small batch_size or small seq_length we don't need to do any optimization. Instead, we should focus on increasing parallelism by using multiple blocks when dealing with large batch sizes or large seq_length.

@BBuf
Collaborator

BBuf commented Jan 19, 2025

@yiakwy-xpu-ml-framework-team However, your fix for the bug in the benchmark script is correct; you can create a new PR for the script changes, and I can merge it.

@yiakwy-xpu-ml-framework-team
Contributor Author

yiakwy-xpu-ml-framework-team commented Jan 20, 2025

@BBuf Yes, the Triton differences were caused by Triton 3.2.0 (built by hand for debugging); I have updated the results.

old

[Screenshot 2025-01-20 10:28:19]

new

[Screenshot 2025-01-20 10:20:13]

The new version is indeed about 25% faster for small-batch workloads. And I agree with you that enabling efficient multi-block execution is more important; it is work in progress.
