[MOE] try to optimize cu kernel single block execution - distribute cumsum workload from thread 0 to other threads #2970

Open
wants to merge 2 commits into main

Conversation

yiakwy-xpu-ml-framework-team
Contributor

@yiakwy-xpu-ml-framework-team commented Jan 19, 2025

Motivation

The current CUDA kernel runs best when the workload fits into a single block execution. (When the workload is large enough, the Triton kernel performs better, since the data is distributed more evenly across many blocks and there is less chance of two threads in a warp hitting the same shared-memory bank via expert_id.)

Before adapting the kernel to multi-block execution: the existing kernel computes the expert cumsum entirely in thread 0. Cumsum is actually very similar to a prefix scan:

[Diagram: cumsum optimization (drawio)]

For an expert count of 256, the previous single-block execution has thread 0 (warp 0) process all 256 elements, while in the load-balanced version the busiest thread (thread 15) processes only 16 + 16 = 32 elements (its own 16-element chunk plus the 16 per-chunk totals). Hence the wavefront reaches completion faster.
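A minimal sketch of this load-balanced cumsum, assuming 256 experts split across 16 scan threads; identifiers such as `expert_counts`, `cumsum`, and `kScanThreads` are illustrative rather than the exact names used in this PR:

```cuda
// Sketch only: load-balanced in-block cumsum over per-expert token counts.
// Assumes num_experts is a multiple of kScanThreads (e.g. 256 / 16 = 16).
constexpr int kScanThreads = 16;

__global__ void load_balanced_cumsum(const int* expert_counts,
                                     int* cumsum,
                                     int num_experts) {
  __shared__ int chunk_sums[kScanThreads];
  const int tid = threadIdx.x;
  const int chunk = num_experts / kScanThreads;

  if (tid < kScanThreads) {
    // Phase 1: each thread scans its own chunk sequentially (16 elements).
    int running = 0;
    for (int i = tid * chunk; i < (tid + 1) * chunk; ++i) {
      running += expert_counts[i];
      cumsum[i] = running;  // per-chunk inclusive prefix sum
    }
    chunk_sums[tid] = running;
  }
  __syncthreads();

  if (tid < kScanThreads) {
    // Phase 2: add the exclusive prefix of the chunk totals
    // (at most kScanThreads - 1 extra additions per thread).
    int offset = 0;
    for (int j = 0; j < tid; ++j) offset += chunk_sums[j];
    for (int i = tid * chunk; i < (tid + 1) * chunk; ++i) {
      cumsum[i] += offset;
    }
  }
}
```

The longest-running thread now touches roughly 16 local elements plus the 16 chunk totals, matching the 16 + 16 = 32 figure above, instead of the 256 elements thread 0 handled before.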

Modifications

  • Reduce direct cumsum accesses: not much positive effect; perhaps expert_num is too small for this to matter. [x]
  • Load-balance the cumsum computation (see the sketch above): ~25% boost for small batches; no effect for large batches (where the existing Triton implementation outperforms CUDA).

old moe align kernel

[Screenshot 2025-01-18 13:48:00]

new moe align kernel

[Screenshot 2025-01-19 08:28:41]

Test Env

  • A100 GPU
  • Latest SGLang CUDA Env

Checklist

Next Step

  • (algorithm) Enable efficient multi-block execution and adjust the thread/block configuration for large workloads
  • Enable cuRadix with CUB (supported by @BBuf); see the sketch below
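As a rough illustration of that CUB-based next step, a sketch that assumes the tokens are sorted by expert id with `cub::DeviceRadixSort::SortPairs` (buffer names here are hypothetical, not from this PR):

```cuda
// Sketch only: device-wide radix sort of (expert_id, token_id) pairs via CUB.
#include <cub/cub.cuh>
#include <cuda_runtime.h>

void sort_tokens_by_expert(const int* d_expert_ids_in, int* d_expert_ids_out,
                           const int* d_token_ids_in, int* d_token_ids_out,
                           int num_tokens, cudaStream_t stream) {
  void* d_temp_storage = nullptr;
  size_t temp_storage_bytes = 0;

  // First call only queries the required temporary-storage size.
  cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes,
                                  d_expert_ids_in, d_expert_ids_out,
                                  d_token_ids_in, d_token_ids_out,
                                  num_tokens, 0, sizeof(int) * 8, stream);
  cudaMallocAsync(&d_temp_storage, temp_storage_bytes, stream);

  // Second call performs the key/value sort (expert id -> token id).
  cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes,
                                  d_expert_ids_in, d_expert_ids_out,
                                  d_token_ids_in, d_token_ids_out,
                                  num_tokens, 0, sizeof(int) * 8, stream);
  cudaFreeAsync(d_temp_storage, stream);
}
```

The two-call pattern (size query, then sort) is standard CUB usage; the sorted token ids could then feed the existing block-size alignment step.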

@yiakwy-xpu-ml-framework-team changed the title from "optimize cu kernel single block execution - distribute cumsum workload from thread 0 to other threads" to "[MOE] try to optimize cu kernel single block execution - distribute cumsum workload from thread 0 to other threads" Jan 19, 2025
@BBuf
Collaborator

BBuf commented Jan 19, 2025

Why is there such a large difference in the Triton benchmark results in the two screenshots above? On the same GPU, Triton's kernel latency should be relatively consistent. @yiakwy-xpu-ml-framework-team

@BBuf
Collaborator

BBuf commented Jan 19, 2025

Based on the proportions in the screenshots above, the PR's optimization seems to be slower than the main branch? For example, at bs=1 and seq=64, the CUDA implementation on the main branch takes approximately 26/203 = 0.128 of the Triton time, but in your PR's test results it shows 20.1/129.79 = 0.154. I'm not sure whether you're running the same script on the same GPU. Additionally, the newly added code significantly reduces the readability of the original code. I believe that for small batch_size or small seq_length we don't need to do any optimization. Instead, we should focus on increasing parallelism by using multiple blocks when dealing with large batch sizes or large seq_length.

@BBuf
Collaborator

BBuf commented Jan 19, 2025

@yiakwy-xpu-ml-framework-team However, your fix for the bug in the benchmark script is correct; you can create a new PR for the script changes, and I can merge it.

@yiakwy-xpu-ml-framework-team
Contributor Author

yiakwy-xpu-ml-framework-team commented Jan 20, 2025

@BBuf Yes, the Triton differences were caused by Triton 3.2.0 (built by hand for debugging); I have updated the results.

old

[Screenshot 2025-01-20 10:28:19]

new

[Screenshot 2025-01-20 10:20:13]

The new version is indeed about 25% faster for small-batch workloads. And I agree with you that enabling efficient multi-block execution is more important; it is work in progress.
