[MOE] try to optimize cu kernel single block execution - distribute cumsum workload from thread 0 to other threads #2970
base: main
Conversation
Why is there such a large difference in the Triton benchmark results in the two screenshots above? On the same GPU, Triton's kernel latency should be relatively consistent. @yiakwy-xpu-ml-framework-team
Based on the proportions in the screenshots above, the PR's optimization seems to be slower than the main branch? For example, at
@yiakwy-xpu-ml-framework-team However, your fix for the bug in the benchmark script is correct, and you can create a new PR for the script changes; I can merge it.
@BBuf Yes, the Triton differences were caused by Triton 3.2.0 (built by hand for debugging); I have updated the results (old / new), and the new version is indeed 25% faster at small-batch workloads. And I agree with you that enabling efficient multi-block execution is more important. It is work in progress.
76f0601 to eb76e72
Motivation
The current CUDA kernel runs best when the workload fits into a single block. (When the workload is large enough, Triton has better performance, since the data is distributed more evenly across many blocks and there is less chance of two threads in a warp accessing the same bank via expert_id.)
Before adapting the kernel to multi-block execution, note that the existing kernel computes the expert cumsum entirely in thread 0. Cumsum is actually very similar to a prescan (parallel prefix sum):
For 256 experts, the previous single-block execution has thread 0 (warp 0) process all 256 elements; in the load-balanced version, each of 16 threads scans its own chunk of 16 elements, and thread 15 then additionally accumulates the 16 per-chunk totals, so the longest path is 16 + 16 = 32 elements. Hence the wavefront is reached faster.
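The two-phase scan above can be sketched in plain Python/NumPy. This is a conceptual model only, not the actual CUDA kernel: the function names, the 256-expert size, and the 16-thread split are illustrative assumptions, and the "thread" loops run sequentially here but would execute in parallel on the GPU.

```python
import numpy as np

def cumsum_single_thread(counts):
    # Baseline: one thread (thread 0) walks all experts sequentially,
    # so 256 additions sit on the critical path for 256 experts.
    out = np.zeros(len(counts) + 1, dtype=np.int64)
    for i in range(len(counts)):
        out[i + 1] = out[i] + counts[i]
    return out

def cumsum_load_balanced(counts, num_threads=16):
    # Phase 1 (conceptually parallel): each "thread" sums its own chunk
    # of len(counts) // num_threads experts (16 additions per thread).
    # Phase 2: the per-chunk totals are scanned into chunk offsets
    # (16 more additions), then each thread scans its chunk locally.
    # Critical path: 16 + 16 = 32 additions instead of 256.
    n = len(counts)
    chunk = n // num_threads
    out = np.zeros(n + 1, dtype=np.int64)

    partial = np.zeros(num_threads, dtype=np.int64)
    for t in range(num_threads):  # conceptually parallel
        partial[t] = counts[t * chunk:(t + 1) * chunk].sum()

    # Exclusive scan of the per-chunk totals -> starting offset per chunk.
    offsets = np.concatenate(([0], np.cumsum(partial)[:-1]))

    for t in range(num_threads):  # conceptually parallel
        acc = offsets[t]
        for i in range(t * chunk, (t + 1) * chunk):
            acc += counts[i]
            out[i + 1] = acc
    return out
```

Both functions produce the same inclusive cumsum (with a leading zero); only the shape of the dependency chain differs, which is what shortens the wavefront on the GPU.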
Modifications
old moe align kernel
new moe align kernel
Test Env
Checklist
Next Step