Optimized fused MoE Kernel, take 2 #2979
Conversation
@WoosukKwon This is now ready to review :)
Hi @pcmoritz, could you provide the exact benchmark setup? I benchmarked this PR for Mixtral with different batch sizes (from 1 to 256) on 4 H100 GPUs, but didn't see any noticeable speedup.
@WoosukKwon Thanks for trying it out! The PR only includes optimized settings for TP2 on H100 (and I also added the optimizations for TP4 on A100 80 GB that I have gathered since then), so there is no difference for TP4 on H100 vs. main. I added a README to explain this. For Mixtral it only really makes sense to run TP2 on H100. Instead of TP4 on H100, it is better to run two replicas of TP2 on H100 and split the traffic between them (i.e. for a given latency budget, that gives more throughput). On A100, TP4 is the optimal setting in my experience.
@pcmoritz Got it. Thanks for the explanation!
@pcmoritz Thanks for the PR! The speedup is really nice.
BTW, could you provide a script to tune the parameters and also to select the default config? Otherwise, I'd be happy to implement it; I have a little experience in tuning Triton kernels.
@@ -0,0 +1,20 @@
{
Could you provide a script to tune these parameters? No worries otherwise. I can implement it.
Yes, happy to add the script. The process of tuning is not fully automatic at the moment and requires some manual modifications, but I will contribute what I have 😊
I added the script benchmark_mixtral_moe.py to do the search -- in practice I wasn't using exactly this script but made some modifications to it as I searched through the different batch sizes. Still, it should be a good way to get started :)
Maybe by doing a more exhaustive search we could improve on these parameters further, but I think the gap would be pretty small even if we find something better :)
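For readers who want to reproduce this kind of search on other hardware, the loop below sketches the general idea: sweep candidate Triton launch parameters per batch size and keep the fastest one. It is a minimal illustration and not the actual benchmark_mixtral_moe.py; the search space, the benchmark_config helper, and the batch-size list are placeholders you would replace with a call into the fused MoE kernel from this PR.

```python
# Illustrative sketch of a per-batch-size grid search over Triton launch
# parameters; benchmark_config and the candidate values are placeholders,
# not the actual contents of benchmark_mixtral_moe.py.
import itertools
import json
import time

# Candidate launch parameters to sweep (example values only).
SEARCH_SPACE = {
    "BLOCK_SIZE_M": [16, 32, 64, 128],
    "BLOCK_SIZE_N": [32, 64, 128, 256],
    "BLOCK_SIZE_K": [64, 128, 256],
    "GROUP_SIZE_M": [1, 16, 32, 64],
    "num_warps": [4, 8],
    "num_stages": [2, 3, 4],
}


def benchmark_config(config: dict, batch_size: int) -> float:
    """Stand-in timing hook.

    Replace the body with code that launches the fused MoE kernel using
    `config` for `batch_size` tokens (synchronizing the GPU before reading
    the timer) and return the measured latency in milliseconds.
    """
    start = time.perf_counter()
    # ... run the kernel here ...
    return (time.perf_counter() - start) * 1000.0


def tune(batch_sizes):
    keys = list(SEARCH_SPACE)
    best = {}
    for bs in batch_sizes:
        timings = []
        for values in itertools.product(*(SEARCH_SPACE[k] for k in keys)):
            config = dict(zip(keys, values))
            timings.append((benchmark_config(config, bs), config))
        # Keep the fastest configuration for this batch size.
        best[str(bs)] = min(timings, key=lambda t: t[0])[1]
    return best


if __name__ == "__main__":
    # Batch sizes are examples; sweep whatever sizes matter for your workload.
    print(json.dumps(tune([1, 8, 64, 256]), indent=4))
```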
Thanks!
I made all the updates now, PTAL :)
Co-authored-by: Woosuk Kwon <[email protected]>
This is awesome, thanks @pcmoritz! TP2 A100 would also be very useful; we can have a go at adding a config for that.
@njhill Thanks, that would be very much appreciated, feel free to tag me in the PR :)
Co-authored-by: Cade Daniel <[email protected]>
Do I understand correctly that I can use this PR to improve inference time on my cluster, since it allows a parameter search for the optimal kernel parameters? If so, how?
This replaces #2913
With a more aggressive parameter search, we were actually able to beat the TensorRT kernels even at small batch sizes, which is very fortunate since it reduces the complexity a lot.
Here are the results on H100 with TP2 for:
- this PR
- the current main branch (untuned fused MoE kernel)
- only using the TensorRT MoE kernels
The code is structured in such a way that it is very easy to just drop a new .json file into the fused_moe/configs directory to support a different configuration.
Co-authored-by: Cade Daniel [email protected]
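As a rough illustration (not copied from this PR), such a config file maps each benchmarked batch size to a set of Triton launch parameters; the exact key names and the file-naming convention should be taken from the existing files in fused_moe/configs. A hypothetical sketch of producing one in Python:

```python
# Hypothetical example of writing a fused MoE config file; the parameter
# values below are placeholders, and the key names / file name should be
# checked against the existing files in fused_moe/configs.
import json

config = {
    # batch size -> Triton launch parameters found by the search
    "1": {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 64, "BLOCK_SIZE_K": 128,
          "GROUP_SIZE_M": 1, "num_warps": 4, "num_stages": 4},
    "64": {"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128,
           "GROUP_SIZE_M": 16, "num_warps": 8, "num_stages": 3},
}

with open("my_new_device_config.json", "w") as f:  # placeholder file name
    json.dump(config, f, indent=4)
```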