support w8a8 fp8 kernel with CUTLASS #3047
Conversation
Force-pushed from c70943d to cd51083 ("clean code")
We have fixed the review issues and resolved the conflicts. We also tried to optimize performance on sm90, but it still cannot beat vLLM in all cases. The final results show that our kernel and vLLM's each have their own advantages in different cases.
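Since neither kernel wins everywhere, one practical pattern is to dispatch by problem shape at runtime. The sketch below is purely illustrative: the function name, the M-dimension threshold, and the kernel labels are assumptions, not taken from this PR.

```python
# Hypothetical dispatch sketch: pick a w8a8 fp8 GEMM backend per problem
# shape, since the PR reports each kernel wins in different cases.
# The threshold (M <= 64, i.e. small decode batches) is an assumption.

def pick_kernel(m: int, on_sm90: bool = True) -> str:
    """Return a backend label for an (M x K) @ (K x N) fp8 GEMM."""
    if on_sm90 and m <= 64:
        return "cutlass_w8a8_fp8"  # assumed better for small-M shapes
    return "vllm_w8a8_fp8"         # assumed better for large-M shapes
```

A real implementation would tune such thresholds per GPU from benchmark sweeps rather than hard-coding them.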
Force-pushed from 53fd85a to a1b582e
Why does it fail when I run `pip install .` in the sgl-kernel directory?
@ll2088 Please run
@ll2088 The build-wheels CI works well, so I think the issue is caused by your local environment.
Which version of flashinfer are you using?
@HandH1998 Please paste the latest benchmark results. Thanks!
Support sm89 and sm90 fp8 GEMM implementations with CUTLASS for w8a8 fp8 quantization. Co-authors: @yych0745 @b0urnee
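For readers unfamiliar with the scheme, a w8a8 fp8 GEMM multiplies fp8-quantized activations (per-token scales) by fp8-quantized weights (per-output-channel scales) and folds both scales into the output. The plain-Python sketch below emulates those semantics only: it scales and clamps to the fp8 E4M3 finite range (max 448) but does not perform real fp8 rounding, and it is not the CUTLASS kernel from this PR.

```python
# Emulated w8a8 fp8 GEMM semantics (no real fp8 rounding; illustration only).
FP8_E4M3_MAX = 448.0  # largest finite value in fp8 E4M3

def quantize_per_row(mat):
    """Symmetric per-row quantization into the fp8 E4M3 range.

    Used for per-token activation scales (rows of A) and, applied to the
    weight stored as [N][K], for per-output-channel weight scales.
    """
    out, scales = [], []
    for row in mat:
        amax = max(abs(v) for v in row) or 1.0
        s = amax / FP8_E4M3_MAX
        out.append([max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / s)) for v in row])
        scales.append(s)
    return out, scales

def w8a8_fp8_gemm(a_q, w_q, a_scales, w_scales):
    """C[m][n] = (sum_k A_q[m][k] * W_q[n][k]) * a_scale[m] * w_scale[n].

    a_q is [M][K]; w_q is the quantized weight stored as [N][K].
    """
    m_dim, k_dim, n_dim = len(a_q), len(a_q[0]), len(w_q)
    c = [[0.0] * n_dim for _ in range(m_dim)]
    for m in range(m_dim):
        for n in range(n_dim):
            acc = sum(a_q[m][k] * w_q[n][k] for k in range(k_dim))
            c[m][n] = acc * a_scales[m] * w_scales[n]
    return c

# Usage: quantize, multiply, and recover the full-precision product.
A = [[0.5, -1.0], [2.0, 4.0]]
W = [[1.0, 0.0], [0.5, 0.5]]  # weight as [N][K]
Aq, a_s = quantize_per_row(A)
Wq, w_s = quantize_per_row(W)
C = w8a8_fp8_gemm(Aq, Wq, a_s, w_s)
```

Because this emulation only rescales (no rounding), `C` matches the exact float product `A @ W.T`; a real fp8 kernel would additionally introduce quantization error from fp8 mantissa rounding.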
Benchmark
GPU: sm89-L40
[Benchmark charts, one per configuration; images not preserved:]
- meta-llama/Llama-3.1-8B-Instruct, TP=1
- meta-llama/Llama-3.3-70B-Instruct, TP=1
- mistralai/Mistral-Large-Instruct-2407, TP=1
- Qwen/Qwen2.5-7B-Instruct, TP=1
- Qwen/Qwen2.5-32B-Instruct, TP=1
- Qwen/Qwen2.5-72B-Instruct, TP=1
- deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct, TP=1
- meta-llama/Llama-3.1-8B-Instruct, TP=4
- meta-llama/Llama-3.3-70B-Instruct, TP=4
- mistralai/Mistral-Large-Instruct-2407, TP=4
- Qwen/Qwen2.5-7B-Instruct, TP=4
- Qwen/Qwen2.5-32B-Instruct, TP=4
- Qwen/Qwen2.5-72B-Instruct, TP=4
- deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct, TP=4
- meta-llama/Llama-3.1-8B-Instruct, TP=8
- meta-llama/Llama-3.3-70B-Instruct, TP=8
- mistralai/Mistral-Large-Instruct-2407, TP=8
- Qwen/Qwen2.5-7B-Instruct, TP=8
- Qwen/Qwen2.5-32B-Instruct, TP=8
- Qwen/Qwen2.5-72B-Instruct, TP=8
- deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct, TP=8
GPU: sm90-H100
[Benchmark charts, one per configuration; images not preserved:]
- meta-llama/Llama-3.1-8B-Instruct, TP=1
- meta-llama/Llama-3.3-70B-Instruct, TP=1
- mistralai/Mistral-Large-Instruct-2407, TP=1
- Qwen/Qwen2.5-7B-Instruct, TP=1
- Qwen/Qwen2.5-32B-Instruct, TP=1
- Qwen/Qwen2.5-72B-Instruct, TP=1
- deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct, TP=1
- meta-llama/Llama-3.1-8B-Instruct, TP=4
- meta-llama/Llama-3.3-70B-Instruct, TP=4
- mistralai/Mistral-Large-Instruct-2407, TP=4
- Qwen/Qwen2.5-7B-Instruct, TP=4
- Qwen/Qwen2.5-32B-Instruct, TP=4
- Qwen/Qwen2.5-72B-Instruct, TP=4
- deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct, TP=4
- meta-llama/Llama-3.1-8B-Instruct, TP=8
- meta-llama/Llama-3.3-70B-Instruct, TP=8
- mistralai/Mistral-Large-Instruct-2407, TP=8
- Qwen/Qwen2.5-7B-Instruct, TP=8
- Qwen/Qwen2.5-32B-Instruct, TP=8
- Qwen/Qwen2.5-72B-Instruct, TP=8
- deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct, TP=8