cuda: 1.2x faster dequantization kernel #2809
base: master
Conversation
Have you tested to see if this breaks CUDA compatibility for AMD cards using the recently merged ROCm pull request?
I tried it out:
Seems fine, identical results with a specific seed. Tested on a Q4_0 7B LLaMA1 model. No difference in speed that I can see, which isn't too surprising since like 90% of the time is spent in matrix multiplication.
ggml-cuda.cu (outdated)
dfloat2 dv0;
dv0.x = (int)(qs.x & 0xf) - 8;
dv0.y = (int)(qs.y & 0xf) - 8;
HIP/ROCm treats the x and y variables of a half2 as shorts, so I think this would work better, and then the same change for dv1 just below this:
#ifdef GGML_CUDA_F16
dv0 = __halves2half2((int)(qs.x & 0xf) - 8, (int)(qs.y & 0xf) - 8);
#else
dv0.x = (int)(qs.x & 0xf) - 8;
dv0.y = (int)(qs.y & 0xf) - 8;
#endif
edit: replaced make_half2 with __halves2half2, which has been part of the CUDA API for longer
Fixed. Introduced a make_dfloat2 macro to create the proper dfloat2 (half2 or float2).
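For reference, a minimal sketch of what such a macro could look like (assumed shape only; the actual definition in the PR may differ):
#ifdef GGML_CUDA_F16
typedef half2 dfloat2;
// build both lanes via half, avoiding assignments to half2's .x/.y members (shorts under HIP)
#define make_dfloat2(x, y) __halves2half2(x, y)
#else
typedef float2 dfloat2;
#define make_dfloat2(x, y) make_float2(x, y)
#endif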
What is the motivation behind this change? In which situation does the dequantization performance actually make a difference?
@JohannesGaessler I'm optimizing chatglm.cpp and found that the dequantization kernels account for ~50% of the total context (prompt) processing time. I have only tested short prompts, though. Here is the nsys profile.
I would suggest you try a longer prompt, the …
When I tested it using the …
noob question - what is the prompt length and content of the llama-bench program? Shouldn't the performance depend on what the prompt is asking for? E.g., "write a 50 page script" would be more tg-heavy than "read this <50 page script> and summarize", right?
I see. Thanks! So the …
The …
Thanks! That makes it a lot clearer now.
With the recent performance improvement of …
I don't think it is worth putting any effort into this; we need to implement matrix multiplication kernels that can use tensor cores, ideally with integer operations.
OK. For future reference, adding a data point for low-batch decoding on V100 (#3479)
Shouldn't that be handled by cublas/hipblas already? They should use tensor cores/WMMA.
Yes, but writing our own kernels would allow us to do this without having to dequantize the entire matrix to main memory first, and to use INT8 instead of FP16.
I see. Makes sense. For those kernels, I would highly recommend using mma.h (for CUDA/NVIDIA) and rocWMMA for AMD https://github.com/ROCmSoftwarePlatform/rocWMMA. This will make sure you have a single codebase for using tensor cores for both vendors. It also provides the additional benefit that it runs seamlessly on MI GPUs' matrix cores (however, you'd have to support wave64 mode, which shouldn't be too tricky). The main caveat for RDNA3 support is that it's restricted to 16x16x16 GEMM, whereas NVIDIA may support other GEMM sizes. Keeping all matmuls at 16x16 will make the code portable; you can specialize afterward.
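For reference, a minimal sketch (not llama.cpp code) of a single 16x16x16 tile multiply with the WMMA API from mma.h, using FP16 inputs and an FP32 accumulator; a real kernel would tile over K, stage data through shared memory, and handle bounds. rocWMMA exposes essentially the same fragment API for AMD. Requires a tensor-core capable GPU (sm_70+) and one full warp per tile.
#include <mma.h>
using namespace nvcuda;

// Illustrative only: one 16x16x16 tile multiply-accumulate on tensor cores.
// a: 16x16 row-major half, b: 16x16 col-major half, c: 16x16 row-major float.
__global__ void wmma_tile_16x16x16(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);           // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // c = a*b + c on tensor cores
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
// launch example: wmma_tile_16x16x16<<<1, 32>>>(d_a, d_b, d_c);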
Optimize CUDA dequantization with memory coalescing, achieving a 1.2x speedup. For now, I have only implemented the faster kernel for q4_0; if this PR gets accepted, I'll implement the rest.
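For readers unfamiliar with the term, here is a hypothetical illustration of memory coalescing (not the q4_0 kernel in this PR): when consecutive threads of a warp read consecutive addresses, the hardware merges each warp-wide load into a few contiguous transactions instead of many scattered ones.
// Hypothetical example: thread i touches element i, so a 32-thread warp reads
// one contiguous 128-byte segment per load instruction (fully coalesced).
__global__ void scale_coalesced(const float *in, float *out, float d, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = d * in[i];
    }
}
// A strided pattern such as in[i * 32] would instead scatter the warp's reads
// across many segments and waste most of the fetched bytes.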
I used nvprof to profile the kernels on a V100-SXM2 GPU with 900GB/s memory bandwidth:
nvprof --print-gpu-trace ./main -m ./models/7B/ggml-model-q4_0.gguf -p "Hello" -n 8 -ngl 32 -nommq
Before this PR, dequantizing an 11008x4096 q4_0 weight matrix took 332.10us.
With this PR, the same operation takes only 277.18us (a 1.2x speedup).
The utilized memory bandwidth reaches (11008x4096x(0.5+4)B)/277.18us = 682GB/s (76% of the 900GB/s peak), compared to the previous 569GB/s (63% of peak).