[QST] Cute GPTQ with asymmetric 4-bit encoding #1149
Hi! Thanks a lot for your question.
So it is. Thank you!
This issue has been labeled
Closing due to inactivity. Please reopen if needed.
@kroburg
Hello. I don't store encoded data in smem. The pipeline is global -> load -> decode -> store in smem. Open-sourcing is not feasible because, at the time of implementation, CuTe had some issues with subbyte data types, and I fixed them in a not-very-clean manner. Since then CuTe got a large update to that functionality, and my code would need to be rewritten.
UPD: I published just the "kernel" (since all the other stuff is not even compilable). Maybe it will help you find some insights: https://github.com/kroburg/cute_gptq/blob/main/cute_gptq_70.hpp
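For readers following along, a minimal sketch of that pipeline (global -> load -> decode -> store in smem) might look like this. It assumes eight 4-bit codes packed per 32-bit word and 128-element quantization groups; all identifiers are illustrative and not taken from cute_gptq_70.hpp:

```cpp
// Illustrative sketch only, not the code from cute_gptq_70.hpp.
// Assumes 8 codes per uint32_t and one (scale, zero) pair per 128 weights.
#include <cuda_fp16.h>
#include <cstdint>

__global__ void decode_tile(const uint32_t* __restrict__ packed,  // 8 weights per word
                            const __half*   __restrict__ scales,  // one per group
                            const uint8_t*  __restrict__ zeros,   // one per group
                            int words_per_block) {
  extern __shared__ __half smem[];                 // decoded fp16 tile

  for (int w = threadIdx.x; w < words_per_block; w += blockDim.x) {
    int idx = blockIdx.x * words_per_block + w;
    uint32_t word = packed[idx];                   // load from global
    int group = idx * 8 / 128;                     // 128-element groups
    float s = __half2float(scales[group]);
    int   z = zeros[group];
#pragma unroll
    for (int i = 0; i < 8; ++i) {                  // decode 8 nibbles
      int q = (word >> (4 * i)) & 0xF;
      smem[w * 8 + i] = __float2half(s * float(q - z));  // store in smem
    }
  }
  __syncthreads();
  // ... the MMA on the decoded fp16 tile would follow here ...
}
```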
Hello.
I've implemented GPTQ with asymmetric 4-bit encoding using CuTe.
GPTQ quantizes matrix weights to N bits using a scale and zero-point per group of (128) elements. Activations stay in fp16. It works well in terms of model quality for medium-size models (7B, 33B).
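In other words, each 4-bit code q is mapped back to fp16 as w = scale * (q - zero_point), with one (scale, zero_point) pair shared by a group of 128 weights. A minimal scalar sketch of that mapping (the group size of 128 comes from the description above; the identifiers are hypothetical):

```cpp
#include <cuda_fp16.h>
#include <cstdint>

constexpr int kGroupSize = 128;  // one (scale, zero) pair per 128 weights

// Asymmetric dequantization of a single 4-bit code (illustrative only).
__host__ __device__ inline __half dequant4(uint8_t q,       // 4-bit code, 0..15
                                           __half scale,    // per-group scale
                                           uint8_t zero) {  // per-group zero-point
  return __float2half(__half2float(scale) * float(int(q) - int(zero)));
}
```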
While working on it I introduced some fixes into the CuTe codebase which are (as I feel) not really good from a design/architecture perspective.
Also, performance is disappointing. It takes 2.4 ms for MNK = 2048, 4096, 4096 vs. 1.7 ms with cuBLAS (fp16 x fp16) on an RTX 4090. It looks like it would be much faster to dequantize the weights into a temporary fp16 buffer and run cuBLAS, roughly as sketched below.
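That fallback might look roughly like this on the host side. This is a hedged sketch: `dequant_kernel` stands in for any 4-bit-to-fp16 decode kernel and is not from the repo, and error handling is omitted:

```cpp
// Sketch of the "dequantize to temporary fp16, then run cuBLAS" fallback.
#include <cublas_v2.h>
#include <cuda_fp16.h>

void gptq_gemm_via_cublas(cublasHandle_t handle,
                          const __half* A,       // activations, M x K, row-major
                          const uint32_t* B_q4,  // packed 4-bit weights
                          const __half* scales, const uint8_t* zeros,
                          __half* C, int M, int N, int K) {
  __half* B_fp16 = nullptr;
  cudaMalloc(&B_fp16, size_t(K) * N * sizeof(__half));  // temporary buffer

  // Decode packed weights into the temporary fp16 buffer.
  // dequant_kernel is a placeholder, not a real kernel from the repo:
  // dequant_kernel<<<grid, block>>>(B_q4, scales, zeros, B_fp16, K * N);

  const __half alpha = __float2half(1.f), beta = __float2half(0.f);
  // cuBLAS is column-major; compute C^T = B^T * A^T to get row-major C.
  cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
              N, M, K, &alpha,
              B_fp16, N, A, K, &beta, C, N);

  cudaFree(B_fp16);
}
```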
Examining existing questions/PRs shows me that there is demand for such a GEMM, so I'd like to contribute it somehow.
The questions are: