
[QST] Cute GPTQ with asymmetric 4-bit encoding #1149

Closed
kroburg opened this issue Oct 17, 2023 · 6 comments

@kroburg
Contributor

kroburg commented Oct 17, 2023

Hello.

I've implemented GPTQ with asymmetric 4-bit encoding using CuTe.

GPTQ quantizes weight matrices to N bits, using a scale and zero-point per group of (128) elements; activations stay in fp16 format. Model quality holds up well for medium-sized models (7B, 33B).
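For readers unfamiliar with the scheme, here is a minimal host-side sketch of asymmetric per-group quantization (hypothetical illustration, not the author's kernel code; names like `QuantGroup` are made up):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Asymmetric 4-bit quantization of one group of weights.
// Each group shares one fp scale and one integer zero-point:
//   q = clamp(round(w / scale) + zero, 0, 15)
//   w ≈ scale * (q - zero)
struct QuantGroup {
    float scale;
    int zero;                    // zero-point in [0, 15]
    std::vector<uint8_t> q;     // one 4-bit code per weight
};

QuantGroup quantize_group(const std::vector<float>& w) {
    float wmin = *std::min_element(w.begin(), w.end());
    float wmax = *std::max_element(w.begin(), w.end());
    wmin = std::min(wmin, 0.0f);  // keep 0.0 exactly representable
    wmax = std::max(wmax, 0.0f);
    QuantGroup g;
    g.scale = (wmax - wmin) / 15.0f;
    if (g.scale == 0.0f) g.scale = 1.0f;  // all-zero group guard
    g.zero = (int)std::lround(-wmin / g.scale);
    for (float x : w) {
        int q = (int)std::lround(x / g.scale) + g.zero;
        g.q.push_back((uint8_t)std::clamp(q, 0, 15));
    }
    return g;
}

float dequantize(const QuantGroup& g, size_t i) {
    return g.scale * ((int)g.q[i] - g.zero);
}
```

The zero-point is what makes the encoding asymmetric: the 16 codes cover the actual [min, max] range of the group rather than a symmetric [-max, max] one, so the reconstruction error per weight is bounded by half a quantization step.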

While working on it I made some fixes to the CuTe codebase which are (as I feel) not really good from a design/architecture perspective.

I also get disappointing performance: 2.4 ms for MNK = 2048, 4096, 4096 vs 1.7 ms with cuBLAS (fp16 × fp16) on an RTX 4090. It looks like it would be much faster to dequantize the weights into (temporary) fp16 and run cuBLAS.

Examining existing questions/PRs shows me that there is demand for such a GEMM, so I would like to contribute it somehow.

The questions are:

  1. What is the best way to submit my fixes for subbyte support?
  2. Those fixes are tightly coupled to my GemmUniversal and MixedMma mainloop. Would a new example be the right place for them?
  3. Can you please help with performance issue?
@thakkarV
Collaborator

Hi! thanks a lot for your question.

  1. We have updates to subbyte iterator support in CuTe landing with the 3.3 release in the coming weeks. I would prefer you wait until they come out and see if they resolve your issues. If they do not, a PR rebased on top of 3.3 will be welcome :)
  2. A new example would be the best way to go about it. That said, we have mixed input GEMMs with dequant fused into the mainloop coming with the 3.3 release as well that will be a first class citizen of the 3.x API. Please see #1134
  3. Sure. Have you compared your perf against the existing Ampere dequant-fused kernel implementations?

@kroburg
Contributor Author

kroburg commented Nov 2, 2023

So it is. Thank you!
I will explore the new codebase and rebase.


github-actions bot commented Dec 2, 2023

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@mnicely
Collaborator

mnicely commented Dec 5, 2023

Closing due to inactivity. Please reopen if needed

@mnicely mnicely closed this as completed Dec 5, 2023
@jeromeku
Contributor

jeromeku commented Mar 9, 2024

@kroburg
Very interested in subbyte quantization implementations using Cutlass 3.x / CuTe. Would you consider open sourcing your GPTQ implementation? Curious how you handled the smem layout / register shuffling to get the right thread / value when dealing with 4 bit types...

@kroburg
Contributor Author

kroburg commented Mar 11, 2024

> @kroburg Very interested in subbyte quantization implementations using Cutlass 3.x / CuTe. Would you consider open sourcing your GPTQ implementation? Curious how you handled the smem layout / register shuffling to get the right thread / value when dealing with 4 bit types...

Hello.

I don't store the encoded data in smem. The pipeline is global -> load -> decode -> store in smem, so the smem layout is an ordinary fp16 layout.
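The decode step of that pipeline can be sketched as follows (a hypothetical host-side illustration, not the published kernel: float stands in for fp16, a plain array for smem, and low-nibble-first packing is an assumption):

```cpp
#include <cstdint>

// Decode step of the pipeline global -> load -> decode -> store:
// each byte packs two 4-bit codes (low nibble first, by assumption);
// decoded values are written in the element type the MMA consumes
// (fp16 in the actual kernel; float here for illustration).
void decode_int4_group(const uint8_t* packed, int n_bytes,
                       float scale, int zero, float* out) {
    for (int i = 0; i < n_bytes; ++i) {
        int lo = packed[i] & 0x0F;
        int hi = (packed[i] >> 4) & 0x0F;
        out[2 * i]     = scale * (lo - zero);  // w = scale * (q - zero)
        out[2 * i + 1] = scale * (hi - zero);
    }
}
```

Because the decode happens before the smem store, the shared-memory tile and the MMA-side copy atoms never see a subbyte type, which is why no special 4-bit register shuffling is needed downstream.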

Open-sourcing is not feasible: at the time of implementation CuTe had some issues with subbyte data types, and I fixed them in a not-very-clean manner. Since then CuTe has received a large update to that functionality, so my code would need to be rewritten.

UPD: I published just the "kernel" (since everything else is not even compilable). Maybe it will help you find some insights: https://github.com/kroburg/cute_gptq/blob/main/cute_gptq_70.hpp
