AQLM Quantization #5465
Comments
Performance is pretty bad, but it may become better after huggingface/transformers#27931.
I think the exl2 format is the most attractive at this time and may become the most popular. GPTQ is currently unavoidable because it's the only format that supports training.
Dunno about that man, have you checked the new IQ2/IQ3_XXS quants? (Early stages and I haven't tried them, but the ppl seems promising! Though for now I use exl2 because it's just so fast.)
Yes, that's exactly why it is the most attractive: it runs on the ExLlama engine, which is an order of magnitude faster than anything else and allows huge context sizes. At this time I just don't see any usable alternatives.
This issue has been closed due to inactivity for 2 months. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.
It seems that AQLM is supported by now, but there's no way to use it on Windows: the aqlm[gpu] dependency requires triton, and triton is not available on Windows. @oobabooga, can you confirm?
Description
AQLM (GitHub, Paper, Reddit discussion) is a novel quantization method that focuses on 2-2.5 bits per weight. It claims to surpass QuIP# and even 3-bit GPTQ, and (allegedly) allows a 70b model to run on a 3090 with surprisingly good PPL.
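For a rough sense of why 2-2.5 bits matters for the "70b on a 3090" claim, here is a back-of-the-envelope sketch. It counts only the quantized weights and assumes a nominal 70e9 parameters and 24 GiB of VRAM; it ignores codebooks, any layers kept in higher precision, the KV cache and activations, so real usage will be higher. The function name is made up for illustration and is not part of AQLM.

```python
def weight_footprint_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


N_PARAMS = 70e9  # nominal "70b" parameter count (assumption)
VRAM_GIB = 24    # RTX 3090 memory

for bits in (2.0, 2.5, 3.0, 4.0):
    size = weight_footprint_gib(N_PARAMS, bits)
    verdict = "fits" if size < VRAM_GIB else "does not fit"
    print(f"{bits} bits/weight: {size:.1f} GiB -> {verdict} in {VRAM_GIB} GiB")
```

By this estimate the weights take about 16 GiB at 2 bits and about 20 GiB at 2.5 bits, while 3 bits already lands slightly above 24 GiB before any runtime overhead.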
Additional Context
According to my high-accuracy crystal ball which I bought from The Onion a decade ago, TheBloke will ignore this and continue to release half a dozen quants of GPTQ per model until late 2026, no matter what else dethrones it.