Add AQLM support (experimental) #5466
Conversation
Hi @oobabooga, is the blacksamorez aqlm model an official aqlm model (do they have a repo?) or someone's attempt at quantizing with their code? I've been trying to find an officially released aqlm model but haven't been able to, and the aqlm paper is lacking some important details. The numbers in your table seem to indicate aqlm does worse than old quip#, which is contrary to what the aqlm arxiv claims. On the quip# side, we do have some new and significantly improved 2, 3, and 4 bit models that I'm going to announce later this week. We've also had better 3 and 4 bit models for a while now under "E8PRVQ" on huggingface (iirc you expressed interest in this when we first announced quip#) but I never got around to announcing those.
@tsengalb99 those models are included in the Google Colab notebook linked in the README for AQLM, so I think they are official: https://github.com/Vahe1994/AQLM My numbers are not enough for a conclusion as the dataset is small (some 10 samples of 1200 tokens). I haven't been able to do a bigger wikitext test so far. It's exciting to hear that you have released better quantized models and that better ones are to come. I hope to be able to compare everything and find the Pareto frontiers some time this month.
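For reference, a small-sample perplexity check of the kind described above can be sketched roughly like this; the model id, sample count, and chunk length below are illustrative placeholders, not the exact evaluation used for the table in this PR:

```python
# Rough sketch of a small-sample perplexity check (not the exact script used here).
# Model id, number of samples, and sequence length are illustrative placeholders.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids

seq_len, n_samples = 1200, 10  # small test: ~10 samples of ~1200 tokens
nlls = []
for i in range(n_samples):
    chunk = ids[:, i * seq_len : (i + 1) * seq_len].to(model.device)
    with torch.no_grad():
        out = model(chunk, labels=chunk)  # loss = mean negative log-likelihood over the chunk
    nlls.append(out.loss)

print("perplexity:", torch.exp(torch.stack(nlls).mean()).item())
```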
Hi @oobabooga and @tsengalb99,
One of the AQLM authors here.
Cheers,
Disabling
@oobabooga btw I just updated the quip-sharp repo with the latest code. The latest models are on HF and the preprint is on arxiv as well.
Thanks @tsengalb99. So, according to your data, AQLM does not surpass QuIP#, or at least the updated QuIP#.
Correct. It looks like AQLM also had some updated numbers for ICML vs. what we had in our preprint, but the latest QuIP# should still be better. I think the important thing right now is getting CUDA graphs to work with HF, because without CUDA graphs both methods spend most of their time on kernel launches. Arthur merged his PR, but at least with the way I was using CUDA graphs, the latest transformers still doesn't work. Need to look into this more.
Main with compile is broken; huggingface/transformers#28937 should fix it!
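For context, the setup being discussed (a static KV cache plus torch.compile, whose reduce-overhead mode captures and replays CUDA graphs so decoding is not dominated by kernel-launch overhead) looks roughly like this in recent transformers versions. The exact API was still in flux at the time, and the model id is a placeholder, so treat this as a sketch:

```python
# Sketch: static-cache generation with torch.compile (reduce-overhead uses CUDA graphs).
# API details depend on the transformers version; the model id is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="cuda")

# A static cache keeps tensor shapes fixed so the compiled graph can be reused per step.
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("Hello", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```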
I have managed to test it. That's not very informative, but there you go. It at least tells me that the performance of (old?) AQLM is pretty impressive, as old QuIP# was already very good.
The updated QuIP# models are under the same model cards on HF, so if you get bored you should be able to rerun eval on new QuIP# by just calling the same command, since HF will redownload the new models.
Hi @oobabooga!
Thanks for the info @BlackSamorez. I still haven't had time to do a thorough perplexity comparison, as there are many new methods now, including llama.cpp with calibration (imatrix), HQQ, EXL2 (updated a few months ago), AQLM, and updated QuIP# (@tsengalb99). Methods with calibration in particular require a preliminary study on what calibration dataset to use. In any case, since AQLM is now fully integrated with the transformers library, I will merge this PR, which just adds the aqlm requirement so that models available at https://huggingface.co/ISTA-DASLab can be loaded.
This method claims to be better than QuIP# at 2-bit accuracy: https://arxiv.org/abs/2401.06118
Not much had to be added, as fortunately the authors integrated it into HF transformers through custom Python code provided with the model checkpoint. In the latest transformers version, AQLM is fully integrated, so all this PR does is add the aqlm requirement. AQLM models should be loaded with the transformers loader.
Quantized models can be found at: https://huggingface.co/ISTA-DASLab
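Loading one of those checkpoints goes through the standard transformers API once the `aqlm` package is installed. A minimal sketch, where the model id is just one example from that organization (check the hub page for the current list):

```python
# Minimal sketch: loading an AQLM-quantized checkpoint with the transformers loader.
# Requires the `aqlm` package; the model id is one example from ISTA-DASLab and may
# differ from what is currently published on the hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

inputs = tokenizer("AQLM is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```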
Old description
All that needs to be done is install the requirements and load the model with `--trust-remote-code`. I also had to disable `'low_cpu_mem_usage': True`.

Example
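In plain transformers terms (rather than through the web UI's `--trust-remote-code` flag), that custom-code loading path looks roughly like the sketch below; the model id is an illustrative placeholder:

```python
# Sketch of the old custom-code path: trust_remote_code pulls in the Python code
# shipped with the checkpoint, and low_cpu_mem_usage is disabled as noted above.
# The model id is an illustrative placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BlackSamorez/Llama-2-7b-AQLM-2Bit-1x16-hf"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,   # load the custom AQLM modeling code from the repo
    low_cpu_mem_usage=False,  # had to be disabled for these checkpoints
    device_map="auto",
)
```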
Perplexity
On a small test that I have been running since the beginning of last year to compare different quantizations (same as in #4803):

The `1x16` variant is probably the best one, but I couldn't evaluate it due to lack of memory.