support 2bit quip# method #1293

Closed
Minami-su opened this issue Dec 23, 2023 · 25 comments
Labels: PRs welcome to address this (contributions are welcome from community members on this issue)

@Minami-su

https://github.com/Cornell-RelaxML/quip-sharp

@Minami-su
Author

[image attachment]

@younesbelkada added the "PRs welcome to address this" label Dec 28, 2023
@younesbelkada
Contributor

younesbelkada commented Dec 28, 2023

Hi!
We are definitely interested in adding QuIP# inference support to the HF ecosystem, similarly to GPTQ, AWQ, etc.!
Tagging one of the main authors of QuIP# here: @tsengalb99 - what is the current canonical way to use the QuIP# kernels? Is there a package on PyPI with pre-compiled kernels for users to run inference? We can also support inference with kernels that need to be manually built by users (e.g. for the llm-awq package (AWQ) we defined a "backend" variable in the quantization config: https://github.com/huggingface/transformers/blob/main/src/transformers/utils/quantization_config.py#L58, and users switch between backends depending on whether they use the kernels from the official repository or from autoawq - that way, if in the future there is a package that ships pre-compiled kernels for QuIP#, we can support it easily by just swapping the backend).

cc @SunMarc @Titus-von-Koeller FYI
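
For context, a minimal sketch of the backend-switching pattern described above, assuming a hypothetical QuipSharpConfig; the field names are illustrative, not the actual transformers API (the real AWQ example lives in the linked quantization_config.py):

```python
from dataclasses import dataclass


@dataclass
class QuipSharpConfig:
    """Hypothetical quantization config sketch, modeled on the AWQ `backend` idea."""

    bits: int = 2
    codebook: str = "E8P12"
    # Which kernel source to use at inference time, e.g. the manually built
    # quiptools kernels today vs. a future pre-compiled PyPI package.
    backend: str = "quiptools"


def load_kernels(config: QuipSharpConfig):
    # A loader could branch on the backend to pick where the kernels come from.
    if config.backend == "quiptools":
        print("Using manually built quiptools kernels")
    elif config.backend == "prebuilt":
        print("Using pre-compiled kernels from a (future) PyPI package")
    else:
        raise ValueError(f"Unknown backend: {config.backend}")


load_kernels(QuipSharpConfig())
```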

@tsengalb99

The canonical way to install the QuIP# kernels is to install the fast-hadamard-transform package and build quiptools (in our codebase on GitHub). We do not have a PyPI package yet but are planning on having one in the future when the project becomes more stable. The two key "linear" classes that QuIP# relies on are here: https://github.com/Cornell-RelaxML/quip-sharp/tree/main/lib/linear, and you can see how we replace nn.Linear in Llama with those classes in https://github.com/Cornell-RelaxML/quip-sharp/blob/main/model/llama.py.

A few questions: how are you planning on integrating QuIP# into the Hugging Face code? Where will it be integrated, and how will you keep up with future iterations of QuIP#?
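
As a rough illustration of the nn.Linear replacement pattern mentioned above (the class below is only a placeholder, not QuIP#'s actual QuantizedLinear, which lives in the linked lib/linear files):

```python
import torch
import torch.nn as nn


class PlaceholderQuantizedLinear(nn.Module):
    """Stand-in for QuIP#'s quantized linear layers (illustrative only)."""

    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        # A real implementation would store packed codebook indices here and
        # call the quiptools / fast-hadamard-transform kernels in forward().

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        raise NotImplementedError("placeholder; see quip-sharp's lib/linear")


def replace_linear(module: nn.Module, quantized_cls=PlaceholderQuantizedLinear) -> None:
    """Recursively swap nn.Linear submodules for a quantized replacement."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(
                module,
                name,
                quantized_cls(child.in_features, child.out_features, bias=child.bias is not None),
            )
        else:
            replace_linear(child, quantized_cls)
```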

@younesbelkada
Contributor

younesbelkada commented Dec 29, 2023

Hi @tsengalb99
Thanks for your response! I am not 100% familiar with QuIP# yet, but what I had in mind was a similar approach to AWQ, i.e. replacing torch.nn.Linear layers with QuantizedLinear from the QuIP# codebase. The core code would live inside a file quip.py here, and we would replace the linear layers at init, before loading the weights. Specifically, we would support QuIP# inference but not quantization; for quantizing with QuIP# we would redirect users to your codebase.
To detect whether a model has been quantized with QuIP#, we will add a quantization_config attribute to the config object, e.g. https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-AWQ/blob/main/config.json#L20, and retrieve all necessary arguments from there.
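
A minimal sketch of the detection step described above; the QuIP# entries are hypothetical, only the general quantization_config pattern mirrors the linked AWQ config.json:

```python
# config.json-style dict with a hypothetical QuIP# quantization_config entry;
# the linked AWQ checkpoint uses the same pattern with quant_method="awq".
config = {
    "model_type": "llama",
    "quantization_config": {
        "quant_method": "quip#",  # hypothetical value
        "bits": 2,
        "codebook": "E8P12",
    },
}

quant_cfg = config.get("quantization_config")
if quant_cfg is not None and quant_cfg.get("quant_method") == "quip#":
    # At init, before the weights are loaded, the nn.Linear layers would be
    # swapped for QuIP#'s QuantizedLinear using these arguments.
    print("Detected a QuIP#-quantized checkpoint:", quant_cfg)
```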

@tsengalb99

I think it would be best to avoid duplicating code from the QuIP# codebase. The QuantizedLinear class is not standalone and relies on implementations in the codebook files (e.g. here for E8P: https://github.com/Cornell-RelaxML/quip-sharp/blob/1d6e3c2d4c144eba80b945cca5429ce8d79d2cec/lib/codebook/latticee8_padded12.py#L180), which means you'll have to copy all of those over as well. QuIP# is still in active development and we will almost certainly make changes to the codebooks in the future, which would require you to update your copies as well. Perhaps you can include QuIP# as a submodule or something similar so users only have to pull our code once.
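
For example, rather than vendoring the classes, the integration could import them from a quip-sharp checkout that users pull themselves; the module paths below are assumptions based on the repository layout linked above and may change as the project iterates:

```python
# Assumes the quip-sharp repository is on the import path, e.g. checked out as
# a git submodule or installed from source (module paths follow the repository
# layout linked above and are an assumption here, not a published API).
try:
    from lib.codebook import latticee8_padded12  # E8P codebook referenced above
    from lib import linear as quip_linear  # the quantized "linear" classes
except ImportError as exc:
    raise ImportError(
        "QuIP# not available; clone Cornell-RelaxML/quip-sharp and build quiptools"
    ) from exc
```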

@huggingface deleted a comment from the github-actions bot Jan 29, 2024
@younesbelkada
Contributor

We still plan to support QuIP# inference. @tsengalb99, I will provide more details once huggingface/transformers#28703 gets merged.

@younesbelkada
Contributor

Hi @tsengalb99 !
Great news 🎉 - we just merged huggingface/transformers#26610 to enable developers to easily add inference support for new quantization methods in HF transformers! Would you like to try your hand at integrating QuIP# inference support in HF transformers? There is a detailed guide on how to get started here: https://huggingface.co/docs/transformers/main/en/hf_quantizer. Let us know what you think!
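
For anyone picking this up, the new interface roughly amounts to subclassing HfQuantizer; the sketch below is only the approximate shape (method names and import path may differ slightly - the linked guide documents the actual interface):

```python
from transformers.quantizers import HfQuantizer  # import path may differ by version


class QuipSharpHfQuantizer(HfQuantizer):
    """Sketch of a QuIP# quantizer plugin; not a working implementation."""

    # Quantization itself would stay in the QuIP# codebase; transformers would
    # only load already-quantized checkpoints.
    requires_calibration = True

    def validate_environment(self, *args, **kwargs):
        # e.g. check that the quiptools kernels and fast-hadamard-transform import
        pass

    def _process_model_before_weight_loading(self, model, **kwargs):
        # swap nn.Linear modules for QuIP#'s QuantizedLinear before weights load
        pass

    def _process_model_after_weight_loading(self, model, **kwargs):
        return model

    @property
    def is_trainable(self):
        return False

    @property
    def is_serializable(self):
        return True
```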

@tsengalb99

tsengalb99 commented Feb 2, 2024 via email

@younesbelkada
Contributor

Awesome, thanks very much @tsengalb99 !

@younesbelkada
Contributor

Hi @tsengalb99! Let me know if you need any help to kick off QuIP# integration in transformers! 🙏 With the recent quantizer support it should be quite straightforward, and I am happy to help if needed.

@tsengalb99

tsengalb99 commented Feb 13, 2024 via email

@younesbelkada
Contributor

Awesome, thanks so much @tsengalb99, let me know if you run into any issues!

@tsengalb99

@younesbelkada we've finally started working on this; expect some progress in a week or so.

@younesbelkada
Contributor

Nice, thanks very much! Let me know if you need any help or guidance!
You could take some inspiration from huggingface/transformers#28928 to get started! 🙏 Thanks again and looking forward to the PR! 🚀


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@tsengalb99

We are still working on integration, albeit very slowly.

@younesbelkada
Contributor

thanks again @tsengalb99 ! 🚀

@Minami-su
Author

AQLM already works fine: #1476


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@SunMarc reopened this Apr 29, 2024
@github-actions bot closed this as completed May 7, 2024
@SunMarc reopened this May 7, 2024
@SunMarc reopened this May 16, 2024
@SunMarc reopened this May 27, 2024
@github-actions bot closed this as completed Jun 4, 2024
@tsengalb99

tsengalb99 commented Jun 4, 2024 via email

@SunMarc
Member

SunMarc commented Jun 5, 2024

Thanks for the update @tsengalb99! Very excited about this new method 🔥 Would you mind explaining a bit more why CUDA graphs are needed? Also, in general, do you have any recommendations on what to improve in transformers to allow better support of quantization methods?

@tsengalb99

tsengalb99 commented Jun 6, 2024 via email

@ArthurZucker
Collaborator

CUDA graphs are supported in transformers for models that support a static KV cache.

@tsengalb99

tsengalb99 commented Jul 6, 2024 via email

@ArthurZucker
Collaborator

For now, torch.compile should be run on forward, not generate! huggingface/transformers#30788 will add end-to-end support.
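
Concretely, for a model that supports the static KV cache, that translates to something like the following sketch (the model id and compile options are just examples, not requirements):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # example model with static-cache support
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Static KV cache is what makes the compiled forward CUDA-graph friendly.
model.generation_config.cache_implementation = "static"

# Compile the forward pass rather than wrapping generate(); end-to-end
# compilation of generate() is what the linked PR is expected to add.
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=16, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```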
