Recipe for quantizing + LoRA-ing models #15959
davisyoshida asked this question in Show and tell
I just put up a notebook showing how to combine GPT-Q and LoRA for finetuning JAX models. I already posted my LoRA implementation here, so the new piece is the GPT-Q script, which makes it really easy to apply quantization to arbitrary functions.
The torch implementation of GPT-Q mostly seems to rely on hand-writing a script for each new architecture (there's a whole repo just for applying it to LLaMA). In JAX I was able to write a single implementation that works for everything from a single matmul to HuggingFace's GPT-2 and T5 models.
Combining GPT-Q and LoRA for something like LLaMA lets you store the bulk of the model parameters in 4 bits (about 6.5 GB for the 13B LLaMA) and then finetune only a small (< 10M) set of LoRA parameters. This recipe has been letting me finetune the 7B LLaMA successfully on a single 24 GB GPU.
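Just to spell out the memory arithmetic: 13e9 params × 4 bits = 13e9 × 0.5 bytes ≈ 6.5 GB for the quantized base weights, and the LoRA parameters plus their optimizer state are tiny by comparison.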
Basically all you need to do is quantize the base weights with the GPT-Q script, add LoRA parameters on top of the quantized model, and finetune only those.
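Here's a rough, self-contained sketch of the shape of the recipe in plain JAX. This is not the actual API from the notebook: the "quantized" weight is just a stand-in matrix, and the shapes and names are made up for illustration.

```python
import jax
import jax.numpy as jnp

# 1. Start from the frozen base weights. In the real recipe these would be
#    quantized with GPT-Q (calibrated on data and packed into 4-bit ints);
#    here a plain matrix stands in for the quantized weight.
base_w = jax.random.normal(jax.random.PRNGKey(0), (512, 512))

# 2. Add small LoRA factors so the effective weight is W + B @ A.
rank = 8
key_a, _ = jax.random.split(jax.random.PRNGKey(1))
lora_params = {
    'a': 0.01 * jax.random.normal(key_a, (rank, 512)),
    'b': jnp.zeros((512, rank)),  # zero init so training starts from W
}

def model(lora_params, x):
    w_eff = base_w + lora_params['b'] @ lora_params['a']
    return x @ w_eff

# 3. Differentiate only w.r.t. the LoRA params; the (quantized) base weights
#    never get gradients or optimizer state.
def loss(lora_params, x, y):
    return jnp.mean((model(lora_params, x) - y) ** 2)

x = jnp.ones((4, 512))
y = jnp.zeros((4, 512))
grads = jax.grad(loss)(lora_params, x, y)
```

In the notebook the same idea is applied per-layer across the whole model rather than to a single matrix, but the division of labor is the same: quantized weights stay frozen, only the low-rank factors get gradients and optimizer state.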
Edit: Oh, I should probably mention that the GPT-Q code is pretty rough right now, but I'm hoping to clean it up in the next couple of weeks 😅. It's generally good enough for the stuff I've been using it for, but it might not work for all use cases.