Recipe for quantizing + LoRA-ing models #15959
davisyoshida asked this question in Show and tell
I just put up a notebook showing how to combine GPT-Q and LoRA for finetuning JAX models. I already posted my LoRA implementation here, so the new piece is the GPT-Q script, which makes it really easy to apply quantization to arbitrary functions.
The torch implementation of GPT-Q mostly seems to rely on hand-writing a script for each new architecture (there's a whole repo just for applying it to LLaMA). In JAX I was able to write a single implementation that works for everything from a single matmul to HuggingFace's GPT-2 and T5 models.
Combining GPT-Q and LoRA for something like LLaMA lets you store the bulk of the model parameters in 4 bits (about 6.5 GB for the 13B LLaMA) and then finetune only a small (< 10M) set of LoRA parameters. This recipe has been letting me finetune the 7B LLaMA successfully on a single 24 GB GPU.
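Just to spell out the memory arithmetic: 13e9 params × 4 bits = 13e9 × 0.5 bytes ≈ 6.5 GB for the quantized base weights, and the LoRA parameters plus their optimizer state are tiny by comparison.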
Basically all you need to do is quantize the base weights with the GPT-Q script, add LoRA parameters on top of the quantized model, and finetune only those.
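Here's a rough, self-contained sketch of the shape of the recipe in plain JAX. This is not the actual API from the notebook: the "quantized" weight is just a stand-in matrix, and the shapes and names are made up for illustration.

```python
import jax
import jax.numpy as jnp

# 1. Start from the frozen base weights. In the real recipe these would be
#    quantized with GPT-Q (calibrated on data and packed into 4-bit ints);
#    here a plain matrix stands in for the quantized weight.
base_w = jax.random.normal(jax.random.PRNGKey(0), (512, 512))

# 2. Add small LoRA factors so the effective weight is W + B @ A.
rank = 8
key_a, _ = jax.random.split(jax.random.PRNGKey(1))
lora_params = {
    'a': 0.01 * jax.random.normal(key_a, (rank, 512)),
    'b': jnp.zeros((512, rank)),  # zero init so training starts from W
}

def model(lora_params, x):
    w_eff = base_w + lora_params['b'] @ lora_params['a']
    return x @ w_eff

# 3. Differentiate only w.r.t. the LoRA params; the (quantized) base weights
#    never get gradients or optimizer state.
def loss(lora_params, x, y):
    return jnp.mean((model(lora_params, x) - y) ** 2)

x = jnp.ones((4, 512))
y = jnp.zeros((4, 512))
grads = jax.grad(loss)(lora_params, x, y)
```

In the notebook the same idea is applied per-layer across the whole model rather than to a single matrix, but the division of labor is the same: quantized weights stay frozen, only the low-rank factors get gradients and optimizer state.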
Edit: Oh, I should probably mention that the GPT-Q code is pretty rough right now, but I'm hoping to clean it up in the next couple of weeks 😅. It's generally good enough for the stuff I've been using it for, but it might not work for all use cases.