Fix LoftQ docs and tests #1532
Conversation
Relates to huggingface#1525 Unfortunately, the docs I wrote about how to use LoftQ were incorrect, based on a misunderstanding I had. In reality, it is quite a bit more involved to get LoftQ working, requiring a complete roundtrip: first loading a non-quantized model with LoftQ, then saving the LoRA weights and the modified base model, then loading the just-stored base model again, this time with quantization, and finally loading the LoftQ-initialized adapter on top. The docs now link to the example which demonstrates how to move through these steps.
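For illustration, a rough sketch of that roundtrip in Python. The model id and the `loftq-adapter`/`loftq-base-model` paths are placeholders, `unload()` is used as a stand-in for the dedicated unwrapping helpers in the linked example, and the example itself remains the authoritative recipe:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoftQConfig, LoraConfig, PeftModel, get_peft_model

model_id = "bigscience/bloomz-560m"  # placeholder model

# 1. Load the non-quantized model and initialize the LoRA weights with LoftQ;
#    LoftQ also modifies the base weights in place (requires bitsandbytes/GPU).
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
lora_config = LoraConfig(
    init_lora_weights="loftq",
    loftq_config=LoftQConfig(loftq_bits=4),
    target_modules="all-linear",
)
peft_model = get_peft_model(model, lora_config)

# 2. Save the LoftQ-initialized adapter.
peft_model.save_pretrained("loftq-adapter")

# 3. Save the modified base model (a whole new copy of the base weights);
#    the linked example ships its own helper functions for this unwrapping step.
base_model = peft_model.unload()
base_model.save_pretrained("loftq-base-model")

# 4. Reload the just-saved base model, this time with bnb 4-bit quantization.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
quantized_model = AutoModelForCausalLM.from_pretrained(
    "loftq-base-model", quantization_config=bnb_config
)

# 5. Load the LoftQ-initialized adapter on top of the quantized base model.
peft_model = PeftModel.from_pretrained(quantized_model, "loftq-adapter", is_trainable=True)
```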
@yxli2123 As described above, I made a mistake in how LoftQ needs to be used. This PR adjusts the docs and the unit tests to apply LoftQ correctly (to the best of my understanding). However, after I changed the unit tests, 8 of them started failing and only 4 passed.
In these unit tests, I measure the model output from the base model without any quantization, then the model output of the bnb-quantized model, and finally the model output of the quantized model using LoftQ. The expectation is that the model output with LoftQ should be closer to the base model output than the bnb-quantized model output (without LoftQ) is. (Note that I'm only testing that LoftQ should lead to closer results, even if it's only marginally closer. Ideally, however, we would want it to be better by a certain margin.) I'm not sure why the error with LoftQ is greater and not smaller. Am I still applying LoftQ incorrectly? Please check the unit test in case I made a mistake there. For debugging purposes, I added a check that on each individual layer, the residual error does indeed decrease when applying the LoRA weights initialized with LoftQ, so this part seems to be correct. Moreover, while working on this, some questions came up:
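To make the comparison described above concrete, this is roughly the kind of check the tests perform (illustrative only, not the actual test code):

```python
import torch

def assert_loftq_closer(
    logits_base: torch.Tensor,       # output of the non-quantized base model
    logits_quantized: torch.Tensor,  # output of the bnb-quantized model without LoftQ
    logits_loftq: torch.Tensor,      # output of the bnb-quantized model with LoftQ-initialized LoRA
) -> None:
    """Check that LoftQ moves the quantized model's output closer to the base model's output."""
    err_quantized = (logits_base - logits_quantized).abs().mean()
    err_loftq = (logits_base - logits_loftq).abs().mean()
    # Only a marginal improvement is required here; ideally err_loftq would beat
    # err_quantized by a clear margin.
    assert err_loftq < err_quantized, f"{err_loftq.item()} >= {err_quantized.item()}"
```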
Related to huggingface#1532 At the moment, using LoftQ is quite cumbersome, as shown in this example: https://github.com/huggingface/peft/tree/7e84dec20b3106bdd0a90ba8e80187f0aec835b7/examples/loftq_finetuning Essentially, users have to:

1. Load the non-quantized model with LoftQ (which can be quite huge)
2. Modify the PEFT config
3. Save the adapter
4. Unwrap the base model with custom functions
5. Save the base model with modified weights (i.e. a whole copy of the base model)
6. Load the base model from step 5 with bnb quantization
7. Load the adapter from step 3

Yes, there is a helper script to do this, but it still has the disadvantage that we need to load the non-quantized model and that we have to create a completely new model checkpoint with the modified weights.

This PR aims to make this process more convenient by adding a single function, replace_lora_weights_loftq (see the sketch below). This function takes the bnb-quantized LoRA model as input. Then it goes through each module with LoRA weights, lazily loads the corresponding non-quantized weights one at a time using safetensors, computes the quantization error, and replaces the LoRA weights with LoftQ-initialized LoRA weights. This is much more convenient because we only require very little extra memory thanks to lazy loading, and we don't have to keep an extra copy of the weights.

While working on this, I still found that LoftQ initialization often did not seem to help a lot, as mentioned in huggingface#1532. I measured this by creating (1) logits with the base model, (2) with the quantized+LoRA model, and (3) with the quantized+LoRA+LoftQ model. The expectation is that (1) should be closer to (3) than to (2). This was often not the case. I therefore added the possibility to run a check each time we replace a LoRA weight with the LoftQ weights. If this check returns True, we proceed to the next weight, otherwise we discard the change. That way, we only make the replacement with LoftQ weights if we see a real improvement. Of course, this is only a form of greedy optimization, but it seems to work in practice. And since it's optional, users can choose not to use it.

This PR is not yet finished since I ran into an issue with the key names from safetensors not matching. Furthermore, for now this doesn't support 8bit quantization or the num_iter argument of LoftQ, which I'm not sure is really working. However, I guess the replace_lora_weights_loftq function could be called multiple times in a row.
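A rough sketch of how this could look from the user side, assuming the function ends up with roughly the shape described above (the model id, prompt, and callback are illustrative, and the final merged API may differ):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, replace_lora_weights_loftq

model_id = "bigscience/bloomz-560m"  # placeholder model

# Reference logits from the non-quantized model on a small probe batch.
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("LoftQ initializes LoRA weights to reduce quantization error.", return_tensors="pt")
with torch.no_grad():
    reference_logits = AutoModelForCausalLM.from_pretrained(model_id)(**inputs).logits

# bnb-quantized model with (randomly initialized) LoRA weights on top.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
peft_model = get_peft_model(quantized_model, LoraConfig(target_modules="all-linear"))

best_error = float("inf")

def loftq_callback(model, module_name):
    """Greedy check: keep the LoftQ replacement for this module only if the logits error shrinks."""
    global best_error
    batch = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        logits = model(**batch).logits
    error = (logits.cpu().float() - reference_logits).abs().mean().item()
    if error < best_error:
        best_error = error
        return True   # keep the replacement
    return False      # discard the replacement for this module

# Replace the LoRA weights in place with LoftQ-initialized ones, module by module.
replace_lora_weights_loftq(peft_model, callback=loftq_callback)
```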
Thank you Benjamin. After going through the tests and the examples for LoftQ, I believe that the discrepancy and the failing tests happen because bitsandbytes casts the activations to float16 when loading the LoftQ base model and adapters. So even though the difference between the original weights W and the quantized+adapter weights (Q + BA) is smaller than the difference when just using Q, this isn't reflected in the logits, because the activations at each layer are being cast to float16.
Thank you @BenjaminBossan for all the work on rectifying LoftQ tests and uncovering best practices for applying it! 🤗
@@ -44,6 +44,8 @@ config = LoraConfig(init_lora_weights=False, ...)

When quantizing the base model for QLoRA training, consider using the [LoftQ initialization](https://arxiv.org/abs/2310.08659), which has been shown to improve performance when training quantized models. The idea is that the LoRA weights are initialized such that the quantization error is minimized. To use LoftQ, follow [these instructions](https://github.com/huggingface/peft/tree/main/examples/loftq_finetuning).

In general, for LoftQ to work best, it is recommended to target as many layers with LoRA as possible, since those not targeted cannot have LoftQ applied. This means that passing `LoraConfig(..., target_modules="all-linear")` will most likely give the best results. Also, you should use `nf4` as quant type in your quantization config when using 4bit quantization, i.e. `BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")`.
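For reference, a minimal config sketch following that recommendation (assuming 4-bit bnb quantization; the values are illustrative):

```python
from transformers import BitsAndBytesConfig
from peft import LoftQConfig, LoraConfig

# 4-bit quantization with the recommended nf4 quant type.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")

# Target all linear layers so that as many layers as possible benefit from LoftQ.
lora_config = LoraConfig(
    target_modules="all-linear",
    init_lora_weights="loftq",
    loftq_config=LoftQConfig(loftq_bits=4),
)
```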
Nice point to note in the docs.
Don't merge this, some GPU tests are failing
The unit tests have been adjusted to go through the same steps described above, but now most of them fail, i.e. the quantization error is greater with LoftQ than without. This needs to be investigated.