Checkpoints are the full base_model and not just the lora model #353
I am also seeing this same issue when generating checkpoints with LoRA. All my checkpoints contain what appear to be the full model weights (I assume they are the merged LoRA + full-model weights) and no configuration files to actually run the checkpoint. That is all that gets generated in the checkpoint directories. I tried directly using the "full model weights" in pytorch_model.bin and it does not work. How do we extract the LoRA adapter from this file, or get the checkpoint to be saved as a LoRA-configured adapter that can be run for inference? Did you find any solution, @winglian?
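For anyone hitting the same thing, a minimal sketch of how one might inspect a checkpoint to see what it actually contains (the path is a placeholder; LoRA adapter tensors have "lora_" in their key names):

```python
# Sketch: inspect which keys a Trainer checkpoint actually contains, to see
# whether separate LoRA tensors are present or only full-model weights.
import torch

state_dict = torch.load("checkpoint-350/pytorch_model.bin", map_location="cpu")  # placeholder path

lora_keys = [k for k in state_dict if "lora_" in k]
print(f"total keys: {len(state_dict)}, lora keys: {len(lora_keys)}")
print(list(state_dict.keys())[:5])
```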
Edit: There's good reason to believe that the code below does not work as expected. I'm leaving it for context, but I recommend trying the callback method discussed further down instead.

I'm not sure whether this is intended behavior or not, but personally I do something like this to grab the adapter from a Trainer checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model, set_peft_model_state_dict

BASE_MODEL = "/data/your-base-model-path-here"
OUTPUT_DIR = "/data/your-peft-adapter-will-go-here"
STATE_DICT = "/data/your-checkpoint-folder-here/pytorch_model.bin"

model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# This needs to match your training configuration _exactly_.
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=64,
    lora_alpha=32,
    lora_dropout=0.05,
)
model = get_peft_model(model, peft_config)

full_state_dict = torch.load(STATE_DICT, map_location="cpu")
set_peft_model_state_dict(model, full_state_dict)
model.save_pretrained(OUTPUT_DIR)
```

Works with LLaMA trained with DeepSpeed ZeRO 1. If doing model sharding (FSDP, ZeRO 3) you might need to make some changes, but the general gist is: get the PyTorch module (the …
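As a follow-up, a hedged sketch of loading an adapter extracted this way for inference; the paths reuse the placeholders from the snippet above, and it assumes a causal-LM base model:

```python
# Sketch: attach the saved adapter (adapter_config.json + adapter_model.bin)
# on top of the base model and run a quick generation to sanity-check it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "/data/your-base-model-path-here"       # placeholder
ADAPTER_DIR = "/data/your-peft-adapter-will-go-here"  # placeholder

base = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, ADAPTER_DIR)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

inputs = tokenizer("Hello, my name is", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```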
Hello @winglian, I found this post by a collaborator suggesting the use of callbacks in the Trainer to save the model: #286 (comment)

@0x000011b, does the LoRA work well for you? I have tried it (made sure the config was the same) and it ran successfully, but my results were quite bad. I'm not sure whether that's due to my training or to this method. I checked whether the keys match:

```python
get_peft_model_state_dict(model).keys()
# keys start with _orig_mod
# dict_keys(['_orig_mod.base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight', '_orig_mod.base_model.model.model.layers.0.self_attn.q_proj..

full_state_dict.keys()
# odict_keys(['base_model.model.model.embed_tokens.weight', 'base_model.model.model.layers.0.self_attn.q_proj...
```

Looking at this, what if the LoRA is not properly merged due to a key mismatch?
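For what it's worth, the `_orig_mod.` prefix is what `torch.compile` adds when it wraps a module, so one way to check whether that is the only difference is to strip it and compare the key sets. A rough sketch, assuming the `model` and `full_state_dict` objects from the snippets above:

```python
from peft import get_peft_model_state_dict

# `model` and `full_state_dict` refer to the objects from the earlier snippets.
# Strip the torch.compile wrapper prefix from the model-side keys, then see
# which LoRA keys have no counterpart in the checkpoint.
model_keys = {k.removeprefix("_orig_mod.") for k in get_peft_model_state_dict(model)}
missing = sorted(k for k in model_keys if k not in full_state_dict)

print(f"{len(missing)} LoRA keys not found in the checkpoint")
print(missing[:5])
```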
@NanoCode012 Regarding whether or not the keys match, an easy way to find out would be replacing set_peft_model_state_dict(model, full_state_dict) with model.load_state_dict(full_state_dict). This skips a bunch of PEFT internals (which are probably there for a reason, hence I don't quite recommend it), but it's a useful test because it outputs a report of missing and unexpected keys, which should let you know whether everything is OK.

However, I've been getting some subpar results while attempting to test a LoRA that I extracted via this method. I assumed this was down to poor training data or hparams, but @Rallio67 has also reported similar behavior, so perhaps there is indeed something wrong with this approach. Does the callback method work well for you?
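A small sketch of that check, again assuming the `model` and `full_state_dict` objects from the extraction snippet; with `strict=False` the mismatches are reported rather than raised:

```python
# load_state_dict returns an object listing missing and unexpected keys.
result = model.load_state_dict(full_state_dict, strict=False)

lora_missing = [k for k in result.missing_keys if "lora_" in k]
print(f"missing keys: {len(result.missing_keys)} (of which LoRA: {len(lora_missing)})")
print(f"unexpected keys: {len(result.unexpected_keys)}")
# If any "lora_" keys show up as missing, the adapter weights were never loaded.
```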
@0x000011b, I get a key-mismatch error when I try that. I remember the original tloen repo also had a similar error that was said to be OK to ignore. I'm not sure if it's the same thing. I will test the callback method.
@NanoCode012 I've heard a lot of complaints about bugs and weird behaviors in the tloen repo recently, so I'm not sure how much I'd trust that comment. If model weights are failing to load because of mismatched key names, I think something is indeed going wrong and it's not safe to ignore. If the training code you're using does more stuff to …
@0x000011b, I've tested comparing the adapter produced by extracting the LoRA against one produced by resuming from the checkpoint and calling save_pretrained:

```
# Extract lora
08745c9d7cb8f38aebe64c538cd5dfe2cc22f5edcd333afc4c25efb875eee954  adapter_model.bin

# Resume then save_pretrained
8671810c23f7310fe1c1933cbb227dc405873476eb241ae99e5e7fa210efcff2  adapter_model.bin
```

The hashes differ. Note: if there is a bug with resume, or if the trainer modifies the weights slightly, then this would invalidate the results above. I want to try the callback, but I'm not sure how I can "force" a … Edit: Loading the …
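One way to go a step further than file hashes is to compare the two adapters tensor-by-tensor, since byte-level checksums can differ even when the weights are identical. A sketch with placeholder paths:

```python
# Sketch: load both adapter state dicts and check key sets and tensor values.
import torch

a = torch.load("extracted-lora/adapter_model.bin", map_location="cpu")        # placeholder path
b = torch.load("resumed-save-pretrained/adapter_model.bin", map_location="cpu")  # placeholder path

print("same keys:", set(a) == set(b))
diffs = [k for k in a if k in b and not torch.equal(a[k], b[k])]
print(f"tensors that differ: {len(diffs)}", diffs[:10])
```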
Thank you for your help looking into this. The code does work to generate an adapter that PEFT can accept without errors; however, the adapter is corrupted in some way, since the output when using the adapter is no different from the untrained model. I am testing with t5-xl-lm. I evaluated the converted checkpoint at 350 steps against the final output (from the training script), and the output is all good with the model generated at the completion of the training script (352 steps in my case).
@0x000011b, I have fine-tuned a simple model using callbacks (code here: https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/src/axolotl/utils/callbacks.py). I have no idea if it's an implementation issue or some training issue, but all my checkpoint adapters come out identical. The final one in the output folder is different (after the model trained). How have your results fared?
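For readers following along, a minimal sketch of the general idea behind such a callback (not necessarily the exact implementation linked above): save the PEFT adapter into each checkpoint folder whenever the Trainer saves, assuming the model handed to the Trainer is the PEFT-wrapped model.

```python
import os
from transformers import TrainerCallback

class SavePeftAdapterCallback(TrainerCallback):
    """On every checkpoint save, also write the PEFT adapter files
    (adapter_config.json + adapter_model.bin) into the checkpoint folder."""

    def on_save(self, args, state, control, **kwargs):
        checkpoint_dir = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
        kwargs["model"].save_pretrained(os.path.join(checkpoint_dir, "adapter_model"))
        return control

# Usage sketch: trainer = Trainer(..., callbacks=[SavePeftAdapterCallback()])
```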
@Rallio67, yes, I suspect that something along these lines happened as well. Did you mean that the final adapter works, but not the extracted one?
@NanoCode012 If you get to the end of the Trainer training loop using LoRA PEFT, the final saved model (not any of the checkpoints) does work and gives good, expected performance. I have not figured out a way to make any of the checkpoints work.
@0x000011b @Rallio67, I checked the source code, and this method of "extracting" the LoRA seems to be exactly the same as the one used to load it:

Lines 372 to 376 in b1059b7

In fact, this makes me suspect whether the line below works for loading. I cannot find any code within this repo messing with the saving of checkpoints, so theoretically it should load all weights properly, but what if the PEFT weights aren't loaded properly?

```python
trainer.train(resume_from_checkpoint=resume_from_checkpoint)  # folder with pytorch_model.bin
```
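A rough sketch of how one might probe that concern, assuming `trainer` is inspected right after the checkpoint has been loaded (before further updates) and using a placeholder checkpoint path:

```python
# Sketch: check whether every live LoRA key has a counterpart in the resumed
# checkpoint (stripping the torch.compile prefix on the model side).
import torch
from peft import get_peft_model_state_dict

checkpoint_sd = torch.load("checkpoint-350/pytorch_model.bin", map_location="cpu")  # placeholder path
live_sd = get_peft_model_state_dict(trainer.model)  # `trainer` from the training script

missing = [k for k in live_sd if k.removeprefix("_orig_mod.") not in checkpoint_sd]
print(f"{len(missing)} LoRA keys have no counterpart in the resumed checkpoint")
```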
I redid a training run for this. I had an issue with the optimizer due to some code changes. I believe the callback does work. The results seem somewhat OK for what it was given (a small dataset).
@NanoCode012 I can confirm that on my end the callback does indeed seem to work as expected: different files for each checkpoint, plus when loaded with …
@0x000011b, I was wondering if you have tried to "extract" the LoRA from your last checkpoint and compare it against the LoRA saved by the callback. Are they the same? My machine is a bit busy, so I was not able to test this.
I found another repo which loads from …
Hi @0x000011b, may I ask how you use the callback to correctly save and load the adapter weights? Thanks a lot!
Hi everyone, if you install the latest version of transformers, or install it from source, everything should work. I am temporarily closing this issue; feel free to re-open it or open a new ticket.
Started happening sometime in the last week.