
Checkpoints are the full base_model and not just the lora model #353

Closed

winglian opened this issue Apr 21, 2023 · 17 comments

Comments

@winglian
Contributor

Started happening sometime in the last week.

@Rallio67

Rallio67 commented May 8, 2023

I am also seeing this same issue when generating checkpoints with LoRA. All my checkpoints contain what appear to be the full model weights (I assume it is the merged LoRA + full model weights) and no configuration files to actually run the checkpoint.

 15K May  8 00:01 rng_state_0.pth
7.7K May  8 00:01 trainer_state.json
193M May  8 00:01 optimizer.pt
 627 May  8 00:01 scheduler.pt
3.6K May  8 00:01 training_args.bin
 37G May  8 00:01 pytorch_model.bin
 15K May  8 00:00 rng_state_1.pth
 15K May  8 00:00 rng_state_4.pth
 15K May  8 00:00 rng_state_6.pth
 15K May  8 00:00 rng_state_7.pth
 15K May  8 00:00 rng_state_2.pth
 15K May  8 00:00 rng_state_3.pth
 15K May  8 00:00 rng_state_5.pth

This is what is generated in the checkpoint directories. I tried directly using the "full model weights" in pytorch_model.bin and it does not work. How do we extract the LoRA adapter from this file, or get the checkpoint to be saved as a LoRA-configured adapter that can be run for inference?

Did you find any solution @winglian ?

@0x000011b
Contributor

0x000011b commented May 8, 2023

Edit: There's good reason to believe that the code below does not work as expected - I'm leaving it for context, but I recommend trying the Trainer callback approach instead as a workaround for this.


I'm not sure whether this is intended behavior or not, but personally I do something like this to grab the adapter from a Trainer checkpoint:

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model, set_peft_model_state_dict

BASE_MODEL = "/data/your-base-model-path-here"
OUTPUT_DIR = "/data/your-peft-adapter-will-go-here"
STATE_DICT = "/data/your-checkpoint-folder-here/pytorch_model.bin"

# Rebuild the same module that was trained: the base model wrapped as a PEFT model.
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# This needs to match your training configuration _exactly_.
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=64,
    lora_alpha=32,
    lora_dropout=0.05,
)
model = get_peft_model(model, peft_config)

# Load the full Trainer checkpoint and push its weights into the PEFT-wrapped model.
full_state_dict = torch.load(STATE_DICT, map_location="cpu")
set_peft_model_state_dict(model, full_state_dict)

# Writes only adapter_model.bin + adapter_config.json to OUTPUT_DIR.
model.save_pretrained(OUTPUT_DIR)

Works with LLaMA trained with DeepSpeed ZeRO 1. If you're doing model sharding (FSDP, ZeRO 3) you might need to make some changes, but the general gist is: get the PyTorch module (the model) to be the same as the one used for training, load the state dict from the Trainer checkpoint onto it, then use the usual PEFT machinery (.save_pretrained) to spit out the adapter.

@NanoCode012

NanoCode012 commented May 8, 2023

Hello,

@winglian, I found this post by a collaborator suggesting the use of Trainer callbacks to save the model: #286 (comment)
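For context, here is a minimal sketch of what such a callback might look like (an illustration rather than the exact code from that comment; it assumes the transformers TrainerCallback API, and the SavePeftModelCallback name and adapter_model subfolder are illustrative):

import os

from transformers import TrainerCallback


class SavePeftModelCallback(TrainerCallback):
    """Write only the PEFT adapter next to each Trainer checkpoint."""

    def on_save(self, args, state, control, **kwargs):
        # Trainer names its checkpoint folders "checkpoint-<global_step>".
        checkpoint_dir = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
        adapter_dir = os.path.join(checkpoint_dir, "adapter_model")
        # kwargs["model"] is the PeftModel being trained; save_pretrained writes
        # adapter_model.bin + adapter_config.json instead of the full base model.
        kwargs["model"].save_pretrained(adapter_dir)
        return control


# usage: Trainer(..., callbacks=[SavePeftModelCallback()])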

@0x000011b, does the LoRA work well for you? I tried it (made sure the config was the same) and it ran successfully, but my results were quite bad. I'm not sure whether that's due to my training or to this method.

I checked whether the keys match:

get_peft_model_state_dict(model).keys() 
# keys start with _orig_mod
# dict_keys(['_orig_mod.base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight', '_orig_mod.base_model.model.model.layers.0.self_attn.q_proj..

full_state_dict.keys()
# odict_keys(['base_model.model.model.embed_tokens.weight', 'base_model.model.model.layers.0.self_attn.q_proj...

Looking at this, what if the LoRA weights are not properly loaded due to the key mismatch?
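A quick way to inspect the mismatch directly (a sketch reusing model and full_state_dict from the extraction snippet above):

# Compare the keys the PEFT-wrapped model expects against the keys in the checkpoint.
model_keys = set(model.state_dict().keys())
ckpt_keys = set(full_state_dict.keys())
print("expected by model, missing from checkpoint:", sorted(model_keys - ckpt_keys)[:3])
print("in checkpoint, unexpected by model:", sorted(ckpt_keys - model_keys)[:3])
# If the model keys only differ by an "_orig_mod." prefix, the mismatch is in the
# wrapper around the module rather than in the weights themselves.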

@0x000011b
Contributor

@NanoCode012 Regarding whether or not the keys match, an easy way to find out would be replacing:

set_peft_model_state_dict(model, full_state_dict)

with:

model.load_state_dict(full_state_dict)

This will skip a bunch of PEFT internals (which are probably there for a reason, hence I don't quite recommend it), but it's a useful test because it should output something like:

<All keys matched successfully>

Which should let you know that everything is OK.

However, I've been getting some subpar results while attempting to test a LoRA that I extracted via this method. I assumed this was down to poor training data or hparams - but @Rallio67 has also reported similar behavior, so perhaps there is indeed something wrong with this approach. Does the callback method work well for you?

@NanoCode012

NanoCode012 commented May 8, 2023

@0x000011b, I get

model.load_state_dict(full_state_dict)

RuntimeError: Error(s) in loading state_dict for OptimizedModule:
        Missing key(s) in state_dict: "_orig_mod.base_model.model.model.embed_tokens.weight"..
        Unexpected key(s) in state_dict: "base_model.model.model.embed_tokens.weight", "base_model.model.model.layers.0.self_attn.q_proj.weight"

I remember the original tloen repo also had a similar error that was said to be OK to ignore. I'm not sure if it's the same thing.

I will test the callback method.

@0x000011b
Contributor

@NanoCode012 I've heard a lot of complaints about bugs and weird behaviors in the tloen repo recently so I'm not sure how much I'd trust that comment - if model weights are failing to load because of mismatched key names, I think something is indeed going wrong and it's not safe to ignore.

If the training code you're using does more to the model (the _orig_mod. prefix in your keys suggests the module is wrapped with torch.compile, for example), you can try replicating that in the code you use to extract the adapter, but indeed I'd just give the callback approach a shot if you can. It's what I'm doing right now.
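As an example (a sketch, assuming the prefix really does come from torch.compile; it reuses model, full_state_dict, and OUTPUT_DIR from the extraction snippet above):

# If the in-memory model was wrapped by torch.compile, its state-dict keys gain an
# "_orig_mod." prefix that the Trainer checkpoint's keys do not have. torch.compile
# keeps the original module under ._orig_mod, so load the checkpoint into that instead.
unwrapped = getattr(model, "_orig_mod", model)
set_peft_model_state_dict(unwrapped, full_state_dict)
unwrapped.save_pretrained(OUTPUT_DIR)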

@NanoCode012

NanoCode012 commented May 8, 2023

@0x000011b, I've tested using resume_from_checkpoint and compared the resulting adapter_model.bin against the extracted one:

# Extract lora
08745c9d7cb8f38aebe64c538cd5dfe2cc22f5edcd333afc4c25efb875eee954  adapter_model.bin

# Resume then save_pretrained
8671810c23f7310fe1c1933cbb227dc405873476eb241ae99e5e7fa210efcff2  adapter_model.bin

The hashes differ. Note: if there is a bug with resume, or if the Trainer modifies the weights slightly, that would invalidate this comparison.
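For reference, the "resume then save_pretrained" route above is roughly the following (a sketch; it assumes trainer wraps the same PEFT model and config used for training, and CHECKPOINT_DIR / OUTPUT_DIR are placeholders):

# Let the Trainer restore the model (and optimizer) state from the checkpoint folder,
# i.e. the one containing pytorch_model.bin. If training already reached max_steps,
# it should finish without performing further optimizer steps.
trainer.train(resume_from_checkpoint=CHECKPOINT_DIR)

# Then write out only the adapter (adapter_model.bin + adapter_config.json).
trainer.model.save_pretrained(OUTPUT_DIR)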

I want to try the callback, but I'm not sure how I can "force" an on_save for the callback since my training is complete. I could load an earlier weight and train, but that takes a while, so I'll only try it if this new LoRA fails.

Edit: Loading the resumed LoRA gives me trainable params: 0 || all params: 6742609920 || trainable%: 0.0, which does not seem right.

@Rallio67

Rallio67 commented May 9, 2023

I'm not sure whether this is intended behavior or not, but personally I do something like this to grab the adapter from a Trainer checkpoint: [...]

Thank you for your help looking into this. The code does work to generate an adapter that PEFT can accept without errors; however, the adapter is corrupted in some way, since the output when using it is no different from the untrained model. I am testing with t5-xl-lm: I compared the converted checkpoint at step 350 against the final model saved at the completion of the training script (352 steps in my case), and only the final model produces good output.

@NanoCode012

NanoCode012 commented May 9, 2023

@0x000011b , I have fine-tuned a simple model using callbacks (code here: https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/src/axolotl/utils/callbacks.py)

I have no idea if it's an implementation issue or a training issue, but all of the adapter_model.bin files in my checkpoint folders (checkpoints 1.6k-1.8k) are identical according to sha256sum.

The final one in the output folder (saved after training finished) is different. How have your results fared?
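As an aside, a tensor-level comparison can double-check what sha256sum suggests (a sketch with placeholder paths):

import torch

# Load two saved adapters and compare them tensor by tensor (paths are placeholders).
a = torch.load("checkpoint-1600/adapter_model/adapter_model.bin", map_location="cpu")
b = torch.load("checkpoint-1800/adapter_model/adapter_model.bin", map_location="cpu")

identical = a.keys() == b.keys() and all(torch.equal(a[k], b[k]) for k in a)
print("adapters identical:", identical)  # True would mean the LoRA weights never changed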

The code does work to generate an adapter that PEFT can accept without errors; however, the adapter is corrupted in some way, since the output when using it is no different from the untrained model.

@Rallio67, yes, I suspect something along those lines happened as well. Did you mean that the final adapter works, but not the extracted one?

@Rallio67

Rallio67 commented May 9, 2023

@NanoCode012 if you get to the end of the Trainer training loop using LoRA PEFT, the final saved model (not any of the checkpoints) does work and gives the expected good performance. I have not figured out a way to make any of the checkpoints work.

@NanoCode012

@0x000011b @Rallio67, I checked the source code, and this method of "extracting" the LoRA seems to be exactly the same as the one used to load adapter_model.bin. It does not touch pytorch_model.bin.

peft/src/peft/peft_model.py, lines 372 to 376 at b1059b7:

adapters_weights = torch.load(
    filename, map_location=torch.device("cuda" if torch.cuda.is_available() else "cpu")
)
# load the weights into the model
set_peft_model_state_dict(self, adapters_weights, adapter_name=adapter_name)

In fact, this makes me question whether the call below actually loads the weights. I cannot find any code within this repo interfering with the saving of checkpoints, so in theory it should load all the weights properly, but what if the PEFT weights aren't being loaded?

trainer.train(resume_from_checkpoint=resume_from_checkpoint) # folder with pytorch_model.bin
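One way to sanity-check that suspicion (a sketch; it assumes trainer wraps a PeftModel and that training is already complete, so the resume itself performs no further optimizer steps):

import torch
from peft import get_peft_model_state_dict

# Snapshot the freshly initialized LoRA weights before resuming.
before = {k: v.detach().cpu().clone()
          for k, v in get_peft_model_state_dict(trainer.model).items()}

trainer.train(resume_from_checkpoint=resume_from_checkpoint)

after = get_peft_model_state_dict(trainer.model)
unchanged = [k for k in before if torch.equal(before[k], after[k].detach().cpu())]
print(f"{len(unchanged)} of {len(before)} LoRA tensors are identical to the fresh init")
# If (nearly) all of them are unchanged, the checkpoint's LoRA weights were never loaded.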

@NanoCode012

NanoCode012 commented May 9, 2023

@0x000011b, I have fine-tuned a simple model using callbacks (code here: https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/src/axolotl/utils/callbacks.py)

I have no idea if it's an implementation issue or a training issue, but all of the adapter_model.bin files in my checkpoint folders (checkpoints 1.6k-1.8k) are identical according to sha256sum.

I redid the training for this (I had an issue with the optimizer due to some code changes). I believe the callback does work. The results seem reasonably OK for what it was given (a small dataset).

@0x000011b
Contributor

@NanoCode012 I can confirm that on my end the callback does indeed seem to work as expected:

checkpoint-720/adapter_model/adapter_model.bin A63CEAAD
checkpoint-780/adapter_model/adapter_model.bin 7D67E129

Different files for each checkpoint, plus when loaded with from_pretrained the model is coherent and seems to be learning from the training data. Here are the versions I'm using of all the relevant packages, just in case:

accelerate 565152183334f709ac955204ef663023d1f63b7a
transformers 3d3204c025b6b5de013e07dd364208e28b4d9589
peft 382b178911edff38c1ff619bbac2ba556bd2276b
deepspeed 0.8.3 (regular pip install)

@NanoCode012

NanoCode012 commented May 9, 2023

@0x000011b, I was wondering if you have tried to "extract" the LoRA from your last checkpoint and compare it against the one saved by the callback? Are they the same?

My machine is a bit busy, so I was not able to test this.

@NanoCode012

I found another repo which loads from pytorch_model.bin and then sets the weights on the model. It follows the same principle as the LoRA extraction above: https://github.com/Facico/Chinese-Vicuna/blob/cd04b2d8c3ed07c921b03b4f9fc1e56969a997a1/finetune.py#L89-L113

@annahung31

@NanoCode012 I can confirm that on my end the callback does indeed seem to work as expected: [...]

Hi @0x000011b, may I ask how you use the callback to correctly save and load the adapter weights? Thanks a lot!

@younesbelkada
Contributor

Hi everyone,
The issues related to saving PEFT models should have been resolved by the recent PRs on the HF Trainer: huggingface/transformers#24073 / huggingface/transformers#24103 / huggingface/transformers#24274

If you install the latest version of transformers, or install it from source, everything should work.

I am temporarily closing this issue; feel free to re-open it or open a new ticket.
