[Usage] `resume_from_checkpoint` fails when finetuning in the lora settings #1200

zengxingchen · 2024-02-29T03:39:16Z

Describe the issue

I think the code is trying to resume_from_checkpoint like its a full-parameter fine-tunung checkpoint.

zengxingchen · 2024-02-29T03:41:02Z

CynthiaChuang · 2024-03-14T03:57:32Z

I have the same issue. Can anyone tell me how to fix it?

qingyuanxingsi · 2024-03-25T04:47:52Z

+1

sunhm15 · 2024-04-01T08:07:21Z

+1

davidhalladay · 2024-04-08T06:00:41Z

I encountered this error while resuming the checkpoint of Lora training. I found that this is basically due to the old version of Transformers that LLaVA is using. Please refer to this issue: huggingface/peft#746.

In this issue, the names of keys in the checkpoint saved via deepspeed have some mismatches with those saved via Transformers. Specifically, there is an added ".default." in each key of the non-trainable parameters, leading to errors while loading the checkpoint.

Here is a solution that I found. I have only tested it for Lora training, and it works well. However, I haven't tested it for other features, thus, it may potentially introduce further errors:

This "mismatch" problem has been solved in the latest Transformers package. Thus, we need to update the Transformers package to the latest version:
pip install transformers==4.39.3
And then we need to update Accelerate as well based on the version of Transformers:
pip install accelerate==0.27.2

Again, this works for me so far only with Lora training. I'm not sure whether this will introduce other errors.

zhipeixu · 2024-04-25T06:30:33Z

I encountered this error while resuming the checkpoint of Lora training. I found that this is basically due to the old version of Transformers that LLaVA is using. Please refer to this issue: huggingface/peft#746.

In this issue, the names of keys in the checkpoint saved via deepspeed have some mismatches with those saved via Transformers. Specifically, there is an added ".default." in each key of the non-trainable parameters, leading to errors while loading the checkpoint.

Here is a solution that I found. I have only tested it for Lora training, and it works well. However, I haven't tested it for other features, thus, it may potentially introduce further errors:

This "mismatch" problem has been solved in the latest Transformers package. Thus, we need to update the Transformers package to the latest version:
pip install transformers==4.39.3

And then we need to update Accelerate as well based on the version of Transformers:
pip install accelerate==0.27.2

Again, this works for me so far only with Lora training. I'm not sure whether this will introduce other errors.

but i meet some problems when pip, how can you solve this :

Linjyan00 · 2024-04-25T11:11:48Z

I encountered this error while resuming the checkpoint of Lora training. I found that this is basically due to the old version of Transformers that LLaVA is using. Please refer to this issue: huggingface/peft#746.
In this issue, the names of keys in the checkpoint saved via deepspeed have some mismatches with those saved via Transformers. Specifically, there is an added ".default." in each key of the non-trainable parameters, leading to errors while loading the checkpoint.
Here is a solution that I found. I have only tested it for Lora training, and it works well. However, I haven't tested it for other features, thus, it may potentially introduce further errors:

This "mismatch" problem has been solved in the latest Transformers package. Thus, we need to update the Transformers package to the latest version:
pip install transformers==4.39.3

And then we need to update Accelerate as well based on the version of Transformers:
pip install accelerate==0.27.2

Again, this works for me so far only with Lora training. I'm not sure whether this will introduce other errors.

but i meet some problems when pip, how can you solve this :

just ignore it

davidhalladay · 2024-04-26T01:17:26Z

On my end, this compatibility issue only causes errors during testing. Therefore, I maintain two separate conda environments: one for training (with transformers==4.39.3) and one for testing (with transformers==4.37.1). While this setup may seem redundant, it offers a quick solution to address the problem.

user074 · 2024-04-30T02:55:40Z

I encountered this error while resuming the checkpoint of Lora training. I found that this is basically due to the old version of Transformers that LLaVA is using. Please refer to this issue: huggingface/peft#746.

In this issue, the names of keys in the checkpoint saved via deepspeed have some mismatches with those saved via Transformers. Specifically, there is an added ".default." in each key of the non-trainable parameters, leading to errors while loading the checkpoint.

Here is a solution that I found. I have only tested it for Lora training, and it works well. However, I haven't tested it for other features, thus, it may potentially introduce further errors:

This "mismatch" problem has been solved in the latest Transformers package. Thus, we need to update the Transformers package to the latest version:
pip install transformers==4.39.3

And then we need to update Accelerate as well based on the version of Transformers:
pip install accelerate==0.27.2

Again, this works for me so far only with Lora training. I'm not sure whether this will introduce other errors.

Thanks! Solved my issue. I tried to save and load the LoRA checkpoints but had problems for a while

wenyisir · 2024-05-09T04:42:26Z

I fixed this bug by modifying it: site-packages/deepspeed/runtime/engine.py line 2675 load_module_strict=Fasle

tetsu-kikuchi · 2024-05-09T04:57:06Z

I am afraid that non_lora_trainables.bin will not be loaded by just setting trainer.train(resume_from_checkpoint=True), because non_lora_trainables.bin is a name only specific to LLaVA and is outside the scope of huggingface.
Could anyone clarify this point?

Added : It seems that non_lora_trainables.bin is not even saved at intermediate saving steps (at every args.save_steps iterations). It is saved only when the whole training schedule is ended. In any case, I am afraid that non_lora_trainables.bin will not be loaded by using huggingface APIs, including other ways such as in #1027

Maybe we have to insert a code to load non_lora_trainables.bin in llava/train/train.py, just as is done, for example, in llava/eval/model_vqa.py. I would appreciate comments if I am misunderstanding.

bang123-box · 2024-10-11T11:11:05Z

I am afraid that non_lora_trainables.bin will not be loaded by just setting trainer.train(resume_from_checkpoint=True), because non_lora_trainables.bin is a name only specific to LLaVA and is outside the scope of huggingface. Could anyone clarify this point?

Added : It seems that non_lora_trainables.bin is not even saved at intermediate saving steps (at every args.save_steps iterations). It is saved only when the whole training schedule is ended. In any case, I am afraid that non_lora_trainables.bin will not be loaded by using huggingface APIs, including other ways such as in #1027

Maybe we have to insert a code to load non_lora_trainables.bin in llava/train/train.py, just as is done, for example, in llava/eval/model_vqa.py. I would appreciate comments if I am misunderstanding.

hello, I am also confused about this question. We should reload non_lora_trainables.bin to projector, but in the checkpoint files, I cannot find this non_lora_trainables.bin file, so what I need to do to save non_lora_trainables.bin in the saving checkpoint stage?

zengxingchen changed the title ~~[Usage] resume_from_checkpoint always failed when lora finetuning~~ [Usage] resume_from_checkpoint fails when finetuning in the lora settings Feb 29, 2024

user074 mentioned this issue May 1, 2024

[Usage] finetune_task_lora.sh checkpoints usage #1423

Open

tetsu-kikuchi mentioned this issue May 9, 2024

[Usage] errors when restore checkpoint using lora finetuning #1492

Open

yuhangzang mentioned this issue Aug 30, 2024

2d5-7b : I found the LoRA-checkpoint saved with multiple gpu is incorrect InternLM/InternLM-XComposer#426

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Usage] `resume_from_checkpoint` fails when finetuning in the lora settings #1200

[Usage] `resume_from_checkpoint` fails when finetuning in the lora settings #1200

zengxingchen commented Feb 29, 2024

zengxingchen commented Feb 29, 2024

CynthiaChuang commented Mar 14, 2024

qingyuanxingsi commented Mar 25, 2024

sunhm15 commented Apr 1, 2024

davidhalladay commented Apr 8, 2024

zhipeixu commented Apr 25, 2024 •

edited

Loading

Linjyan00 commented Apr 25, 2024

davidhalladay commented Apr 26, 2024 •

edited

Loading

user074 commented Apr 30, 2024

wenyisir commented May 9, 2024

tetsu-kikuchi commented May 9, 2024 •

edited

Loading

bang123-box commented Oct 11, 2024

[Usage] resume_from_checkpoint fails when finetuning in the lora settings #1200

[Usage] resume_from_checkpoint fails when finetuning in the lora settings #1200

Comments

zengxingchen commented Feb 29, 2024

Describe the issue

zengxingchen commented Feb 29, 2024

CynthiaChuang commented Mar 14, 2024

qingyuanxingsi commented Mar 25, 2024

sunhm15 commented Apr 1, 2024

davidhalladay commented Apr 8, 2024

zhipeixu commented Apr 25, 2024 • edited Loading

Linjyan00 commented Apr 25, 2024

davidhalladay commented Apr 26, 2024 • edited Loading

user074 commented Apr 30, 2024

wenyisir commented May 9, 2024

tetsu-kikuchi commented May 9, 2024 • edited Loading

bang123-box commented Oct 11, 2024

[Usage] `resume_from_checkpoint` fails when finetuning in the lora settings #1200

[Usage] `resume_from_checkpoint` fails when finetuning in the lora settings #1200

zhipeixu commented Apr 25, 2024 •

edited

Loading

davidhalladay commented Apr 26, 2024 •

edited

Loading

tetsu-kikuchi commented May 9, 2024 •

edited

Loading