[Usage] resume_from_checkpoint fails when finetuning in the LoRA setting #1200

Describe the issue
resume_from_checkpoint always fails when finetuning in the LoRA setting. I think the code is trying to resume from the checkpoint as if it were a full-parameter fine-tuning checkpoint.

Comments
I have the same issue. Can anyone tell me how to fix it?

+1

+1
I encountered this error while resuming a checkpoint from LoRA training. I found that it is basically due to the old version of Transformers that LLaVA uses; please refer to this issue: huggingface/peft#746. The key names in a checkpoint saved via DeepSpeed do not match those saved via Transformers: there is an extra ".default." segment in each key of the non-trainable parameters, which leads to errors while loading the checkpoint. Here is a solution that I found. I have only tested it for LoRA training, and it works well; I haven't tested it with other features, so it may introduce further errors:
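A minimal sketch of this kind of key remapping (not the original snippet), assuming the only mismatch is the extra ".default." segment described above. The helper names, the checkpoint filename, and the strict=False load are illustrative assumptions, and the direction of the rename may need to be reversed depending on which side carries the extra segment:

```python
import torch


def strip_default_segment(state_dict):
    """Rename keys like '...q_proj.lora_A.default.weight' to
    '...q_proj.lora_A.weight' so a DeepSpeed-saved state dict lines up
    with the parameter names expected when resuming."""
    return {key.replace(".default.", "."): value for key, value in state_dict.items()}


def load_remapped(model, checkpoint_file):
    """Hypothetical helper: load a LoRA checkpoint into an already
    LoRA-wrapped model after remapping the mismatched keys."""
    state_dict = torch.load(checkpoint_file, map_location="cpu")
    # strict=False: the frozen base-model weights are already in place;
    # only the adapter (and other trainable) tensors come from the checkpoint.
    missing, unexpected = model.load_state_dict(strip_default_segment(state_dict), strict=False)
    return missing, unexpected
```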
Again, this works for me so far only with LoRA training. I'm not sure whether this will introduce other errors.

just ignore it
On my end, this compatibility issue only causes errors during testing. Therefore, I maintain two separate conda environments: one for training (with transformers==4.39.3) and one for testing (with transformers==4.37.1). While this setup may seem redundant, it offers a quick way to work around the problem.
Thanks! This solved my issue. I had been trying to save and load the LoRA checkpoints and had problems for a while.

I fixed this bug by modifying it:
I am afraid that the projector weights are not reloaded when resuming. Added: it seems that only the LoRA weights are restored from the checkpoint. Maybe we have to insert code to load non_lora_trainables.bin as well.
Hello, I am also confused about this. We should reload non_lora_trainables.bin into the projector, but I cannot find a non_lora_trainables.bin file in the checkpoint directories. What do I need to do to save non_lora_trainables.bin at the checkpoint-saving stage?
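One possible direction (a sketch, not LLaVA's actual code): LLaVA appears to write non_lora_trainables.bin only at the very end of training, so intermediate checkpoints never contain it. A Hugging Face TrainerCallback could dump the non-LoRA trainable weights (such as the mm_projector) next to each checkpoint. The callback name and the "lora_" filter below are assumptions, and under DeepSpeed ZeRO-3 the parameters would additionally need to be gathered before saving:

```python
import os

import torch
from transformers import TrainerCallback


class SaveNonLoraCallback(TrainerCallback):
    """Hypothetical callback: save the non-LoRA trainable weights
    (e.g. the mm_projector) alongside every intermediate checkpoint."""

    def on_save(self, args, state, control, model=None, **kwargs):
        # Keep parameters that are trainable but are not LoRA adapter weights.
        non_lora = {
            name: param.detach().cpu().clone()
            for name, param in model.named_parameters()
            if param.requires_grad and "lora_" not in name
        }
        # Transformers names checkpoint folders "checkpoint-<global_step>".
        ckpt_dir = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
        if os.path.isdir(ckpt_dir):
            torch.save(non_lora, os.path.join(ckpt_dir, "non_lora_trainables.bin"))
        return control


# Assumed usage: register it before training starts.
# trainer.add_callback(SaveNonLoraCallback())
```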