Accelerate FSDP always removed {'model.norm.weight'} layer of model when saving them #2155
Comments
cc @pacman100
Hello @gauss5930, what are the contents of the final output directory?
Hello @pacman100! Unfortunately, I don't have any figures to show you how my model was uploaded. Would this have provided enough information? If you have any further questions, please feel free to ask! Thank you.
There is something additional I just discovered when I checked. I hope this information helps!
One thing to add: could it be that Accelerate simply removes the weight based on just that? I could not comprehend this function: https://github.com/huggingface/accelerate/blob/main/src/accelerate/utils/other.py#L147
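For anyone else puzzled by that function: savers built on safetensors deduplicate tensors that share the same underlying storage, keeping only one name per storage and warning about the rest. The sketch below (all names hypothetical, using a stub in place of a real `torch.Tensor`, whose pointer would come from `tensor.untyped_storage().data_ptr()`) shows the general idea, not Accelerate's actual implementation:

```python
from collections import defaultdict

class FakeTensor:
    """Stub standing in for a torch.Tensor; storage_ptr mimics a storage address."""
    def __init__(self, storage_ptr):
        self.storage_ptr = storage_ptr

def find_shared(state_dict):
    """Group parameter names by storage pointer and return the names that
    would be dropped because another name already covers that storage."""
    groups = defaultdict(list)
    for name, tensor in state_dict.items():
        groups[tensor.storage_ptr].append(name)
    removed = []
    for names in groups.values():
        # Keep one name per storage, drop the rest -- this is what produces
        # warnings like "Removed shared tensor {...} while saving".
        removed.extend(sorted(names)[1:])
    return removed

shared = FakeTensor(0x1000)
state = {
    "lm_head.weight": shared,            # tied to the embedding
    "model.embed_tokens.weight": shared,
    "model.norm.weight": FakeTensor(0x2000),
}
print(find_shared(state))  # → ['model.embed_tokens.weight']
```

If `model.norm.weight` ever ends up sharing storage with another parameter (e.g. as a side effect of how FSDP flattens and regathers parameters), logic of this shape would silently drop it from the checkpoint.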
I also ran into this problem, and here is my config:

compute_environment: LOCAL_MACHINE
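For reference, an Accelerate FSDP config file generated by `accelerate config` typically has roughly this shape (the values below are illustrative, not this commenter's actual settings):

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
mixed_precision: bf16
num_machines: 1
num_processes: 2
```

The `fsdp_state_dict_type` setting is worth checking in particular, since it controls how the sharded parameters are gathered into the state dict that gets saved.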
Also running into a similar issue with the `run_mlm_no_trainer` example, then running into errors when trying to use `accelerator.load_state`.
@pranaydeeps probably related to huggingface/transformers#27293 & huggingface/transformers#27972 for your part there
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Had a similar issue with a GRU in my model, where I was getting
@gauss5930, just want to ask did you solve your issue? Thank you
System Info
Information
Tasks
- `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)

Reproduction
Run my own fine-tuning code with the Accelerate FSDP config mentioned above on 2 * A100 80G GPUs. I also used `use_flash_attention_2=True` and `gradient_checkpointing=True`. The following command and code were used for fine-tuning. Originally I set the epochs to 3 and max_step to 'max', but I changed the hyperparameter values to reach the error message sooner.

Expected behavior
The training went very well, but a problem occurred when saving and uploading the model to the Hugging Face Hub. Before starting to upload the model, the code execution log showed the message

> Removed shared tensor {'model.norm.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading

Although that message worried me, I just moved on. However, I hit a serious problem when loading my fine-tuned model. The error message is as follows.
I spent a lot of time trying to solve this problem by googling and asking a question on the Hugging Face Forum, but I was not able to find any solution or anything that helped me get past this obstacle. Please let me know how to solve this problem! I really want to save the model completely.
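As a first diagnostic step, it can help to confirm exactly which tensors went missing by diffing the parameter names the model expects against the names actually stored in the checkpoint. The helper and key lists below are illustrative (not the real full model):

```python
# Hypothetical helper: report which model parameters are absent from the
# saved checkpoint, i.e. the tensors the saver dropped.
def dropped_keys(model_keys, checkpoint_keys):
    """Return model parameter names missing from the checkpoint, sorted."""
    return sorted(set(model_keys) - set(checkpoint_keys))

# Illustrative key sets:
model_keys = ["model.embed_tokens.weight", "model.norm.weight", "lm_head.weight"]
checkpoint_keys = ["model.embed_tokens.weight", "lm_head.weight"]
print(dropped_keys(model_keys, checkpoint_keys))  # → ['model.norm.weight']
```

With a real checkpoint, the model-side names could come from `model.state_dict().keys()` and the checkpoint-side names from `safetensors.safe_open(path, framework="pt").keys()`. Several reports of this warning also suggest trying `save_pretrained(..., safe_serialization=False)` as a workaround, since that falls back to the older pickle-based format and bypasses the safetensors shared-tensor deduplication.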