How to modify weights during training in a deepspeed stage 3 model #3830
Comments
@lucasosouza, thanks for your question. We have recently added support for examining various weight and optimizer state values, documented here. It sounds like you want the inverse functionality, i.e., the ability to modify these values. Is that correct?
You might also find this documentation of BLOOM model training useful, in particular the sections on on-the-fly fixes of model bugs.
Yeah, exactly, we need to modify the weights. I am going through the …
There is no other interface in deepspeed for modifying distributed weights. The above is our first attempt to provide this kind of support. The current support does not even apply to registered buffers.
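The gather-and-modify pattern discussed above can be sketched as follows. This is a hedged illustration rather than code from the thread: it assumes a live DeepSpeed ZeRO stage-3 engine, and `model`, the parameter name, and `scale_parameter` are placeholders of my own.

```python
# Hedged sketch: editing a ZeRO stage-3 partitioned weight in place.
# Assumes a running DeepSpeed engine; nothing here comes from the thread.

def scale_parameter(model, name, factor):
    """Multiply one named parameter by `factor` across all ranks."""
    import torch.distributed as dist
    import deepspeed  # imported lazily so the sketch loads without deepspeed

    param = dict(model.named_parameters())[name]
    # GatheredParameters materializes the full (unpartitioned) parameter on
    # every rank; with modifier_rank=0, the in-place edit made on rank 0 is
    # kept when the parameter is re-partitioned on context exit.
    with deepspeed.zero.GatheredParameters(param, modifier_rank=0):
        if dist.get_rank() == 0:
            param.data.mul_(factor)
```

Note that because the context manager re-partitions on exit, the edit must happen inside the `with` block, not after it.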
@lucasosouza, how are things going with this? Did you find the …
@lucasosouza, @dsj96, can you please provide feedback on the linked PR #4192? Thanks!
PR #4192 solves the use case we have! Thanks a lot.
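For readers landing here later, a setter-style API along these lines is what resolved the issue. The function names `safe_get_full_fp32_param` / `safe_set_full_fp32_param` below are my assumption about `deepspeed.utils` and may not match exactly what PR #4192 merged, so verify against your installed DeepSpeed version; `model` and `zero_out_parameter` are placeholders.

```python
# Hedged sketch of a getter/setter pair for ZeRO-partitioned fp32 weights.
# Function names are assumed from deepspeed.utils; check your version.

def zero_out_parameter(model, name):
    """Overwrite one named parameter's fp32 master copy with zeros."""
    import torch
    from deepspeed.utils import (
        safe_get_full_fp32_param,
        safe_set_full_fp32_param,
    )

    param = dict(model.named_parameters())[name]
    full = safe_get_full_fp32_param(param)  # gathered fp32 master weights
    if full is not None:  # may be None depending on stage/partitioning
        safe_set_full_fp32_param(param, torch.zeros_like(full))
```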
Closing the issue since the PR is merged.
This issue is more of a general question than a bug report: we are working with deepspeed stage 3 and model parallelization (training 10+ billion parameter LLMs). We need to be able to modify weights every N steps during training, but outside of the regular optimization process.
I've seen multiple ways of exporting and accessing weights from deepspeed models, but I haven't found any methods to modify them directly during training.
One option would be during saving/loading: although I can load the weights using `get_fp32_state_dict_from_zero_checkpoint`, and possibly modify the state dict, I also can't find a way to reload this modified state dict into the model and continue training with the new set of weights.

I am currently using deepspeed as the strategy in Fabric (pytorch-lightning). Appreciate the help or any pointers.
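Independent of which DeepSpeed mechanism ends up exposing the weights, the "every N steps, outside the optimizer" hook the question asks about can be sketched framework-agnostically. Here `apply_weight_edit` is a placeholder for the actual gather-and-modify call, and `run_steps` is my own illustrative name.

```python
# Hedged sketch of an out-of-band weight edit fired every N training steps.
# `apply_weight_edit` stands in for whatever DeepSpeed API does the edit.

def run_steps(total_steps, edit_every_n, apply_weight_edit):
    """Run `total_steps` steps; return the steps at which the edit fired."""
    fired = []
    for step in range(1, total_steps + 1):
        # ... forward / backward / optimizer.step() would go here ...
        if step % edit_every_n == 0:
            apply_weight_edit(step)  # out-of-band modification hook
            fired.append(step)
    return fired
```

For example, `run_steps(10, 3, lambda s: None)` fires the edit at steps 3, 6, and 9.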