How to modify weights during training in a deepspeed stage 3 model #3830

Closed
lucasosouza opened this issue Jun 28, 2023 · 8 comments · Fixed by #4192
lucasosouza commented Jun 28, 2023

This issue is more of a general question than a bug report: we are working with deepspeed stage 3 and model parallelism, training LLMs with 10+ billion parameters. We need to be able to modify the weights every N steps during training, outside of the regular optimization process.

I've seen multiple ways of exporting and accessing weights from deepspeed models, but I haven't found any methods to modify them directly during training.

One option would be to do it through saving/loading: I can load the weights using get_fp32_state_dict_from_zero_checkpoint and modify the state dict, but I can't find a way to reload the modified state dict into the model and continue training with the new set of weights (rough sketch below).
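To make that concrete, here is roughly what the checkpoint route looks like; the checkpoint path and the transformation are just placeholders, and the last step is the part I can't figure out:

```python
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

checkpoint_dir = "./checkpoints/latest"  # placeholder path

# Consolidates the partitioned ZeRO-3 shards into a single fp32 state dict on CPU.
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)

# The kind of modification we need every N steps (placeholder transformation).
for name in state_dict:
    state_dict[name] = state_dict[name] * 0.99

# Missing piece: no obvious way to load this modified state dict back into the
# running ZeRO stage 3 engine and continue training with the new weights.
```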

I am currently using deepspeed as the strategy in Fabric (PyTorch Lightning). I'd appreciate any help or pointers.

tjruwase (Contributor) commented:

@lucasosouza, thanks for your question. We have recently added support for examining various weight and optimizer state values, documented here. It sounds like you want the inverse functionality that can modify these values. Is that correct?
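For concreteness, a minimal sketch of that read-only inspection path, assuming the helpers in question are the safe_get_* utilities in deepspeed.utils (the every-N-step condition and the "exp_avg" key are illustrative):

```python
# Read-only inspection sketch, assuming the safe_get_* utilities in
# deepspeed.utils are the documented helpers referred to above.
from deepspeed.utils import (
    safe_get_full_fp32_param,
    safe_get_full_grad,
    safe_get_full_optimizer_state,
)

def inspect_model(model, step, every_n=100):
    # Must run on every rank: the helpers gather values of ZeRO-3 partitioned params.
    if step % every_n != 0:
        return
    for name, param in model.named_parameters():
        weight = safe_get_full_fp32_param(param)                   # full fp32 weight
        grad = safe_get_full_grad(param)                           # full fp32 gradient (None if unavailable)
        exp_avg = safe_get_full_optimizer_state(param, "exp_avg")  # Adam first moment
        # Values can be logged or checked here, but these helpers are read-only.
```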

tjruwase (Contributor) commented Jun 28, 2023

Please take a look at the sync_layer_norm example to see if it addresses your needs.

lucasosouza (Author) commented:

> It sounds like you want the inverse functionality that can modify these values. Is that correct?
Yeah, exactly, we need to modify the weights. I am going through the sync_layer_norm example to see if it applies to our use case. Would you know if there is any other available interface in deepspeed to modify the distributed weights (or, alternatively, to modify a buffer registered to an nn.Module)?

tjruwase (Contributor) commented:

There is no other interface in deepspeed for modifying distributed weights. The above is our first attempt to provide this kind of support. The current support does not even apply to registered buffers.

tjruwase (Contributor) commented:

@lucasosouza, how are things going with this? Did you find the sync_layer_norm example useful? Thanks.

tjruwase (Contributor) commented:

@lucasosouza, @dsj96, can you please provide feedback on the linked PR #4192? Thanks!

lucasosouza (Author) commented:

PR #4192 solves the use case we have! Thanks a lot.
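For anyone landing here later, a rough sketch of how such a write path can be used during training; the safe_set_* / safe_get_* names below are an assumption about the utilities in deepspeed.utils rather than a quote of the PR, and the every-N-step rule and rescaling are placeholders:

```python
# Hedged sketch: assumes safe_get_full_fp32_param and a matching setter
# safe_set_full_fp32_param exist in deepspeed.utils; the actual API introduced
# by the linked PR may differ in names or signatures.
import torch
from deepspeed.utils import safe_get_full_fp32_param, safe_set_full_fp32_param

def modify_weights(model, step, every_n=100):
    # Call on every rank so the underlying gather/scatter collectives line up.
    if step % every_n != 0:
        return
    with torch.no_grad():
        for name, param in model.named_parameters():
            full = safe_get_full_fp32_param(param)      # gather the full fp32 weight
            if full is None:
                continue
            new_value = full * 0.99                     # placeholder modification
            safe_set_full_fp32_param(param, new_value)  # write back into the ZeRO-3 partitions
```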

lucasosouza (Author) commented:

Closing the issue now that the PR has been merged.
