How to modify weights during training in a deepspeed stage 3 model #3830
Comments
@lucasosouza, thanks for your question. We have recently added support for examining various weight and optimizer state values, documented here. It sounds like you want the inverse functionality, i.e., the ability to modify these values. Is that correct?
You might also find this documentation of BLOOM model training useful, in particular the sections on on-the-fly fixes of model bugs.
Yeah, exactly, we need to modify the weights. I am going through the …
There is no other interface in deepspeed for modifying distributed weights. The above is our first attempt to provide this kind of support. The current support does not even apply to registered buffers.
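The gather-and-modify pattern discussed above can be sketched as follows. This is a hedged illustration rather than code from the thread: it assumes a live DeepSpeed ZeRO stage-3 engine, and `model`, the parameter name, and `scale_parameter` are placeholders of my own.

```python
# Hedged sketch: editing a ZeRO stage-3 partitioned weight in place.
# Assumes a running DeepSpeed engine; nothing here comes from the thread.

def scale_parameter(model, name, factor):
    """Multiply one named parameter by `factor` across all ranks."""
    import torch.distributed as dist
    import deepspeed  # imported lazily so the sketch loads without deepspeed

    param = dict(model.named_parameters())[name]
    # GatheredParameters materializes the full (unpartitioned) parameter on
    # every rank; with modifier_rank=0, the in-place edit made on rank 0 is
    # kept when the parameter is re-partitioned on context exit.
    with deepspeed.zero.GatheredParameters(param, modifier_rank=0):
        if dist.get_rank() == 0:
            param.data.mul_(factor)
```

Note that because the context manager re-partitions on exit, the edit must happen inside the `with` block, not after it.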
@lucasosouza, how are things going with this? Did you find the …
@lucasosouza, @dsj96, can you please provide feedback on the linked PR #4192? Thanks!
PR #4192 solves the use case we have! Thanks a lot.
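For readers landing here later, a setter-style API along these lines is what resolved the issue. The function names `safe_get_full_fp32_param` / `safe_set_full_fp32_param` below are my assumption about `deepspeed.utils` and may not match exactly what PR #4192 merged, so verify against your installed DeepSpeed version; `model` and `zero_out_parameter` are placeholders.

```python
# Hedged sketch of a getter/setter pair for ZeRO-partitioned fp32 weights.
# Function names are assumed from deepspeed.utils; check your version.

def zero_out_parameter(model, name):
    """Overwrite one named parameter's fp32 master copy with zeros."""
    import torch
    from deepspeed.utils import (
        safe_get_full_fp32_param,
        safe_set_full_fp32_param,
    )

    param = dict(model.named_parameters())[name]
    full = safe_get_full_fp32_param(param)  # gathered fp32 master weights
    if full is not None:  # may be None depending on stage/partitioning
        safe_set_full_fp32_param(param, torch.zeros_like(full))
```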
Closing the issue since the PR is merged.
This issue is more of a general question than a bug report: we are working with deepspeed stage 3 and model parallelization (training 10+ billion parameter LLMs). We need to be able to modify weights every N steps during training, but outside of the regular optimization process.
I've seen multiple ways of exporting and accessing weights from deepspeed models, but I haven't found any methods to modify them directly during training.
One option would be during saving/loading: although I can load the weights using `get_fp32_state_dict_from_zero_checkpoint`, and possibly modify the state dict, I also can't find a way to reload this modified state dict into the model and continue training with the new set of weights.

I am currently using deepspeed as the strategy in Fabric (pytorch-lightning). Appreciate the help or any pointers.
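Independent of which DeepSpeed mechanism ends up exposing the weights, the "every N steps, outside the optimizer" hook the question asks about can be sketched framework-agnostically. Here `apply_weight_edit` is a placeholder for the actual gather-and-modify call, and `run_steps` is my own illustrative name.

```python
# Hedged sketch of an out-of-band weight edit fired every N training steps.
# `apply_weight_edit` stands in for whatever DeepSpeed API does the edit.

def run_steps(total_steps, edit_every_n, apply_weight_edit):
    """Run `total_steps` steps; return the steps at which the edit fired."""
    fired = []
    for step in range(1, total_steps + 1):
        # ... forward / backward / optimizer.step() would go here ...
        if step % edit_every_n == 0:
            apply_weight_edit(step)  # out-of-band modification hook
            fired.append(step)
    return fired
```

For example, `run_steps(10, 3, lambda s: None)` fires the edit at steps 3, 6, and 9.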