
Save ZeRO3 (partitioned) fp16 weights #882

Closed
tjruwase wants to merge 2 commits

Conversation

tjruwase
Contributor

Save ZeRO3 (partitioned) fp16 weights. This is a first step toward using ZeRO3 weights outside DeepSpeed (#872).

@stas00
Collaborator

stas00 commented Mar 19, 2021

That still leaves the partitions separate, so this is great if a user wants to load each partition separately, but it doesn't help when a user needs the model weights consolidated.

Also, I don't think this PR should do this by default, since it adds overhead that most users won't need. It should be configurable.

And, as suggested elsewhere, the model_states.pt file with fake weights probably shouldn't be saved at all: it only confuses users who try to load it, and loading it is guaranteed to fail.
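To illustrate the consolidation point above, here is a hypothetical sketch (not DeepSpeed code; `consolidate`, `shards`, and `ds_shape` are made-up names): rebuilding a full parameter requires every rank's flat shard, gathered in rank order, before the flat buffer can be reshaped.

```python
# Hypothetical sketch, NOT DeepSpeed API: consolidating ZeRO-3-style
# flat per-rank shards back into a full parameter. Assumes shards are
# given in rank order and may carry trailing alignment padding.
def consolidate(shards, ds_shape):
    rows, cols = ds_shape
    ds_numel = rows * cols
    # Concatenate the flat 1D shards, then drop any padding elements.
    flat = [x for shard in shards for x in shard][:ds_numel]
    # Reshape the flat buffer into the original 2D shape.
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

# Example: a (2, 3) parameter split into two flat shards of 3 elements.
full = consolidate([[0, 1, 2], [3, 4, 5]], (2, 3))
```

The key point is that no single rank can do this alone; consolidation is only possible once all partitions are available.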

def save_partitioned_weights(self, state_dict):
    for name, param in self.module.named_parameters():
        if name in state_dict.keys():
            state_dict[name] = param.ds_tensor
@stas00 commented Mar 24, 2021


Found an issue here: param.ds_tensor at this point appears to be a flattened buffer, so state_dict ends up being populated with 1D vectors.


But we can't reshape it back to the original, since we only have a part of the tensor; something like narrow(0, 0, param.ds_numel).view(param.ds_shape) from _allgather_param() won't work, and the full shape has no meaning for a single partition anyway.

So this line of logic is useful when it's used to load the param.ds_tensor directly by each gpu, as coded in the rest of this PR.

I just tried to use it to get the partitioned fp16 weights, but now I understand this is not possible using this approach.

Bottom line: there is no problem here. I just needed to understand that what is being saved is not a real state_dict but something like a flattened_params_state_dict.

All is good!
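A minimal sketch of the reshape problem discussed above, using made-up numbers rather than DeepSpeed internals: a single rank's flat shard holds only ds_numel / world_size elements, so viewing it as ds_shape cannot work.

```python
# Hypothetical numbers, not taken from DeepSpeed: show why one rank's
# flat partition cannot be viewed as the full parameter shape.
ds_shape = (4, 3)                      # original parameter shape
ds_numel = ds_shape[0] * ds_shape[1]   # 12 elements in the full tensor
world_size = 2
shard_numel = ds_numel // world_size   # each rank holds a flat shard of 6

# A view/reshape to ds_shape needs ds_numel elements; the shard is short.
assert shard_numel < ds_numel
```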

@tjruwase
Contributor Author

Made redundant by #892 and #893.

@tjruwase closed this Mar 26, 2021