Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix checkpoint conversion when model layers share weights #3825

Merged

Conversation

awaelchli
Copy link
Contributor

@awaelchli awaelchli commented Jun 27, 2023

Fixes #3824

The added test case fails on master.

@awaelchli
Copy link
Contributor Author

@microsoft-github-policy-service agree

@awaelchli awaelchli marked this pull request as ready for review June 27, 2023 09:27
@awaelchli
Copy link
Contributor Author

Hey DeepSpeed team @tjruwase @ShijieZZZZ @jeffra, could you take a look at my bugfix here? I appreciate the feedback.

@tjruwase
Copy link
Contributor

tjruwase commented Jul 3, 2023

@awaelchli, thanks for the PR. We will review immediately.

@tjruwase tjruwase requested review from ShijieZZZZ and removed request for jeffra and mrwyattii July 3, 2023 15:41
@awaelchli
Copy link
Contributor Author

I would really appreciate it if I could get some feedback on the changes. We have several users who reported seeing this problem in Lightning (Lightning-AI/pytorch-lightning#16277, Lightning-AI/pytorch-lightning#15694).

Even if the answer is no, I'd like to know. I would really prefer to have the fix in DeepSpeed, because if this PR can't be merged, we would have to patch the DeepSpeedEngine method from within Lightning. This would add additional maintenance effort for us as to keep the code updated, and then the issue would still be there for regular deepspeed users.

@tjruwase
Copy link
Contributor

@awaelchli, apologies for the delay. PR looks good.

@awaelchli
Copy link
Contributor Author

Amazing, thanks for taking another look!

@tjruwase tjruwase added this pull request to the merge queue Jul 19, 2023
Merged via the queue into deepspeedai:master with commit fb9aebb Jul 19, 2023
@awaelchli awaelchli deleted the bugfix/stage-2-recover-shared-weights branch July 19, 2023 09:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Unable to reconstruct full weights from ZeRO-2 checkpoint when layers share weights
3 participants