Fix checkpoint conversion when model layers share weights #3825
Hey DeepSpeed team @tjruwase @ShijieZZZZ @jeffra, could you take a look at my bugfix here? I appreciate the feedback.
@awaelchli, thanks for the PR. We will review immediately.
I would really appreciate some feedback on the changes. Several users have reported this problem in Lightning (Lightning-AI/pytorch-lightning#16277, Lightning-AI/pytorch-lightning#15694). Even if the answer is no, I'd like to know. I would much prefer to have the fix in DeepSpeed: if this PR can't be merged, we would have to patch the DeepSpeedEngine method from within Lightning. That would add maintenance effort for us to keep the code updated, and the issue would still be there for regular DeepSpeed users.
@awaelchli, apologies for the delay. PR looks good.
Amazing, thanks for taking another look!
Fixes #3824
The added test case fails on master.