fix step in adam #1823
Conversation
@szhengac, thanks!
Just became aware of issues I had overlooked.
@szhengac, apologies, I am reverting my approval because my colleagues just made me aware of some issues that require a bit more time to understand. Primarily, this fix changes the behavior of the non-stage-3 paths and will break existing checkpoints since it changes the checkpoint state. Ideally, since only ZeRO stage 3 exhibits the problem, the fix should probably live in the stage 3 code rather than in the shared fused Adam code. Apologies for the confusion.
@szhengac, could you please provide a unit test that demonstrates the original ZeRO stage 3 problem? Thanks!
@tjruwase I can provide a unit test. The main problem is that we get the wrong optimizer behavior, since different parameters end up using different step_id's at the same iteration. I also checked the implementation of the other optimizers in DeepSpeed, for example fused LAMB: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/ops/lamb/fused_lamb.py#L163
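To make the concern concrete, here is a minimal illustration of the bias-correction arithmetic only (the step values are hypothetical and this is not DeepSpeed code):

```python
beta1, beta2 = 0.9, 0.999

def bias_corrections(step):
    # Adam's bias-correction terms depend directly on the step count, so two
    # parameters seeing different step values at the same iteration are
    # effectively updated with different learning-rate scalings.
    return 1 - beta1 ** step, 1 - beta2 ** step

# e.g. one sub-group at step 10 and another at step 12 within one iteration
print(bias_corrections(10))  # (0.6513..., 0.00995...)
print(bias_corrections(12))  # (0.7175..., 0.01193...)
```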
@szhengac, how about making these two changes for backwards compatibility with existing stage < 3 checkpoints?
I will annotate the PR with these suggestions.
deepspeed/ops/adam/fused_adam.py (outdated)
@@ -131,6 +124,8 @@ def step(self,
                state = self.state[p]
                # State initialization
                if len(state) == 0:
                    # DeepSpeed ZeRO stage 3 processes one sub-group at a time, so we need to track the step for each tensor separately
                    state['step'] = 0
Suggested change: state['step'] = group.get('step', 0)
This looks like a good idea. I will also add a more detailed comment to explain it; otherwise other readers may get confused.
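For reference, a rough sketch of what the suggested initialization could look like (simplified and hypothetical; the real fused_adam.py step loop carries more state and kernel plumbing):

```python
import torch

def init_param_state(state, group, p):
    # Hypothetical helper, not part of fused_adam.py: `state` is the
    # optimizer's per-parameter state dict and `group` the param_group dict.
    if len(state) == 0:
        # ZeRO stage 3 steps one sub-group at a time, so the step count has to
        # live per tensor. Falling back to group['step'] lets old (stage < 3)
        # checkpoints, which stored a group-level step, keep loading correctly.
        state['step'] = group.get('step', 0)
        state['exp_avg'] = torch.zeros_like(p.data)     # first moment
        state['exp_avg_sq'] = torch.zeros_like(p.data)  # second moment
```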
@tjruwase any update on this PR?
This looks good to me. Is it possible for you to do some e2e runs comparing the loss curves before and after this change for ZeRO < 3?
@tjruwase Sure, I will do it when I have capacity for the experiments. I have done some e2e fine-tuning with ZeRO 3 after the fix; it improves accuracy by 6-7% when there are two param groups.
That is exciting news about the ZeRO 3 improvement :).
@szhengac, just checking if you had time to push this? Thanks!
@tjruwase the conflict has been resolved.
@szhengac, thanks for resolving the conflict. Did you get a chance to measure the convergence impact for ZeRO < 3? Thanks!
I haven't had spare machine capacity to test it yet.
@tjruwase I have done some loss testing and I obtain exactly the same loss curve.
But at the same time, I also realized that ZeRO 1 and 2 give different loss trajectories regardless of this PR. A similar issue was reported before: #966
I thought we had resolved the loss divergence across the ZeRO stages for that issue, right?
I thought so. But the following script gives me different results:
Got it. This means we have a regression in DeepSpeed that is independent of this PR. Could you please create a new issue with your test case? Thanks so much for sharing the test case, in addition to this PR. The PR is good to merge.
* fix step in adam
* fix backward compatibility and add unittest
* add unittest
* fix unbounded error when there are more than 1 param groups
* fix typo
* remove trailing whitespace
* fix end of file

Co-authored-by: Shuai Zheng <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
fix step in adam (deepspeedai#1823)
DeepSpeed ZeRO performs the optimization step on each sub-group independently, but the Adam implementation uses a single step counter for the whole param group, so different sub-groups end up using different step values when computing the exponential weights for the first- and second-order momentum. This PR fixes that issue.
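For readers skimming the merged change, a simplified sketch of the pattern the fix moves to: the step counter lives in each parameter's state rather than in the param group, so per-sub-group step() calls under ZeRO stage 3 no longer skew the bias correction. This is a toy, hypothetical re-implementation for illustration only, not the fused CUDA optimizer:

```python
import torch

def adam_step(params, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """Toy Adam update with a per-parameter step counter (illustration only).

    `state` maps each parameter tensor to its own dict, mirroring how
    torch.optim.Optimizer keeps per-parameter state.
    """
    beta1, beta2 = betas
    for p in params:
        if p.grad is None:
            continue
        if p not in state:
            state[p] = {
                'step': 0,
                'exp_avg': torch.zeros_like(p),
                'exp_avg_sq': torch.zeros_like(p),
            }
        st = state[p]
        st['step'] += 1  # incremented per tensor, not once per param group
        st['exp_avg'].mul_(beta1).add_(p.grad, alpha=1 - beta1)
        st['exp_avg_sq'].mul_(beta2).addcmul_(p.grad, p.grad, value=1 - beta2)
        bias_c1 = 1 - beta1 ** st['step']
        bias_c2 = 1 - beta2 ** st['step']
        denom = (st['exp_avg_sq'] / bias_c2).sqrt().add_(eps)
        p.data.addcdiv_(st['exp_avg'], denom, value=-lr / bias_c1)
```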