re-introduce: stage3: efficient compute of scaled_global_grad_norm #5493
…eepspeedai#5256)" (deepspeedai#5461)" This reverts commit bc48371.
place err tensor on the same device as inf_or_nan
Hi @lekurile,
Hi @nelyahu, thank you for the PR. I've kicked off a test run here:
Thanks @lekurile, it seems to have passed. Can you confirm?
Yep, looks like it passed. I've approved the PR and am running all checks.
@@ -1409,7 +1409,7 @@ def complete_grad_norm_calculation_for_cpu_offload(self, params):
         norm_is_nan = total_norm.isnan()
         inf_or_nan = norm_is_nan.logical_or(norm_is_inf)

-        err = torch.tensor(-1.0, device=self.device, dtype=torch.float)
+        err = torch.tensor(-1.0, device=inf_or_nan.device, dtype=torch.float)
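For readers following the device change in this hunk, here is a minimal, self-contained sketch of the idea: build the `-1.0` sentinel on whatever device the inf/NaN mask lives on, so the two tensors can be combined without a device mismatch when the norm was computed on CPU (offload mode). The helper name `sanitize_norm` is hypothetical and not part of DeepSpeed. The sketch also uses `torch.where` to pick between the sentinel and the norm, which is a swapped-in choice for illustration and not necessarily what the surrounding DeepSpeed code does.

```python
import torch


def sanitize_norm(total_norm: torch.Tensor) -> torch.Tensor:
    """Return -1.0 if total_norm is inf or NaN, otherwise total_norm.

    Hypothetical helper for illustration only; not DeepSpeed API.
    """
    norm_is_inf = total_norm.isinf()
    norm_is_nan = total_norm.isnan()
    inf_or_nan = norm_is_nan.logical_or(norm_is_inf)

    # Create the sentinel on the mask's device rather than on a module-level
    # device, so this also works when the norm lives on CPU.
    err = torch.tensor(-1.0, device=inf_or_nan.device, dtype=torch.float)

    # Assumption: torch.where is used here for the selection; it stays on
    # device and does not require reading the value back to the host.
    return torch.where(inf_or_nan, err, total_norm.to(torch.float))


good = sanitize_norm(torch.tensor(3.0))
bad_inf = sanitize_norm(torch.tensor(float("inf")))
bad_nan = sanitize_norm(torch.tensor(float("nan")))
```

Because the result stays a tensor, the caller can defer any host-side check of the sentinel until it is actually needed.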
Does this branch also reintroduce the previous PR? Asking because the code changes there were different.
@lekurile Yes, this PR includes 2 commits (the original PR and the fix). This specific line change is part of the fix for the bug.
…eepspeedai#5493) reverting previous revert of this feature: nelyahu@bc48371 in addition, bug fix for offload mode.
Reverting the previous revert of this feature: nelyahu@bc48371. In addition, a bug fix for offload mode.