Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

re-introduce: stage3: efficient compute of scaled_global_grad_norm #5493

Merged
merged 2 commits into from
May 3, 2024

Conversation

nelyahu
Copy link
Contributor

@nelyahu nelyahu commented May 2, 2024

reverting previous revert of this feature:
nelyahu@bc48371
in addition,
bug fix for offload mode.

@nelyahu nelyahu requested review from tjruwase and mrwyattii as code owners May 2, 2024 06:30
@nelyahu
Copy link
Contributor Author

nelyahu commented May 2, 2024

Hi @lekurile ,
can you please run ds-chat coverage on this PR? i reproduced the issue was reproted in nelyahu@bc48371 and fixed it.
would like to get a pre-commit validation on this test suit.
CC: @tjruwase

@lekurile
Copy link
Contributor

lekurile commented May 2, 2024

Hi @lekurile , can you please run ds-chat coverage on this PR? i reproduced the issue was reproted in nelyahu@bc48371 and fixed it. would like to get a pre-commit validation on this test suit. CC: @tjruwase

Hi @nelyahu, thank you for the PR. I've kicked off a test run here:
https://github.com/microsoft/DeepSpeed/actions/runs/8927396023

@nelyahu
Copy link
Contributor Author

nelyahu commented May 2, 2024

Hi @lekurile , can you please run ds-chat coverage on this PR? i reproduced the issue was reproted in nelyahu@bc48371 and fixed it. would like to get a pre-commit validation on this test suit. CC: @tjruwase

Hi @nelyahu, thank you for the PR. I've kicked off a test run here: https://github.com/microsoft/DeepSpeed/actions/runs/8927396023

Thanks @lekurile , seems like it passed, can you confirm?

@lekurile
Copy link
Contributor

lekurile commented May 2, 2024

Hi @lekurile , can you please run ds-chat coverage on this PR? i reproduced the issue was reproted in nelyahu@bc48371 and fixed it. would like to get a pre-commit validation on this test suit. CC: @tjruwase

Hi @nelyahu, thank you for the PR. I've kicked off a test run here: https://github.com/microsoft/DeepSpeed/actions/runs/8927396023

Thanks @lekurile , seems like it passed, can you confirm?

Yep, looks like it passed, approved the PR and running all checks.

@lekurile lekurile self-requested a review May 2, 2024 19:09
@@ -1409,7 +1409,7 @@ def complete_grad_norm_calculation_for_cpu_offload(self, params):
norm_is_nan = total_norm.isnan()
inf_or_nan = norm_is_nan.logical_or(norm_is_inf)

err = torch.tensor(-1.0, device=self.device, dtype=torch.float)
err = torch.tensor(-1.0, device=inf_or_nan.device, dtype=torch.float)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this branch also reintroduce the previous PR? Asking because the code changes there were different.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lekurile Yes, this PR includes 2 commits (original PR, and the fix). this specific line change is part of the fix for the bug.

@lekurile lekurile added this pull request to the merge queue May 3, 2024
Merged via the queue into deepspeedai:master with commit 90793aa May 3, 2024
16 checks passed
umchand pushed a commit to umchand/DeepSpeed that referenced this pull request May 20, 2024
…eepspeedai#5493)

reverting previous revert of this feature:

nelyahu@bc48371
in addition,
bug fix for offload mode.
@nelyahu nelyahu deleted the offload_fix branch June 9, 2024 10:06
dbyoung18 pushed a commit to dbyoung18/DeepSpeed that referenced this pull request Jun 11, 2024
…eepspeedai#5493)

reverting previous revert of this feature:

nelyahu@bc48371
in addition,
bug fix for offload mode.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants