re-introduce: stage3: efficient compute of scaled_global_grad_norm #5493

nelyahu · 2024-05-02T06:30:51Z

This PR reverts the previous revert of this feature (nelyahu@bc48371).
In addition, it includes a bug fix for offload mode.

nelyahu · 2024-05-02T06:32:55Z

Hi @lekurile,
Can you please run the ds-chat coverage on this PR? I reproduced the issue that was reported in nelyahu@bc48371 and fixed it.
I would like to get pre-commit validation on this test suite.
CC: @tjruwase

lekurile · 2024-05-02T16:42:35Z

Hi @nelyahu, thank you for the PR. I've kicked off a test run here:
https://github.com/microsoft/DeepSpeed/actions/runs/8927396023

nelyahu · 2024-05-02T17:58:55Z

Thanks @lekurile , seems like it passed, can you confirm?

lekurile · 2024-05-02T19:09:04Z

Yep, looks like it passed, approved the PR and running all checks.

lekurile · 2024-05-02T19:10:39Z

deepspeed/runtime/zero/stage3.py

@@ -1409,7 +1409,7 @@ def complete_grad_norm_calculation_for_cpu_offload(self, params):
        norm_is_nan = total_norm.isnan()
        inf_or_nan = norm_is_nan.logical_or(norm_is_inf)

-        err = torch.tensor(-1.0, device=self.device, dtype=torch.float)
+        err = torch.tensor(-1.0, device=inf_or_nan.device, dtype=torch.float)


Does this branch also reintroduce the previous PR? Asking because the code changes there were different.

@lekurile Yes, this PR includes two commits: the original PR and the fix. This specific line change is part of the bug fix.
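For anyone following the device fix, here is a minimal sketch of the pattern. It is not the DeepSpeed implementation: `mask_invalid_norm` is a hypothetical helper, and the final `torch.where` step is an assumption about how `err` gets consumed, since the diff excerpt above does not show that line. The only point is that `err` is created on `inf_or_nan.device`, so it lives on the same device as the tensors it is combined with, wherever the norm was computed.

```python
import torch


def mask_invalid_norm(total_norm: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: return -1.0 when the norm is inf or NaN."""
    norm_is_inf = total_norm.isinf()
    norm_is_nan = total_norm.isnan()
    inf_or_nan = norm_is_nan.logical_or(norm_is_inf)

    # Create the sentinel on the mask's device rather than on a fixed
    # module-level device, so both operands are always co-located.
    err = torch.tensor(-1.0, device=inf_or_nan.device, dtype=torch.float)

    # Assumption: the sentinel replaces an invalid norm. torch.where is
    # used here for clarity; the real function may combine them differently.
    return torch.where(inf_or_nan, err, total_norm)


if __name__ == "__main__":
    print(mask_invalid_norm(torch.tensor(3.0)))
    print(mask_invalid_norm(torch.tensor(float("inf"))))
```

Deriving the device from an existing operand, instead of from `self.device`, keeps the helper correct regardless of whether the norm tensor ends up on the CPU or on an accelerator.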

nelyahu added 2 commits May 1, 2024 12:46

Revert "Revert "stage3: efficient compute of scaled_global_grad_norm (d…

63a89be

…eepspeedai#5256)" (deepspeedai#5461)" This reverts commit bc48371.

fix for complete_grad_norm_calc in stage3

95aee34

place err tensor on the same device as inf_or_nan

nelyahu requested review from tjruwase and mrwyattii as code owners May 2, 2024 06:30

lekurile self-requested a review May 2, 2024 19:09

lekurile approved these changes May 3, 2024

lekurile added this pull request to the merge queue May 3, 2024

Merged via the queue into deepspeedai:master with commit 90793aa May 3, 2024
16 checks passed

umchand pushed a commit to umchand/DeepSpeed that referenced this pull request May 20, 2024

re-introduce: stage3: efficient compute of scaled_global_grad_norm (d…

dd1597c

…eepspeedai#5493) reverting previous revert of this feature: nelyahu@bc48371 in addition, bug fix for offload mode.

lihe07 mentioned this pull request May 22, 2024

[BUG] Version >0.14.0 leads to RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! #5538

Closed

nelyahu deleted the offload_fix branch June 9, 2024 10:06
