Commit
Revert "stage3: efficient compute of scaled_global_grad_norm (deepspeedai#5256)" (deepspeedai#5461)

This reverts commit 54c0687 due to deepspeedai#5256 causing bugs when the ZeRO3 + ZeRO Offload features are enabled. This bug was discovered due to failures in the DS Chat CI workflow.

Failing tests across CI failures:

| Failing Test Name |
| --- |
| test_ds_chat[zero3--offload-] |
| test_ds_chat[zero3--offload-lora] |
| test_ds_chat[zero3-he-offload-] |
| test_ds_chat[zero3-he-offload-lora] |

Error message:
```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cpu!
```

It seems that `torch.stack()` or `torch.norm()` is having issues when the offload feature is enabled and tensors are split between CPU/GPU, however this is just an initial guess and would require more investigation.

@nelyahu Since you are the original author of the PR, if you have some bandwidth, any help here is greatly appreciated!

After reverting this commit, all tests pass in the DS Chat CI workflow: https://github.com/microsoft/DeepSpeed/actions/runs/8824064414/job/24225802763

@tjruwase for context.
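To illustrate the suspected failure mode, here is a minimal sketch of combining per-group gradient norms when some of them may live on the CPU and others on a GPU. It is not the DeepSpeed code and not the fix from the follow-up PR; the helper name `global_grad_norm` and its `device` parameter are hypothetical. It assumes the problem is that `torch.stack()` rejects inputs on different devices, and works around that by moving every norm to one device first.

```python
import torch


def global_grad_norm(group_norms, device=None):
    """Combine per-group L2 norms into a single global L2 norm.

    Hypothetical helper for illustration only; not part of DeepSpeed.
    """
    if device is None:
        # Assumption: default to the device of the first norm.
        device = group_norms[0].device
    # torch.stack() needs all inputs on the same device, so move each
    # scalar norm there before stacking.
    on_device = [n.detach().to(device, dtype=torch.float32) for n in group_norms]
    # The L2 norm of the per-group L2 norms equals the global L2 norm.
    return torch.linalg.vector_norm(torch.stack(on_device), ord=2)


# Example with two per-group norms of 3 and 4, combined on the CPU.
norms = [torch.tensor(3.0), torch.tensor(4.0)]
total = global_grad_norm(norms, device=torch.device("cpu"))
```

With offload enabled, some entries of `group_norms` could be CPU tensors while others are on `cuda`; passing them to `torch.stack()` directly would raise a device-mismatch `RuntimeError` like the one above, if the initial guess is right.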
bc48371
Hi @lekurile,
Sorry, we were out of office for the Passover holiday.
I was able to reproduce the issue locally and fix it.
I will re-introduce the PR with the fix later today.