Empty grad fix #291

jeffra · 2020-07-15T18:39:50Z

This fixes a case where the number of empty (None) gradients differs across ranks. Example: rank 0 has 2 parameters with gradients, while rank 1 has 1 parameter with a gradient and 1 parameter whose grad is None. This made the all-reduce imbalanced in size, since we were previously ignoring all grads that were None. Now we instead pad all None gradients with zero tensors. Unfortunately, this imbalance did not cause a crash; instead it caused hangs at the first place that tried to access the reduced gradient data itself.
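The size mismatch described above can be illustrated with a small, framework-free sketch. This is not the code in deepspeed/pt/deepspeed_light.py: `Param`, `flatten_skipping_none`, `flatten_with_zero_padding`, and `all_reduce_sum` are hypothetical names, plain Python lists stand in for tensors, and the toy all-reduce raises on a size mismatch where a real collective may instead hang.

```python
from typing import List, Optional


class Param:
    """Hypothetical stand-in for a parameter: a fixed size and an optional gradient."""

    def __init__(self, numel: int, grad: Optional[List[float]] = None):
        self.numel = numel
        self.grad = grad


def flatten_skipping_none(params: List[Param]) -> List[float]:
    """Old behavior (as described): parameters whose grad is None contribute nothing."""
    flat: List[float] = []
    for p in params:
        if p.grad is not None:
            flat.extend(p.grad)
    return flat


def flatten_with_zero_padding(params: List[Param]) -> List[float]:
    """New behavior (as described): a None grad is replaced by zeros of the parameter's size."""
    flat: List[float] = []
    for p in params:
        flat.extend(p.grad if p.grad is not None else [0.0] * p.numel)
    return flat


def all_reduce_sum(buffers: List[List[float]]) -> List[float]:
    """Toy element-wise sum across ranks; every rank must supply the same buffer size."""
    sizes = {len(b) for b in buffers}
    if len(sizes) != 1:
        raise ValueError(f"buffer sizes differ across ranks: {sorted(sizes)}")
    return [sum(vals) for vals in zip(*buffers)]


# Rank 0 has gradients for both parameters; rank 1 has a None grad for the second.
rank0 = [Param(3, [1.0, 2.0, 3.0]), Param(2, [4.0, 5.0])]
rank1 = [Param(3, [10.0, 20.0, 30.0]), Param(2, None)]

# Skipping None grads leaves the ranks with buffers of different sizes.
skipped_sizes = [len(flatten_skipping_none(r)) for r in (rank0, rank1)]

# Zero padding keeps the buffers aligned, and the zeros do not change the sum.
padded = [flatten_with_zero_padding(r) for r in (rank0, rank1)]
reduced = all_reduce_sum(padded)
```

Because the padding is all zeros, the reduced values for the padded parameter come only from the ranks that actually produced a gradient for it.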

deepspeed/pt/deepspeed_light.py

tests/unit/test_fp16.py

jeffra added 2 commits July 15, 2020 18:39

empty grad fix

3b19ea7

add unit tests for empty grad

c63ba91

jeffra requested review from tjruwase and samyam July 15, 2020 19:19

frankseide reviewed Jul 15, 2020

deepspeed/pt/deepspeed_light.py

samyam reviewed Jul 15, 2020

deepspeed/pt/deepspeed_light.py

samyam approved these changes Jul 15, 2020

tjruwase reviewed Jul 15, 2020

tests/unit/test_fp16.py (outdated)

tjruwase approved these changes Jul 15, 2020

jeffra added 2 commits July 15, 2020 12:52

Merge branch 'master' into jeffra/empty-grad

618d212

update to fix tests and address comments

97c247e

jeffra merged commit 376818e into master Jul 15, 2020

jeffra deleted the jeffra/empty-grad branch July 15, 2020 21:15

jeffra mentioned this pull request Aug 13, 2020

Attach empty grad to its param to ensure it's copied after reduction #316

Merged
