Describe the bug
When training CIFAR10 with BF16 following deepspeedai/DeepSpeedExamples#651, I encounter accuracy loss under the following conditions:
To Reproduce
Steps to reproduce the behavior:
Run run_ds.sh with multiple accelerators on the system.
Observe accuracy loss with multiple ranks: with 2 ranks, accuracy is 52%, well below the 57% of 1 rank; with 8 ranks, accuracy drops to around 43%.
This can be reproduced with two CUDA cards. The 8-rank result was observed on a WIP CPU training branch; it should also be reproducible on a system with 8 CUDA cards.
When the data type is changed to FP32 (on the CPU system), there is no issue with either 2 or 8 ranks. When switching to ZeRO stage 2, there is also no accuracy issue.
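For reference, the two knobs being toggled above live in the DeepSpeed config. The snippet below is a sketch using the standard ds_config schema (`bf16.enabled` and `zero_optimization.stage`); the actual file consumed by run_ds.sh and the batch-size value here may differ:

```json
{
  "train_batch_size": 16,
  "bf16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 0
  }
}
```

Setting `"enabled": false` here (falling back to FP32) or `"stage": 2` corresponds to the two configurations that do not show the accuracy loss.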
Expected behavior
BF16 training accuracy with multiple ranks should match the single-rank result.
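For context on why BF16 is sensitive here: BF16 keeps float32's exponent range but only 8 mantissa bits, so naively accumulating many small terms in BF16 loses magnitude. The sketch below is pure Python, not DeepSpeed code; it simulates BF16 by truncating the low 16 bits of a float32 encoding, and only illustrates why reduced-precision accumulation (e.g. during gradient reduction across more ranks) is a plausible suspect.

```python
import struct

def to_bf16(x: float) -> float:
    """Round a Python float to bfloat16 precision by keeping only the
    top 16 bits of its float32 encoding (truncating the low mantissa bits)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    bits &= 0xFFFF0000  # drop the low 16 bits of the float32 encoding
    (y,) = struct.unpack("<f", struct.pack("<I", bits))
    return y

# Accumulate many small "gradient" contributions in simulated bf16 vs. fp32.
n, grad = 10000, 1e-3
acc_bf16 = 0.0
for _ in range(n):
    acc_bf16 = to_bf16(acc_bf16 + to_bf16(grad))
acc_fp32 = n * grad  # true sum in full precision: 10.0

# bf16 accumulation stalls once the increment falls below one bf16 ulp,
# ending far below the true sum of 10.0.
print(acc_fp32, acc_bf16)
```

This is the classic "small addend swallowed by large accumulator" effect: once the running sum grows, a BF16 ulp exceeds the per-step increment and further additions are lost. Frameworks typically avoid this by accumulating or reducing in FP32, which may be what differs between the single-rank, ZeRO-2, and multi-rank BF16 paths here.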
ds_report output
[2023-07-18 12:16:47,983] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cpu (auto detect)
[2023-07-18 12:16:48,129] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cpu (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
deepspeed_not_implemented [NO] ....... [OKAY]
deepspeed_ccl_comm ..... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/gma/anaconda3/envs/dscpu/lib/python3.11/site-packages/torch']
torch version .................... 2.0.1+cu117
deepspeed install path ........... ['/home/gma/DeepSpeed/deepspeed']
deepspeed info ................... 0.9.4+046afced, 046afced, master
deepspeed wheel compiled w. ...... torch 2.0
Screenshots
Finished Training
Finished Training
Finished Training
Finished Training
Finished Training
Finished Training
Finished Training
Finished Training
GroundTruth: cat ship ship plane
GroundTruth: cat ship ship plane
GroundTruth: cat ship ship plane
Predicted: cat plane ship plane
Predicted: cat plane ship plane
GroundTruth: cat ship ship plane
Predicted: cat plane ship plane
GroundTruth: cat ship ship plane
GroundTruth: cat ship ship plane
Predicted: cat plane ship plane
Predicted: cat plane ship plane
Predicted: cat plane ship plane
GroundTruth: cat ship ship plane
Predicted: cat plane ship plane
GroundTruth: cat ship ship plane
Predicted: cat plane ship plane
Accuracy of the network on the 10000 test images: 43 %
Accuracy of the network on the 10000 test images: 43 %
Accuracy of the network on the 10000 test images: 43 %
Accuracy of the network on the 10000 test images: 43 %
Accuracy of the network on the 10000 test images: 43 %
Accuracy of the network on the 10000 test images: 43 %
Accuracy of the network on the 10000 test images: 43 %
Accuracy of the network on the 10000 test images: 43 %
System info (please complete the following information):
6.4.0-rc2-2023-05-17-intel-next+ #1 SMP PREEMPT_DYNAMIC Wed May 17 15:36:48 PDT 2023 x86_64 x86_64 x86_64 GNU/Linux
Launcher context
With DeepSpeed launcher
Docker context
No.