[BUG] BFloat16 training accuracy issue with zero 0/1 (the more rank, the less accuracy) in CIFAR10 #3979

Open
delock opened this issue Jul 18, 2023 · 0 comments
Labels
bug Something isn't working training

Collaborator

delock commented Jul 18, 2023

Describe the bug
When using BF16 to train CIFAR10 following deepspeedai/DeepSpeedExamples#651, I see accuracy loss when all of the following conditions hold:

  1. BF16 training
  2. ZeRO stage 0 or 1
  3. world_size >= 2
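
For readers less familiar with BF16, the sketch below illustrates how little precision the format carries. It is not a diagnosis of this bug: the link to cross-rank reductions is an assumption made only for illustration, and the helper names `to_bf16` and `bf16_sum` are hypothetical, not DeepSpeed APIs. BF16 keeps 8 bits of significand precision, so near 1.0 the spacing between representable values is 2^-7, and small addends can vanish entirely from a running sum.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (ties to even).

    The result is returned as a Python float. NaN and overflow
    handling are omitted to keep the sketch short.
    """
    # Reinterpret the float32 encoding of x as a 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 is the upper 16 bits of float32; round the lower 16 away.
    bias = 0x7FFF + ((bits >> 16) & 1)
    rounded = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", rounded))[0]


def bf16_sum(values):
    """Accumulate values, rounding to bfloat16 after every addition."""
    total = 0.0
    for v in values:
        total = to_bf16(total + to_bf16(v))
    return total


# Hypothetical values: one large term followed by many small ones.
values = [1.0] + [0.001] * 100
print(bf16_sum(values))  # running sum kept in bfloat16
print(sum(values))       # same sum in double precision
```

In this toy case every small addend is rounded away, so the BF16 running sum never moves off 1.0 while the double-precision sum reaches roughly 1.1. Whether a similar effect contributes to the accuracy drop reported here is not established by this issue.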

To Reproduce
Steps to reproduce the behavior:

  1. Check out the PR "Enable non-CUDA device for CIFAR10 and HelloDeepSpeed training example" (DeepSpeedExamples#651).
  2. Go to DeepSpeedExamples/training/cifar/.
  3. In ds_config.json, change 'fp16' to 'bf16'.
  4. Run run_ds.sh with multiple accelerators on the system.
  5. Observe the accuracy loss with multiple ranks. With 2 ranks, accuracy is 52%, well below the 57% seen with 1 rank. With 8 ranks, accuracy drops to around 43%.
  6. The issue can be reproduced on two CUDA cards. The 8-rank result comes from a WIP CPU training branch, and should also be reproducible on a system with 8 CUDA cards.
  7. When the data type is changed to fp32 (on the CPU system), there is no issue with either 2 or 8 ranks. When ZeRO stage 2 is used, there is no accuracy issue either.
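
Step 3 and the conditions above can be sketched as a small config transformation. This is a minimal sketch under stated assumptions: the `example` dict is hypothetical and does not reproduce the actual contents of ds_config.json in the CIFAR example, and `to_bf16_config` is an illustrative helper, not part of DeepSpeed. The `bf16`, `fp16`, and `zero_optimization` sections with their `enabled` and `stage` keys are standard DeepSpeed config keys.

```python
import copy


def to_bf16_config(config: dict, zero_stage: int = 1) -> dict:
    """Return a copy of a DeepSpeed-style config that uses bf16 instead of fp16."""
    new_config = copy.deepcopy(config)
    # Drop the fp16 section entirely so both precisions are never enabled at once.
    new_config.pop("fp16", None)
    new_config["bf16"] = {"enabled": True}
    # The issue is reported for ZeRO stage 0 and 1, and not for stage 2.
    new_config.setdefault("zero_optimization", {})["stage"] = zero_stage
    return new_config


# Hypothetical starting config, for illustration only.
example = {
    "train_batch_size": 16,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 0},
}

bf16_example = to_bf16_config(example, zero_stage=1)
print(bf16_example)
```

The helper copies the input rather than mutating it, which makes it easy to generate one config per ZeRO stage when comparing stages 0, 1, and 2 side by side.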

Expected behavior
BF16 training accuracy with multiple ranks should match the accuracy of a single rank.

ds_report output

[2023-07-18 12:16:47,983] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cpu (auto detect)
[2023-07-18 12:16:48,129] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cpu (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
deepspeed_not_implemented  [NO] ....... [OKAY]
deepspeed_ccl_comm ..... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/gma/anaconda3/envs/dscpu/lib/python3.11/site-packages/torch']
torch version .................... 2.0.1+cu117
deepspeed install path ........... ['/home/gma/DeepSpeed/deepspeed']
deepspeed info ................... 0.9.4+046afced, 046afced, master
deepspeed wheel compiled w. ...... torch 2.0

Screenshots

Finished Training
Finished Training
Finished Training
Finished Training
Finished Training
Finished Training
Finished Training
Finished Training
GroundTruth:    cat  ship  ship plane
GroundTruth:    cat  ship  ship plane
GroundTruth:    cat  ship  ship plane
Predicted:    cat plane  ship plane
Predicted:    cat plane  ship plane
GroundTruth:    cat  ship  ship plane
Predicted:    cat plane  ship plane
GroundTruth:    cat  ship  ship plane
GroundTruth:    cat  ship  ship plane
Predicted:    cat plane  ship plane
Predicted:    cat plane  ship plane
Predicted:    cat plane  ship plane
GroundTruth:    cat  ship  ship plane
Predicted:    cat plane  ship plane
GroundTruth:    cat  ship  ship plane
Predicted:    cat plane  ship plane
Accuracy of the network on the 10000 test images: 43 %
Accuracy of the network on the 10000 test images: 43 %
Accuracy of the network on the 10000 test images: 43 %
Accuracy of the network on the 10000 test images: 43 %
Accuracy of the network on the 10000 test images: 43 %
Accuracy of the network on the 10000 test images: 43 %
Accuracy of the network on the 10000 test images: 43 %
Accuracy of the network on the 10000 test images: 43 %
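
Since every rank prints its own accuracy line, comparing runs by eye gets tedious as the rank count grows. The following is a minimal sketch for extracting the per-rank accuracies from captured output and comparing them against a single-rank baseline. The function names are hypothetical, and the regular expression assumes the exact line format shown in the output above.

```python
import re

# Matches lines such as:
#   Accuracy of the network on the 10000 test images: 43 %
ACCURACY_RE = re.compile(
    r"Accuracy of the network on the (\d+) test images: (\d+) %"
)


def rank_accuracies(log_text: str) -> list:
    """Return every accuracy percentage found in the log, in order."""
    return [int(m.group(2)) for m in ACCURACY_RE.finditer(log_text)]


def accuracy_drop(baseline: int, log_text: str) -> int:
    """Return baseline minus the lowest accuracy reported in the log."""
    accuracies = rank_accuracies(log_text)
    if not accuracies:
        raise ValueError("no accuracy lines found in log")
    return baseline - min(accuracies)


# Eight ranks reporting 43 %, as in the output above.
sample_log = "\n".join(
    ["Accuracy of the network on the 10000 test images: 43 %"] * 8
)

print(rank_accuracies(sample_log))
# 57 is the single-rank accuracy quoted in the reproduction steps.
print(accuracy_drop(57, sample_log))
```

The regular expression finds matches anywhere in the text, so it also tolerates interleaved output from multiple ranks as long as each accuracy line itself is intact.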

System info (please complete the following information):

  • OS: 6.4.0-rc2-2023-05-17-intel-next+ #1 SMP PREEMPT_DYNAMIC Wed May 17 15:36:48 PDT 2023 x86_64 x86_64 x86_64 GNU/Linux
  • 1 machine with 2 RTX3090 cards / 1 machine with 2 SPR 48 core configured with SNC4
  • Python version: 3.11.3

Launcher context
With DeepSpeed launcher

Docker context
No.
