
Can zero_optimization stage 2 work with pipeline parallelism? #568

Closed

ghosthamlet opened this issue Dec 2, 2020 · 10 comments

Comments

@ghosthamlet (Contributor) commented Dec 2, 2020

This is my config:

{
  "train_batch_size": 2,
  "train_micro_batch_size_per_gpu": 1,
  "zero_optimization": {
    "stage": 2,
    "cpu_offload": false,
    "contiguous_gradients": false,
    "overlap_comm": false,
    "reduce_bucket_size": 5000000,
    "allgather_bucket_size": 5000000
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "steps_per_print": 10,
  "wall_clock_breakdown": false
}
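As background on the two batch-size fields in this config: DeepSpeed derives the number of gradient-accumulation micro-steps from train_batch_size, train_micro_batch_size_per_gpu, and the data-parallel degree. A minimal sketch of that relation (the function name is mine for illustration, not a DeepSpeed API):

```python
def gradient_accumulation_steps(train_batch_size: int,
                                micro_batch_per_gpu: int,
                                dp_degree: int) -> int:
    # DeepSpeed requires:
    #   train_batch_size == micro_batch_per_gpu * grad_accum_steps * dp_degree
    # so grad_accum_steps is implied by the other three values.
    per_step = micro_batch_per_gpu * dp_degree
    if train_batch_size % per_step != 0:
        raise ValueError("train_batch_size must be divisible by "
                         "micro_batch_per_gpu * dp_degree")
    return train_batch_size // per_step

# With the config above and a single data-parallel replica of the pipeline:
print(gradient_accumulation_steps(2, 1, 1))  # 2 micro-batches per optimizer step
```

With the pipeline engine, these micro-batches are what train_batch() pushes through the stages before each optimizer step.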

Is it because pipeline parallelism already partitions the optimizer and gradient state, so there is no need for zero_optimization to partition them as well? (After checking the code, the answer is no: the pipeline partitions the whole model across stages, including layers, activations, gradients, and optimizer state, while zero_optimization can further partition the gradients and optimizer state within each pipeline stage.) But if they fail to work together, how can cpu_offload be used?

The error message is:

Traceback (most recent call last):
  File "train.py", line 163, in <module>
    train_pipe(args)
  File "train.py", line 151, in train_pipe
    loss = engine.train_batch()
  File "/home/ps/.local/lib/python3.6/site-packages/deepspeed/runtime/pipe/engine.py", line 268, in train_batch
    self._exec_schedule(sched)
  File "/home/ps/.local/lib/python3.6/site-packages/deepspeed/runtime/pipe/engine.py", line 1194, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
  File "/home/ps/.local/lib/python3.6/site-packages/deepspeed/runtime/pipe/engine.py", line 622, in _exec_backward_pass
    torch.autograd.backward(tensors=(outputs, ), grad_tensors=(grad_tensors, ))
  File "/home/ps/.local/lib/python3.6/site-packages/torch/autograd/__init__.py", line 127, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: The parameter 11 has already been reduced. Gradient computed twice for this partition. Multiple gradient reduction is currently not supported

When contiguous_gradients is set to true, the error message is:

Traceback (most recent call last):
  File "train.py", line 163, in <module>
    train_pipe(args)
  File "train.py", line 151, in train_pipe
    loss = engine.train_batch()
  File "/home/ps/.local/lib/python3.6/site-packages/deepspeed/runtime/pipe/engine.py", line 268, in train_batch
    self._exec_schedule(sched)
  File "/home/ps/.local/lib/python3.6/site-packages/deepspeed/runtime/pipe/engine.py", line 1194, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
  File "/home/ps/.local/lib/python3.6/site-packages/deepspeed/runtime/pipe/engine.py", line 622, in _exec_backward_pass
    torch.autograd.backward(tensors=(outputs, ), grad_tensors=(grad_tensors, ))
  File "/home/ps/.local/lib/python3.6/site-packages/torch/autograd/__init__.py", line 127, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: 'NoneType' object is not subscriptable

With ZeRO stage 1, the pipeline works fine.
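For reference, the working stage-1 run used the same config as above with only the "stage" field changed (a sketch, assuming no other fields were modified):

{
  "train_batch_size": 2,
  "train_micro_batch_size_per_gpu": 1,
  "zero_optimization": {
    "stage": 1,
    "cpu_offload": false,
    "contiguous_gradients": false,
    "overlap_comm": false,
    "reduce_bucket_size": 5000000,
    "allgather_bucket_size": 5000000
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "steps_per_print": 10,
  "wall_clock_breakdown": false
}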

Environment:
python 3.6
torch 1.6.0
deepspeed 0.3.7

@StellaAthena

Have you tried dropping "reduce_bucket_size": 5000000 and "allgather_bucket_size": 5000000 from your config? I had to do that to get it working.

@ghosthamlet (Contributor, Author) commented Jan 15, 2021

Thank you very much, I will try it.

@StellaAthena

No problem! You can find my code here.

@ghosthamlet (Contributor, Author)

Thanks, your gpt-neox is a wonderful project.

@ghosthamlet (Contributor, Author)

Maybe fixed by #677

@StellaAthena

> Thanks, your gpt-neox is a wonderful project.

Why thank you! I'm quite excited about it :)

@gongjingcs

Hi @StellaAthena, according to my experiments, ZeRO stage 2 is not compatible with pipeline parallelism. Can you verify that your example really runs with ZeRO stage 2 successfully?

@StellaAthena

> According to my experiment, zero2 is not compatible with pipeline, can you verify that your example really uses zero stage 2 successfully?

Did you use my code or did you use the official DS code? I believe that the GPT-NeoX + DeeperSpeed codebase has some necessary bug fixes that are yet to be integrated into the master branch of DeepSpeed.

@gongjingcs

> Did you use my code or did you use the official DS code?

I used this example: https://github.com/microsoft/DeepSpeedExamples/tree/master/Megatron-LM-v1.1.5-3D_parallelism

@gongjingcs commented Mar 15, 2021

> I believe that the GPT-NeoX + DeeperSpeed codebase has some necessary bug fixes that are yet to be integrated into the master branch of DeepSpeed.

Could you please provide your DeepSpeed bug-fix pull requests here? Do you mean this one: https://github.com/microsoft/DeepSpeed/pull/677/files ? Thanks a lot.

3 participants