Is it true that zero_optimization stage 2 can't work with pipeline? #568
Comments
Have you tried dropping "reduce_bucket_size": 5000000?
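(For reference, the suggestion amounts to removing that key so DeepSpeed falls back to its default reduce bucket size. A minimal sketch of the resulting zero_optimization block, with every other value assumed rather than taken from the poster's actual config:)

```json
{
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": false
  }
}
```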
Thank you very much, I will try it.
No problem! You can find my code here.
Thanks, your gpt-neox is a wonderful project.
Maybe fixed by #677
Why thank you! I'm quite excited about it :)
Hi StellaAthena, according to my experiments, ZeRO stage 2 is not compatible with pipeline. Can you verify that your example really uses ZeRO stage 2 successfully?
Did you use my code or did you use the official DS code? I believe that the GPT-NeoX + DeeperSpeed codebase has some necessary bug fixes that are yet to be integrated into the master branch of DeepSpeed.
Hi, I used this example: https://github.com/microsoft/DeepSpeedExamples/tree/master/Megatron-LM-v1.1.5-3D_parallelism
Could you please link your DeepSpeed bug-fix pull requests here? Do you mean this one: https://github.com/microsoft/DeepSpeed/pull/677/files ? Thanks a lot.
This is my config:
Is it because the pipeline already partitions the optimizer state and gradients, so there is no need for zero_optimization partitioning? (After checking the code, the answer is no: pipeline parallelism partitions the whole model across stages, including layers, activations, gradients, and optimizer state, while zero_optimization can further partition gradients and optimizer state within each pipeline stage.) But if they fail to work together, then how can cpu_offload be used?
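(As an illustration only, since the original config did not survive in this thread: a DeepSpeed 0.3.x-style config that tries to combine pipeline parallelism with ZeRO stage 2 and cpu_offload would look roughly like the sketch below. All values here are assumptions, not the poster's settings.)

```json
{
  "train_batch_size": 32,
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 2,
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 2,
    "contiguous_gradients": true,
    "overlap_comm": true,
    "reduce_bucket_size": 5000000,
    "cpu_offload": true
  }
}
```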
The error message is:
When contiguous_gradients is true, the error message is:
When the ZeRO stage is 1, the pipeline works fine.
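(For comparison, the stage 1 setting that reportedly works alongside the pipeline engine; again only a sketch with assumed values:)

```json
{
  "zero_optimization": {
    "stage": 1,
    "reduce_bucket_size": 5000000
  }
}
```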
Environment:
python 3.6
torch 1.6.0
deepspeed 0.3.7