deprecate sharded_ddp training argument (#24825)
* deprecate fairscale's ShardedDDP

* fix code style

* roll back

* deprecate the `sharded_ddp` training argument

---------

Co-authored-by: jihuazhong <[email protected]>
statelesshz and jihuazhong authored Jul 17, 2023
1 parent 5bb4430 commit 8ba26c1
Showing 4 changed files with 11 additions and 148 deletions.
ISSUES.md: 2 changes (1 addition, 1 deletion)
@@ -158,7 +158,7 @@ You are not required to read the following guidelines before opening an issue.
--do_train --n_train 500 --num_train_epochs 1 \
--per_device_train_batch_size 1 --freeze_embeds \
--src_lang en_XX --tgt_lang ro_RO --task translation \
- --fp16 --sharded_ddp
+ --fp16
```

If you don't break it up, readers have to scroll horizontally, which often makes it difficult to quickly see what's happening.
docs/source/en/main_classes/trainer.md: 148 changes (4 additions, 144 deletions)
@@ -295,7 +295,7 @@
The [`Trainer`] has been extended to support libraries that may dramatically improve your training
time and fit much bigger models.

- Currently it supports third party solutions, [DeepSpeed](https://github.com/microsoft/DeepSpeed), [PyTorch FSDP](https://pytorch.org/docs/stable/fsdp.html) and [FairScale](https://github.com/facebookresearch/fairscale/), which implement parts of the paper [ZeRO: Memory Optimizations
+ Currently it supports third party solutions, [DeepSpeed](https://github.com/microsoft/DeepSpeed) and [PyTorch FSDP](https://pytorch.org/docs/stable/fsdp.html), which implement parts of the paper [ZeRO: Memory Optimizations
Toward Training Trillion Parameter Models, by Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He](https://arxiv.org/abs/1910.02054).

This support is new and experimental as of this writing. While the support for DeepSpeed and PyTorch FSDP is active and we welcome issues around it, we no longer support the FairScale integration, since it has been integrated into PyTorch main (see the [PyTorch FSDP integration](#pytorch-fully-sharded-data-parallel)).
@@ -304,15 +304,14 @@

### CUDA Extension Installation Notes

- As of this writing, both FairScale and Deepspeed require compilation of CUDA C++ code, before they can be used.
+ As of this writing, Deepspeed requires compilation of CUDA C++ code before it can be used.

- While all installation issues should be dealt with through the corresponding GitHub Issues of [FairScale](https://github.com/facebookresearch/fairscale/issues) and [Deepspeed](https://github.com/microsoft/DeepSpeed/issues), there are a few common issues that one may encounter while building
+ While all installation issues should be dealt with through the corresponding GitHub Issues of [Deepspeed](https://github.com/microsoft/DeepSpeed/issues), there are a few common issues that one may encounter while building
any PyTorch extension that needs to build CUDA extensions.

- Therefore, if you encounter a CUDA-related build issue while doing one of the following or both:
+ Therefore, if you encounter a CUDA-related build issue while doing the following:

```bash
- pip install fairscale
pip install deepspeed
```
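One frequent cause of such CUDA-related build failures is a mismatch between the system CUDA toolkit that compiles the extensions and the CUDA version your PyTorch build was compiled against. As a minimal sketch (assuming `torch` is installed; `nvcc` may or may not be on your `PATH`), you can compare the two before attempting the build:

```python
import shutil
import subprocess

import torch

# CUDA version this PyTorch build was compiled against (None for CPU-only builds).
print("PyTorch built with CUDA:", torch.version.cuda)

# System CUDA toolkit that nvcc would use to compile the DeepSpeed extensions.
if shutil.which("nvcc") is not None:
    result = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
    print(result.stdout.strip())
else:
    print("nvcc not found on PATH; install a full CUDA toolkit before building the extensions")
```

If the two report clearly different versions, fix that first before digging into more exotic causes.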

@@ -410,145 +409,6 @@ should find `gcc-7` (and `g++7`) and then the build will succeed.

As always make sure to edit the paths in the example to match your situation.

### FairScale

<Tip warning={true}>

This integration is not supported anymore, we recommend you either use DeepSpeed or PyTorch FSDP.

</Tip>

By integrating [FairScale](https://github.com/facebookresearch/fairscale/) the [`Trainer`]
provides support for the following features from [the ZeRO paper](https://arxiv.org/abs/1910.02054):

1. Optimizer State Sharding
2. Gradient Sharding
3. Model Parameters Sharding (new and very experimental)
4. CPU offload (new and very experimental)

You will need at least two GPUs to use this feature.


**Installation**:

Install the library via pypi:

```bash
pip install fairscale
```

or via `transformers`' `extras`:

```bash
pip install transformers[fairscale]
```

(available starting from `transformers==4.6.0`) or find more details on [FairScale's GitHub page](https://github.com/facebookresearch/fairscale/#installation).

If you're still struggling with the build, first make sure to read [CUDA Extension Installation Notes](#zero-install-notes).

If that still didn't resolve the build issue, here are a few more ideas.

`fairscale` seems to have an issue with the build isolation feature recently introduced by pip. If you have a problem
with it, you may want to try one of:

```bash
pip install fairscale --no-build-isolation .
```

or:

```bash
git clone https://github.com/facebookresearch/fairscale/
cd fairscale
rm -r dist build
python setup.py bdist_wheel
pip uninstall -y fairscale
pip install dist/fairscale-*.whl
```

`fairscale` also has issues with building against pytorch-nightly, so if you use it you may have to try one of:

```bash
pip uninstall -y fairscale; pip install fairscale --pre \
-f https://download.pytorch.org/whl/nightly/cu110/torch_nightly \
--no-cache --no-build-isolation
```

or:

```bash
pip install -v --disable-pip-version-check . \
-f https://download.pytorch.org/whl/nightly/cu110/torch_nightly --pre
```

Of course, adjust the URLs to match the CUDA version you use.

If, after trying everything suggested, you still encounter build issues, please open an issue on the [FairScale](https://github.com/facebookresearch/fairscale/issues) GitHub repository.



**Usage**:

To use the first version of Sharded data-parallelism, add `--sharded_ddp simple` to the command line arguments, and
make sure you have added the distributed launcher `-m torch.distributed.launch --nproc_per_node=NUMBER_OF_GPUS_YOU_HAVE` if you haven't been using it already.

For example here is how you could use it for `run_translation.py` with 2 GPUs:

```bash
python -m torch.distributed.launch --nproc_per_node=2 examples/pytorch/translation/run_translation.py \
--model_name_or_path t5-small --per_device_train_batch_size 1 \
--output_dir output_dir --overwrite_output_dir \
--do_train --max_train_samples 500 --num_train_epochs 1 \
--dataset_name wmt16 --dataset_config "ro-en" \
--source_lang en --target_lang ro \
--fp16 --sharded_ddp simple
```

Notes:

- This feature requires distributed training (so multiple GPUs).
- It is not implemented for TPUs.
- It works with `--fp16` too, to make things even faster.
- One of the main benefits of enabling `--sharded_ddp simple` is that it uses a lot less GPU memory, so you should be
able to use significantly larger batch sizes using the same hardware (e.g. 3x and even bigger) which should lead to
significantly shorter training time.

To use the second version of Sharded data-parallelism, add `--sharded_ddp zero_dp_2` or `--sharded_ddp zero_dp_3` to the command line arguments, and make sure you have added the distributed launcher `-m torch.distributed.launch --nproc_per_node=NUMBER_OF_GPUS_YOU_HAVE` if you haven't been using it already.

For example here is how you could use it for `run_translation.py` with 2 GPUs:

```bash
python -m torch.distributed.launch --nproc_per_node=2 examples/pytorch/translation/run_translation.py \
--model_name_or_path t5-small --per_device_train_batch_size 1 \
--output_dir output_dir --overwrite_output_dir \
--do_train --max_train_samples 500 --num_train_epochs 1 \
--dataset_name wmt16 --dataset_config "ro-en" \
--source_lang en --target_lang ro \
--fp16 --sharded_ddp zero_dp_2
```

`zero_dp_2` is an optimized version of the simple wrapper, while `zero_dp_3` fully shards model weights,
gradients and optimizer states.

Both are compatible with adding `cpu_offload` to enable ZeRO-offload (activate it like this: `--sharded_ddp "zero_dp_2 cpu_offload"`).

Notes:

- This feature requires distributed training (so multiple GPUs).
- It is not implemented for TPUs.
- It works with `--fp16` too, to make things even faster.
- The `cpu_offload` additional option requires `--fp16`.
- This is an area of active development, so make sure you have a source install of fairscale to use this feature as
some bugs you encounter may have been fixed there already.

Known caveats:

- This feature is incompatible with `--predict_with_generate` in the _run_translation.py_ script.
- Using `--sharded_ddp zero_dp_3` requires wrapping each layer of the model in fairscale's special `FullyShardedDataParallel` container. It should be used with the option `auto_wrap` if you are not
  doing this yourself: `--sharded_ddp "zero_dp_3 auto_wrap"`.

### PyTorch Fully Sharded Data parallel

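For readers migrating away from the removed FairScale options above, the sketch below shows one rough way the old `--sharded_ddp` values can be expressed with the `fsdp` argument of [`TrainingArguments`]. The mapping and the exact option strings are an assumption based on the FSDP integration available around this release, not something introduced by this commit, and may differ between versions:

```python
from transformers import TrainingArguments

# Rough FSDP counterpart of the removed `--sharded_ddp "zero_dp_3 auto_wrap"` setup:
# full parameter/gradient/optimizer-state sharding with automatic module wrapping.
# FSDP itself still requires multiple GPUs and a distributed launcher at train time.
args = TrainingArguments(
    output_dir="output_dir",      # placeholder path
    per_device_train_batch_size=1,
    fsdp="full_shard auto_wrap",  # approximately ZeRO stage 3
)

# Approximate equivalents for the other removed values:
#   "zero_dp_2"   -> fsdp="shard_grad_op auto_wrap"  (optimizer + gradient sharding)
#   "cpu_offload" -> add "offload" to the fsdp string, e.g. "full_shard offload auto_wrap"
```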
docs/source/en/perf_train_gpu_many.md: 3 changes (0 additions, 3 deletions)
@@ -241,7 +241,6 @@
Implementations:

- [DeepSpeed](https://www.deepspeed.ai/features/#the-zero-redundancy-optimizer) ZeRO-DP stages 1+2+3
- - [Fairscale](https://github.com/facebookresearch/fairscale/#optimizer-state-sharding-zero) ZeRO-DP stages 1+2+3
- [`transformers` integration](main_classes/trainer#trainer-integrations)

## Naive Model Parallelism (Vertical) and Pipeline Parallelism
@@ -294,7 +293,6 @@

Traditional Pipeline API solutions:
- PyTorch
- - FairScale
- DeepSpeed
- Megatron-LM

@@ -312,7 +310,6 @@

Implementations:
- [Pytorch](https://pytorch.org/docs/stable/pipeline.html) (initial support in pytorch-1.8, and progressively getting improved in 1.9 and more so in 1.10). Some [examples](https://github.com/pytorch/pytorch/blob/master/benchmarks/distributed/pipeline/pipe.py)
- - [FairScale](https://fairscale.readthedocs.io/en/latest/tutorials/pipe.html)
- [DeepSpeed](https://www.deepspeed.ai/tutorials/pipeline/)
- [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) has an internal implementation - no API.
- [Varuna](https://github.com/microsoft/varuna)
src/transformers/training_args.py: 6 changes (6 additions, 0 deletions)
@@ -1460,6 +1460,12 @@ def __post_init__(self):
" during training"
)

+ if not (self.sharded_ddp == "" or not self.sharded_ddp):
+     warnings.warn(
+         "using `sharded_ddp` is deprecated and will be removed in version 4.33"
+         " of 🤗 Transformers. Use `fsdp` instead",
+         FutureWarning,
+     )
if isinstance(self.sharded_ddp, bool):
    self.sharded_ddp = "simple" if self.sharded_ddp else ""
if isinstance(self.sharded_ddp, str):
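With this change, constructing [`TrainingArguments`] with any non-empty `sharded_ddp` value still works but now emits a `FutureWarning` pointing at `fsdp`. A minimal sketch of the new behaviour (the `output_dir` value is a placeholder, and the warning text comes from the code above):

```python
import warnings

from transformers import TrainingArguments

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Deprecated path: still parsed as before, but now warns.
    TrainingArguments(output_dir="out", sharded_ddp="simple")

# Print any FutureWarning messages raised during construction.
print([str(w.message) for w in caught if issubclass(w.category, FutureWarning)])
```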
