deprecate sharded_ddp training argument (#24825)
* deprecate fairscale's ShardedDDP

* fix code style

* roll back

* deprecate the `sharded_ddp` training argument

---------

Co-authored-by: jihuazhong <[email protected]>
statelesshz and jihuazhong authored Jul 17, 2023
1 parent 5bb4430 commit 8ba26c1
Showing 4 changed files with 11 additions and 148 deletions.
ISSUES.md: 2 changes (1 addition, 1 deletion)
@@ -158,7 +158,7 @@ You are not required to read the following guidelines before opening an issue.
--do_train --n_train 500 --num_train_epochs 1 \
--per_device_train_batch_size 1 --freeze_embeds \
--src_lang en_XX --tgt_lang ro_RO --task translation \
- --fp16 --sharded_ddp
+ --fp16
```

If you don't break it up, readers have to scroll horizontally, which often makes it difficult to quickly see what's happening.
docs/source/en/main_classes/trainer.md: 148 changes (4 additions, 144 deletions)
@@ -295,7 +295,7 @@
The [`Trainer`] has been extended to support libraries that may dramatically improve your training
time and fit much bigger models.

- Currently it supports third party solutions, [DeepSpeed](https://github.com/microsoft/DeepSpeed), [PyTorch FSDP](https://pytorch.org/docs/stable/fsdp.html) and [FairScale](https://github.com/facebookresearch/fairscale/), which implement parts of the paper [ZeRO: Memory Optimizations
+ Currently it supports third party solutions, [DeepSpeed](https://github.com/microsoft/DeepSpeed) and [PyTorch FSDP](https://pytorch.org/docs/stable/fsdp.html), which implement parts of the paper [ZeRO: Memory Optimizations
Toward Training Trillion Parameter Models, by Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He](https://arxiv.org/abs/1910.02054).

This support is new and experimental as of this writing. While the support for DeepSpeed and PyTorch FSDP is active and we welcome issues around it, we no longer support the FairScale integration, since it has been integrated into PyTorch main (see the [PyTorch FSDP integration](#pytorch-fully-sharded-data-parallel)).
@@ -304,15 +304,14 @@

### CUDA Extension Installation Notes

- As of this writing, both FairScale and Deepspeed require compilation of CUDA C++ code, before they can be used.
+ As of this writing, Deepspeed requires compilation of CUDA C++ code before it can be used.

- While all installation issues should be dealt with through the corresponding GitHub Issues of [FairScale](https://github.com/facebookresearch/fairscale/issues) and [Deepspeed](https://github.com/microsoft/DeepSpeed/issues), there are a few common issues that one may encounter while building
+ While all installation issues should be dealt with through the corresponding GitHub Issues of [Deepspeed](https://github.com/microsoft/DeepSpeed/issues), there are a few common issues that one may encounter while building
any PyTorch extension that needs to build CUDA extensions.

- Therefore, if you encounter a CUDA-related build issue while doing one of the following or both:
+ Therefore, if you encounter a CUDA-related build issue while doing the following:

```bash
- pip install fairscale
pip install deepspeed
```
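One frequent cause of such CUDA-related build failures is a mismatch between the system CUDA toolkit that compiles the extensions and the CUDA version your PyTorch build was compiled against. As a minimal sketch (assuming `torch` is installed; `nvcc` may or may not be on your `PATH`), you can compare the two before attempting the build:

```python
import shutil
import subprocess

import torch

# CUDA version this PyTorch build was compiled against (None for CPU-only builds).
print("PyTorch built with CUDA:", torch.version.cuda)

# System CUDA toolkit that nvcc would use to compile the DeepSpeed extensions.
if shutil.which("nvcc") is not None:
    result = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
    print(result.stdout.strip())
else:
    print("nvcc not found on PATH; install a full CUDA toolkit before building the extensions")
```

If the two report clearly different versions, fix that first before digging into more exotic causes.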

@@ -410,145 +409,6 @@ should find `gcc-7` (and `g++7`) and then the build will succeed.

As always make sure to edit the paths in the example to match your situation.

### FairScale

<Tip warning={true}>

This integration is not supported anymore, we recommend you either use DeepSpeed or PyTorch FSDP.

</Tip>

By integrating [FairScale](https://github.com/facebookresearch/fairscale/) the [`Trainer`]
provides support for the following features from [the ZeRO paper](https://arxiv.org/abs/1910.02054):

1. Optimizer State Sharding
2. Gradient Sharding
3. Model Parameters Sharding (new and very experimental)
4. CPU offload (new and very experimental)

You will need at least two GPUs to use this feature.


**Installation**:

Install the library via pypi:

```bash
pip install fairscale
```

or via `transformers`' `extras`:

```bash
pip install transformers[fairscale]
```

(available starting from `transformers==4.6.0`) or find more details on [FairScale's GitHub page](https://github.com/facebookresearch/fairscale/#installation).

If you're still struggling with the build, first make sure to read [CUDA Extension Installation Notes](#zero-install-notes).

If that still didn't resolve the build issue, here are a few more ideas.

`fairscale` seems to have an issue with the build isolation feature recently introduced by pip. If you have a problem
with it, you may want to try one of:

```bash
pip install fairscale --no-build-isolation .
```

or:

```bash
git clone https://github.com/facebookresearch/fairscale/
cd fairscale
rm -r dist build
python setup.py bdist_wheel
pip uninstall -y fairscale
pip install dist/fairscale-*.whl
```

`fairscale` also has issues with building against pytorch-nightly, so if you use it you may have to try one of:

```bash
pip uninstall -y fairscale; pip install fairscale --pre \
-f https://download.pytorch.org/whl/nightly/cu110/torch_nightly \
--no-cache --no-build-isolation
```

or:

```bash
pip install -v --disable-pip-version-check . \
-f https://download.pytorch.org/whl/nightly/cu110/torch_nightly --pre
```

Of course, adjust the URLs to match the CUDA version you use.

If, after trying everything suggested, you still encounter build issues, please open an issue on the [FairScale](https://github.com/facebookresearch/fairscale/issues) GitHub repository.



**Usage**:

To use the first version of Sharded data-parallelism, add `--sharded_ddp simple` to the command line arguments, and
make sure you have added the distributed launcher `-m torch.distributed.launch --nproc_per_node=NUMBER_OF_GPUS_YOU_HAVE` if you haven't been using it already.

For example here is how you could use it for `run_translation.py` with 2 GPUs:

```bash
python -m torch.distributed.launch --nproc_per_node=2 examples/pytorch/translation/run_translation.py \
--model_name_or_path t5-small --per_device_train_batch_size 1 \
--output_dir output_dir --overwrite_output_dir \
--do_train --max_train_samples 500 --num_train_epochs 1 \
--dataset_name wmt16 --dataset_config "ro-en" \
--source_lang en --target_lang ro \
--fp16 --sharded_ddp simple
```

Notes:

- This feature requires distributed training (so multiple GPUs).
- It is not implemented for TPUs.
- It works with `--fp16` too, to make things even faster.
- One of the main benefits of enabling `--sharded_ddp simple` is that it uses a lot less GPU memory, so you should be
able to use significantly larger batch sizes using the same hardware (e.g. 3x and even bigger) which should lead to
significantly shorter training time.

To use the second version of Sharded data-parallelism, add `--sharded_ddp zero_dp_2` or `--sharded_ddp zero_dp_3` to the command line arguments, and make sure you have added the distributed launcher `-m torch.distributed.launch --nproc_per_node=NUMBER_OF_GPUS_YOU_HAVE` if you haven't been using it already.

For example here is how you could use it for `run_translation.py` with 2 GPUs:

```bash
python -m torch.distributed.launch --nproc_per_node=2 examples/pytorch/translation/run_translation.py \
--model_name_or_path t5-small --per_device_train_batch_size 1 \
--output_dir output_dir --overwrite_output_dir \
--do_train --max_train_samples 500 --num_train_epochs 1 \
--dataset_name wmt16 --dataset_config "ro-en" \
--source_lang en --target_lang ro \
--fp16 --sharded_ddp zero_dp_2
```

`zero_dp_2` is an optimized version of the simple wrapper, while `zero_dp_3` fully shards model weights,
gradients and optimizer states.

Both are compatible with adding `cpu_offload` to enable ZeRO-offload (activate it like this: `--sharded_ddp "zero_dp_2 cpu_offload"`).

Notes:

- This feature requires distributed training (so multiple GPUs).
- It is not implemented for TPUs.
- It works with `--fp16` too, to make things even faster.
- The `cpu_offload` additional option requires `--fp16`.
- This is an area of active development, so make sure you have a source install of fairscale to use this feature as
some bugs you encounter may have been fixed there already.

Known caveats:

- This feature is incompatible with `--predict_with_generate` in the _run_translation.py_ script.
- Using `--sharded_ddp zero_dp_3` requires wrapping each layer of the model in fairscale's special `FullyShardedDataParallel` container. It should be used with the option `auto_wrap` if you are not
  doing this yourself: `--sharded_ddp "zero_dp_3 auto_wrap"`.

### PyTorch Fully Sharded Data parallel

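For readers migrating away from the removed FairScale options above, the sketch below shows one rough way the old `--sharded_ddp` values can be expressed with the `fsdp` argument of [`TrainingArguments`]. The mapping and the exact option strings are an assumption based on the FSDP integration available around this release, not something introduced by this commit, and may differ between versions:

```python
from transformers import TrainingArguments

# Rough FSDP counterpart of the removed `--sharded_ddp "zero_dp_3 auto_wrap"` setup:
# full parameter/gradient/optimizer-state sharding with automatic module wrapping.
# FSDP itself still requires multiple GPUs and a distributed launcher at train time.
args = TrainingArguments(
    output_dir="output_dir",      # placeholder path
    per_device_train_batch_size=1,
    fsdp="full_shard auto_wrap",  # approximately ZeRO stage 3
)

# Approximate equivalents for the other removed values:
#   "zero_dp_2"   -> fsdp="shard_grad_op auto_wrap"  (optimizer + gradient sharding)
#   "cpu_offload" -> add "offload" to the fsdp string, e.g. "full_shard offload auto_wrap"
```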
docs/source/en/perf_train_gpu_many.md: 3 changes (0 additions, 3 deletions)
@@ -241,7 +241,6 @@
Implementations:

- [DeepSpeed](https://www.deepspeed.ai/features/#the-zero-redundancy-optimizer) ZeRO-DP stages 1+2+3
- - [Fairscale](https://github.com/facebookresearch/fairscale/#optimizer-state-sharding-zero) ZeRO-DP stages 1+2+3
- [`transformers` integration](main_classes/trainer#trainer-integrations)

## Naive Model Parallelism (Vertical) and Pipeline Parallelism
@@ -294,7 +293,6 @@

Traditional Pipeline API solutions:
- PyTorch
- - FairScale
- DeepSpeed
- Megatron-LM

@@ -312,7 +310,6 @@

Implementations:
- [Pytorch](https://pytorch.org/docs/stable/pipeline.html) (initial support in pytorch-1.8, and progressively getting improved in 1.9 and more so in 1.10). Some [examples](https://github.com/pytorch/pytorch/blob/master/benchmarks/distributed/pipeline/pipe.py)
- - [FairScale](https://fairscale.readthedocs.io/en/latest/tutorials/pipe.html)
- [DeepSpeed](https://www.deepspeed.ai/tutorials/pipeline/)
- [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) has an internal implementation - no API.
- [Varuna](https://github.com/microsoft/varuna)
src/transformers/training_args.py: 6 changes (6 additions, 0 deletions)
@@ -1460,6 +1460,12 @@ def __post_init__(self):
" during training"
)

+ if not (self.sharded_ddp == "" or not self.sharded_ddp):
+     warnings.warn(
+         "using `sharded_ddp` is deprecated and will be removed in version 4.33"
+         " of 🤗 Transformers. Use `fsdp` instead",
+         FutureWarning,
+     )
if isinstance(self.sharded_ddp, bool):
    self.sharded_ddp = "simple" if self.sharded_ddp else ""
if isinstance(self.sharded_ddp, str):
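With this change, constructing [`TrainingArguments`] with any non-empty `sharded_ddp` value still works but now emits a `FutureWarning` pointing at `fsdp`. A minimal sketch of the new behaviour (the `output_dir` value is a placeholder, and the warning text comes from the code above):

```python
import warnings

from transformers import TrainingArguments

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Deprecated path: still parsed as before, but now warns.
    TrainingArguments(output_dir="out", sharded_ddp="simple")

# Print any FutureWarning messages raised during construction.
print([str(w.message) for w in caught if issubclass(w.category, FutureWarning)])
```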
