Update zero.md tutorial #495

Merged 4 commits on Nov 11, 2020
4 changes: 2 additions & 2 deletions docs/_tutorials/zero-offload.md
@@ -15,17 +15,17 @@ For this tutorial, we will configure a 10 billion parameter GPT-2 model using th
We need to make changes to the Megatron-LM launch script and to the DeepSpeed configuration json.

### Megatron-LM GPT-2 launch script changes
We need to apply two changes to the launch script for the DeepSpeed Megatron-LM GPT-2 model. The first change is to configure a 10B parameter GPT-2 model, which can be achieved by the following set of changes:
We need to apply two changes to the launch script for the DeepSpeed Megatron-LM GPT-2 model. The first change is to configure a 10B parameter GPT-2 model with activation checkpointing enabled, which can be achieved by the following set of changes:

```bash
--model-parallel-size 1 \
--num-layers 50 \
--hidden-size 4096 \
--num-attention-heads 32 \
--batch-size 10 \
--d \
--deepspeed_config ds_zero_offload.config \
--cpu_optimizer \
--checkpoint-activations
```
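
The **_--deepspeed_config_** flag above points the script at a DeepSpeed configuration file named `ds_zero_offload.config`, whose contents are not shown in this hunk. As a hedged sketch only (the key names follow DeepSpeed's `zero_optimization` schema, while the batch size and offload settings are illustrative assumptions), such a file could be generated as follows:

```python
import json

# Hedged sketch only: the real ds_zero_offload.config used by the tutorial is not shown
# in this hunk. The key names follow DeepSpeed's zero_optimization schema; every value
# below is an illustrative assumption.
ds_config = {
    "train_micro_batch_size_per_gpu": 10,  # assumed to match --batch-size in the script above
    "fp16": {"enabled": True},             # ZeRO-Offload is typically run with fp16 training
    "zero_optimization": {
        "stage": 2,                        # ZeRO-Offload builds on the stage 2 optimizations
        "cpu_offload": True,               # keep optimizer states and updates in CPU memory
        "contiguous_gradients": True,
        "overlap_comm": True
    }
}

# Write the file that the launch script's --deepspeed_config flag refers to.
with open("ds_zero_offload.config", "w") as f:
    json.dump(ds_config, f, indent=4)
```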

Most of the flags in the changes above should be familiar if you have stepped through the Megatron-LM [tutorial](/tutorials/megatron/), except for **_--cpu_optimizer_**. This flag instructs the model script to pass a CPU-based Adam optimizer, rather than a GPU-based one, to DeepSpeed as the client optimizer. Using this flag when training with ZeRO-Offload is essential for correct operation of the DeepSpeed engine.
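
For **_--cpu_optimizer_** specifically, a minimal sketch of the client-side pattern the paragraph above describes is shown below. This is not the actual Megatron-LM code; the model, hyperparameters, and config path are placeholders, and the script is assumed to run under the `deepspeed` launcher.

```python
import argparse
import torch
import deepspeed
from deepspeed.ops.adam import DeepSpeedCPUAdam

# Hedged sketch, not the Megatron-LM script itself: what --cpu_optimizer amounts to on
# the client side. A CPU-based Adam is constructed and handed to DeepSpeed as the client
# optimizer instead of a GPU-based one. The model and hyperparameters are placeholders,
# and this is meant to be launched with the deepspeed launcher.
model = torch.nn.Linear(4096, 4096)  # stand-in for the real 10B-parameter GPT-2 model

optimizer = DeepSpeedCPUAdam(model.parameters(), lr=1.5e-4, weight_decay=0.01)

args = argparse.Namespace(deepspeed_config="ds_zero_offload.config", local_rank=0)
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args,            # args.deepspeed_config points at the ZeRO-Offload config file
    model=model,
    optimizer=optimizer,  # client-provided CPU optimizer, as --cpu_optimizer requires
)
```
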
7 changes: 3 additions & 4 deletions docs/_tutorials/zero.md
@@ -27,7 +27,6 @@ We demonstrate the benefits of ZeRO stage 1 by showing that it enables data para
--hidden-size 1600 \
--num-attention-heads 16 \
--batch-size 1 \
--d \
--deepspeed_config ds_zero_stage_1.config \
```
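
The stage 1 configuration JSON referenced by **_--deepspeed_config_** is collapsed in this diff. As a hedged sketch of what such a file typically contains (the micro-batch size and the `reduce_bucket_size` value are illustrative assumptions):

```python
# Hedged sketch of a minimal ZeRO stage 1 configuration; the micro-batch size and the
# reduce_bucket_size value are illustrative assumptions.
ds_zero_stage_1_config = {
    "train_micro_batch_size_per_gpu": 1,  # assumed to match --batch-size above
    "zero_optimization": {
        "stage": 1,                       # partition optimizer states across data-parallel ranks
        "reduce_bucket_size": 500000000   # bucket size used when reducing gradients
    }
}
```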

@@ -53,16 +52,16 @@ As seen above, we set two fields in the **zero_optimization** key. Specifically
From the nvidia-smi screenshot above we can see that only GPUs 0--7 are being used for training the model. With ZeRO stage 1 we can further reduce the per-device memory consumption by increasing the data parallelism degree. These memory savings can be leveraged to increase the model size and/or the batch size. In contrast, such benefits are not possible with data parallelism alone.

### Training a 10B Parameter GPT-2 model
ZeRO stage 2 optimizations further increases the size of models that can be trained using data parallelism. We show this training a model with 10B parameters using 32 V100 GPUs. First, we need to configure a 10B parameter model. This can be done by applying the following GPT-2 model configuration changes to the DeepSpeed launch script.
ZeRO stage 2 optimizations further increase the size of models that can be trained using data parallelism. We show this by training a model with 10B parameters using 32 V100 GPUs. First, we need to configure a 10B parameter model with activation checkpointing enabled. This can be done by applying the following GPT-2 model configuration changes to the DeepSpeed launch script.

```bash
--model-parallel-size 1 \
--num-layers 50 \
--hidden-size 4096 \
--num-attention-heads 32 \
--batch-size 1 \
--d \
--deepspeed_config ds_zero_stage_2.config \
--checkpoint-activations
```

Next, we need to update the DeepSpeed json configuration, as shown below, to enable ZeRO stage 2 optimizations:
@@ -80,7 +79,7 @@ Next, we need to update the DeepSpeed json configuration, as shown below, to ena
}
```

In the above changes, we have set the _stage_ field to 2, and configured other optimization knobs that are available in ZeRO stage 2. For example, we have enabled _contiguous_gradients_ to reduce memory fragmenation during backward pass. A full description of these optimization knobs is available [here](/docs/config-json/#zero-optimizations-for-fp16-training). With these changes, we can now run the launch the training run.
In the above changes, we have set the _stage_ field to 2, and configured other optimization knobs that are available in ZeRO stage 2. For example, we have enabled _contiguous_gradients_ to reduce memory fragmentation during the backward pass. A full description of these optimization knobs is available [here](/docs/config-json/#zero-optimizations-for-fp16-training). With these changes, we can now launch the training run.
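
For reference, here is a hedged sketch of a stage 2 `zero_optimization` section with the knobs discussed above; the bucket sizes are illustrative assumptions, not the contents of `ds_zero_stage_2.config` in the repository.

```python
# Hedged sketch of the zero_optimization section for ZeRO stage 2; bucket sizes are
# illustrative assumptions, not the values in the repository's ds_zero_stage_2.config.
zero_stage_2_section = {
    "zero_optimization": {
        "stage": 2,                         # partition gradients as well as optimizer states
        "contiguous_gradients": True,       # reduce memory fragmentation in the backward pass
        "overlap_comm": True,               # overlap gradient reduction with backward computation
        "reduce_bucket_size": 500000000,    # assumed
        "allgather_bucket_size": 500000000  # assumed
    }
}
```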

Here is a screenshot of the training log:
