docs: Add Aurora / Megatron-DeepSpeed docs #623

Merged · 13 commits · Jan 19, 2025
114 changes: 114 additions & 0 deletions docs/aurora/data-science/frameworks/megatron-deepspeed.md
@@ -0,0 +1,114 @@
# Megatron-DeepSpeed

[Megatron-DeepSpeed](https://github.com/argonne-lcf/Megatron-DeepSpeed) is a
scalable, highly performant library for training large language models on _any_ GPU[^any].

In particular, it retains the core 4D parallelism[^4d] functionality of the
[NVIDIA / `Megatron-LM`](https://github.com/NVIDIA/Megatron-LM)
library, while leveraging the
[microsoft / `DeepSpeed`](https://github.com/microsoft/DeepSpeed) library for efficient
scaling and [🍋 saforem2 / `ezpz`](https://github.com/saforem2/ezpz)
for automated device + backend selection.
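
These four degrees multiply together: as a rough sketch (assuming each
degree occupies its own process-group dimension, with Ulysses-style
sequence parallelism counted separately), the product of the parallel
degrees must equal the total number of ranks:

```bash
# Illustrative arithmetic only: the 4D parallel degrees multiply to the world size.
# e.g. DP=12, TP=2, PP=3, SP=4 --> 12 * 2 * 3 * 4 = 288 ranks
DP=12 TP=2 PP=3 SP=4
echo $(( DP * TP * PP * SP ))  # 288
```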

[^4d]: 4D parallelism refers to data (DP), tensor (TP), pipeline (PP), and
sequence (SP) parallelism degrees of freedom.

[^any]: Megatron-DeepSpeed is designed to work on any GPU, including NVIDIA
GPUs (NCCL), AMD GPUs (RCCL), and Intel XPUs (CCL).

## Getting Started

1. Clone the
[argonne-lcf / `Megatron-DeepSpeed`](https://github.com/argonne-lcf/Megatron-DeepSpeed)
repository:

```bash
git clone https://github.com/argonne-lcf/Megatron-DeepSpeed
cd Megatron-DeepSpeed
```

1. Set up your environment:

```bash
export PBS_O_WORKDIR=$(pwd)
source <(curl -s https://raw.githubusercontent.com/saforem2/ezpz/refs/heads/main/src/ezpz/bin/utils.sh)
ezpz_setup_env
```
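
   To sanity-check the result (assuming `ezpz_setup_env` activates a
   virtual environment with PyTorch available), something like the
   following should work:

   ```bash
   # Optional sanity check (assumes ezpz_setup_env activated a venv with PyTorch)
   which python3  # should resolve inside the activated virtual environment
   python3 -c 'import torch; print(torch.__version__)'
   ```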

1. Install dependencies:

1. 🍋 [saforem2 / `ezpz`](https://github.com/saforem2/ezpz):

```bash
python3 -m pip install -e "git+https://github.com/saforem2/ezpz#egg=ezpz" --require-virtualenv
```

1. [microsoft / `DeepSpeed`](https://github.com/microsoft/DeepSpeed):

```bash
python3 -m pip install deepspeed
```
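
   Optionally, verify that both packages import cleanly (a quick check,
   not part of the upstream instructions):

   ```bash
   # Quick import check for the freshly installed dependencies
   python3 -c 'import ezpz, deepspeed; print(deepspeed.__version__)'
   ```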

1. Launch training:

```bash
# Before launching, PBS_O_WORKDIR should point to the Megatron-DeepSpeed
# repository root, and the virtual environment in Megatron-DeepSpeed/venv
# should be activated.
TP=2 NLAYERS=10 DATA_FILE_LIST=ALCF/data-lists/aurora/books.txt bash train_aGPT_7B.sh
```

This will launch a distributed pre-training run with:

- `NLAYERS=10`: a Llama-style model with 10 layers

- `TP=2`: split across 2 tensor-parallel groups

- `DATA_FILE_LIST`: Using the
[Books corpus](https://github.com/argonne-lcf/Megatron-DeepSpeed/blob/main/ALCF/data-lists/aurora/books.txt)
of the Dolma dataset

??? note "Overridable Options"

This is a small subset of the overridable options.

The full list (as well as their default values) can be found in
[ALCF / `helpers.sh`](https://github.com/argonne-lcf/Megatron-DeepSpeed/blob/main/ALCF/helpers.sh).

- `DTYPE`: Data type
- `DATA_FILE_LIST`: Data file list
- `FFN_HIDDEN_SIZE`: Feedforward Neural Network projection size
- `GRAD_ACC_STEPS`: Gradient accumulation steps
- `HEADS`: Number of attention heads
- `HIDDEN`: Hidden size
- `MICRO_BATCH`: Micro batch size
- `NO_FLASH_ATTN`: Disable Flash Attention
- `NLAYERS`: Number of layers
- `NUM_KV_HEAD`: Number of key-value heads
- `OPT`: Optimizer
- `adam`
- `adam8bit`
- `adamw`
- `adamwschedulefree`
- `apex.adam`
- `apex.sgd`
- `ds.fusedlamb`
- `ds.onebitlamb`
- `galoreadamw`
- `galoreadamw8bit`
- `galoreadamw8bitperlayer`
- `ipex.fusedlamb`
- `ipex.lamb`
- `shampoo`
- `sgd`
- `sgdschedulefree`
- `sophiag`
- `PP`: Pipeline parallelism degree
- `SEQ`: Sequence length
- `SP`: Sequence parallelism (Ulysses) degree
- `TP`: Tensor parallelism degree
- `TRAIN_TOKENS`: Number of training tokens
- `TRAIN_ITERS`: Number of training iterations
- `USE_ACTIVATION_CHECKPOINTING`: Use activation checkpointing
- `WEIGHT_DECAY`: Weight decay
- `ZERO_STAGE`: ZeRO optimization stage
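
For example, a hypothetical run overriding several of these at once
might look like the following (the values are illustrative, not
recommended settings):

```bash
# Illustrative combination of overrides; values are examples only.
TP=4 PP=2 NLAYERS=32 HIDDEN=4096 HEADS=32 SEQ=4096 \
    MICRO_BATCH=1 GRAD_ACC_STEPS=8 ZERO_STAGE=1 \
    DATA_FILE_LIST=ALCF/data-lists/aurora/books.txt \
    bash train_aGPT_7B.sh
```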

1 change: 1 addition & 0 deletions mkdocs.yml
@@ -223,6 +223,7 @@ nav:
- Profiling: aurora/data-science/profiling_dl.md
- Frameworks:
- DeepSpeed: aurora/data-science/frameworks/deepspeed.md
- Megatron-DeepSpeed: aurora/data-science/frameworks/megatron-deepspeed.md
#- JAX: aurora/data-science/frameworks/jax.md
- PyTorch: aurora/data-science/frameworks/pytorch.md
- TensorFlow: aurora/data-science/frameworks/tensorflow.md