Merge pull request #623 from argonne-lcf/aurora-saforem2
docs: Add Aurora / `Megatron-DeepSpeed` docs
felker authored Jan 19, 2025
2 parents 3d724d2 + c0b2c81 commit 56402d1
Showing 2 changed files with 115 additions and 0 deletions.
114 changes: 114 additions & 0 deletions docs/aurora/data-science/frameworks/megatron-deepspeed.md
@@ -0,0 +1,114 @@
# Megatron-DeepSpeed

[Megatron-DeepSpeed](https://github.com/argonne-lcf/Megatron-DeepSpeed) is a
scalable, highly performant library for training large language models on _any_ GPU[^any].

In particular, it retains the core 4D parallelism[^4d] functionality of the
[NVIDIA / `Megatron-LM`](https://github.com/NVIDIA/Megatron-LM)
library, while leveraging the
[microsoft / `DeepSpeed`](https://github.com/microsoft/DeepSpeed) library for efficient
scaling and [🍋 saforem2 / `ezpz`](https://github.com/saforem2/ezpz)
for automated device + backend selection.
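
As a rough mental model, the product of the parallelism degrees has to fit within the total number of ranks in a job. The sketch below shows that accounting under the assumption that the data-parallel degree is simply whatever remains once the tensor-, pipeline-, and sequence-parallel degrees are fixed (an illustration only, not necessarily how the library derives it internally):

```bash
# Illustrative accounting only (an assumption, not taken from the library):
NGPUS=24            # e.g. 2 Aurora nodes x 12 XPU tiles per node
TP=2; PP=1; SP=1    # tensor / pipeline / sequence (Ulysses) parallel degrees
DP=$(( NGPUS / (TP * PP * SP) ))
echo "DP=${DP}"     # -> DP=12 data-parallel model replicas
```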

[^4d]: 4D parallelism refers to data (DP), tensor (TP), pipeline (PP), and
sequence (SP) parallelism degrees of freedom.

[^any]: Megatron-DeepSpeed is designed to work on any GPU, including NVIDIA
GPUs (NCCL), AMD GPUs (RCCL), and Intel XPUs (CCL).

## Getting Started

1. Clone the
[argonne-lcf / `Megatron-DeepSpeed`](https://github.com/argonne-lcf/Megatron-DeepSpeed)
repository:

```bash
git clone https://github.com/argonne-lcf/Megatron-DeepSpeed
cd Megatron-DeepSpeed
```

1. Set up your environment:

```bash
export PBS_O_WORKDIR=$(pwd)
source <(curl -s https://raw.githubusercontent.com/saforem2/ezpz/refs/heads/main/src/ezpz/bin/utils.sh)
ezpz_setup_env
```

1. Install dependencies:

1. 🍋 [saforem2 / `ezpz`](https://github.com/saforem2/ezpz):

```bash
python3 -m pip install -e "git+https://github.com/saforem2/ezpz#egg=ezpz" --require-virtualenv
```

1. [microsoft / `DeepSpeed`](https://github.com/microsoft/DeepSpeed):

```bash
python3 -m pip install deepspeed
```

1. Launch training:

```bash
# Before launching, `PBS_O_WORKDIR` should point at your Megatron-DeepSpeed clone
# and the virtual environment in Megatron-DeepSpeed/venv should be activated
# (a quick sanity check is sketched below).
TP=2 NLAYERS=10 DATA_FILE_LIST=ALCF/data-lists/aurora/books.txt bash train_aGPT_7B.sh
```

This will launch a distributed pre-training run with:

- `NLAYERS=10`: Llama-style model with 10 layers

- `TP=2`: 2-way tensor parallelism (each layer split across 2 ranks)

- `DATA_FILE_LIST`: Using the
[Books corpus](https://github.com/argonne-lcf/Megatron-DeepSpeed/blob/main/ALCF/data-lists/aurora/books.txt)
of the Dolma dataset
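
Before launching, it can help to confirm the two prerequisites noted in the comment above; a minimal sanity check (illustrative, assuming the clone and environment setup from the steps above) might look like:

```bash
# Illustrative sanity check; assumes the clone + environment setup from the steps above
echo "PBS_O_WORKDIR=${PBS_O_WORKDIR}"   # should point at your Megatron-DeepSpeed clone
which python3                           # should resolve inside the activated venv
python3 -c 'import deepspeed; print("deepspeed", deepspeed.__version__)'
python3 -c 'import ezpz; print("ezpz imported OK")'
```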

??? note "Overridable Options"

This is a small subset of the overridable options.

The full list (along with their default values) can be found in
[ALCF / `helpers.sh`](https://github.com/argonne-lcf/Megatron-DeepSpeed/blob/main/ALCF/helpers.sh);
an illustrative combination of overrides is sketched after this note.

- `DTYPE`: Data type
- `DATA_FILE_LIST`: Data file list
- `FFN_HIDDEN_SIZE`: Feed-forward (MLP) hidden size
- `GRAD_ACC_STEPS`: Gradient accumulation steps
- `HEADS`: Number of attention heads
- `HIDDEN`: Hidden size
- `MICRO_BATCH`: Micro batch size
- `NO_FLASH_ATTN`: Disable Flash Attention
- `NLAYERS`: Number of layers
- `NUM_KV_HEAD`: Number of key-value heads
- `OPT`: Optimizer
- `adam`
- `adam8bit`
- `adamw`
- `adamwschedulefree`
- `apex.adam`
- `apex.sgd`
- `ds.fusedlamb`
- `ds.onebitlamb`
- `galoreadamw`
- `galoreadamw8bit`
- `galoreadamw8bitperlayer`
- `ipex.fusedlamb`
- `ipex.lamb`
- `shampoo`
- `sgd`
- `sgdschedulefree`
- `sophiag`
- `PP`: Pipeline parallelism degree
- `SEQ`: Sequence length
- `SP`: Sequence parallelism (Ulysses) degree
- `TP`: Tensor parallelism degree
- `TRAIN_TOKENS`: Number of training tokens
- `TRAIN_ITERS`: Number of training iterations
- `USE_ACTIVATION_CHECKPOINTING`: Use activation checkpointing
- `WEIGHT_DECAY`: Weight decay
- `ZERO_STAGE`: ZeRO optimization stage
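
For example, a launch that overrides several of these options at once might look like the following (the values here are purely illustrative, not recommendations):

```bash
# Illustrative only: combine several overrides in one launch (values are not tuned)
TP=4 PP=1 SP=1 ZERO_STAGE=1 \
  NLAYERS=32 HIDDEN=4096 HEADS=32 NUM_KV_HEAD=8 SEQ=4096 \
  MICRO_BATCH=1 GRAD_ACC_STEPS=8 \
  DATA_FILE_LIST=ALCF/data-lists/aurora/books.txt \
  bash train_aGPT_7B.sh
```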

1 change: 1 addition & 0 deletions mkdocs.yml
@@ -223,6 +223,7 @@ nav:
- Profiling: aurora/data-science/profiling_dl.md
- Frameworks:
- DeepSpeed: aurora/data-science/frameworks/deepspeed.md
- Megatron-DeepSpeed: aurora/data-science/frameworks/megatron-deepspeed.md
#- JAX: aurora/data-science/frameworks/jax.md
- PyTorch: aurora/data-science/frameworks/pytorch.md
- TensorFlow: aurora/data-science/frameworks/tensorflow.md
