[docs] Add newer examples for AI tutorial and distributed training (#4509)

* Update tutorial and distributed training examples.

* Add examples link

* add rdvz
romilbhardwaj authored Dec 30, 2024
1 parent 13501e2 commit 3715be2
Showing 2 changed files with 49 additions and 47 deletions.
48 changes: 24 additions & 24 deletions docs/source/getting-started/tutorial.rst
@@ -2,19 +2,20 @@

Tutorial: AI Training
======================
This example uses SkyPilot to train a Transformer-based language model from HuggingFace.
This example uses SkyPilot to train a GPT-like model (inspired by Karpathy's `minGPT <https://github.com/karpathy/minGPT>`_) with Distributed Data Parallel (DDP) in PyTorch.

First, define a :ref:`task YAML <yaml-spec>` with the resource requirements, the setup commands,
We define a :ref:`task YAML <yaml-spec>` with the resource requirements, the setup commands,
and the commands to run:

.. code-block:: yaml
# dnn.yaml
# train.yaml
name: huggingface
name: minGPT-ddp
resources:
accelerators: V100:4
cpus: 4+
accelerators: L4:4 # Or A100:8, H100:8
# Optional: upload a working directory to remote ~/sky_workdir.
# Commands in "setup" and "run" will be executed under it.
@@ -30,38 +31,37 @@ and the commands to run:
# ~/.netrc: ~/.netrc
setup: |
set -e # Exit if any command failed.
git clone https://github.com/huggingface/transformers/ || true
cd transformers
pip install .
cd examples/pytorch/text-classification
pip install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
git clone --depth 1 https://github.com/pytorch/examples || true
cd examples
git filter-branch --prune-empty --subdirectory-filter distributed/minGPT-ddp
# SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5).
uv pip install -r requirements.txt "numpy<2" "torch==1.12.1+cu113" --extra-index-url https://download.pytorch.org/whl/cu113
run: |
set -e # Exit if any command failed.
cd transformers/examples/pytorch/text-classification
python run_glue.py \
--model_name_or_path bert-base-cased \
--dataset_name imdb \
--do_train \
--max_seq_length 128 \
--per_device_train_batch_size 32 \
--learning_rate 2e-5 \
--max_steps 50 \
--output_dir /tmp/imdb/ --overwrite_output_dir \
--fp16
cd examples/mingpt
export LOGLEVEL=INFO
echo "Starting minGPT-ddp training"
torchrun \
--nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
main.py
.. tip::

In the YAML, the ``workdir`` and ``file_mounts`` fields are commented out. To
learn about how to use them to mount local dirs/files or object store buckets
(S3, GCS, R2) into your cluster, see :ref:`sync-code-artifacts`.

.. tip::

The ``SKYPILOT_NUM_GPUS_PER_NODE`` environment variable is automatically set by SkyPilot to the number of GPUs per node. See :ref:`env-vars` for more.
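
As a rough illustration of how the ``run`` section uses this variable, the Python sketch below rebuilds the single-node ``torchrun`` command from an environment mapping. The helper name ``single_node_torchrun_command`` and the example value ``"4"`` are assumptions made for illustration only; they are not part of SkyPilot or of the tutorial YAML.

```python
import shlex


def single_node_torchrun_command(env, script="main.py"):
    """Return the torchrun argv for one node, mirroring the run section."""
    # SkyPilot sets SKYPILOT_NUM_GPUS_PER_NODE; environment values are strings.
    gpus_per_node = int(env["SKYPILOT_NUM_GPUS_PER_NODE"])
    return ["torchrun", f"--nproc_per_node={gpus_per_node}", script]


# Hypothetical environment matching the L4:4 request in train.yaml.
example_env = {"SKYPILOT_NUM_GPUS_PER_NODE": "4"}
command = single_node_torchrun_command(example_env)
print(shlex.join(command))
```

Passing ``os.environ`` instead of ``example_env`` would read the value that SkyPilot sets on the cluster.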

Then, launch training:

.. code-block:: console
$ sky launch -c lm-cluster dnn.yaml
$ sky launch -c mingpt train.yaml
This will provision the cheapest cluster with the required resources, execute the setup
commands, then execute the run commands.
48 changes: 25 additions & 23 deletions docs/source/running-jobs/distributed-jobs.rst
@@ -6,39 +6,40 @@ Distributed Multi-Node Jobs
SkyPilot supports multi-node cluster
provisioning and distributed execution on many nodes.

For example, here is a simple PyTorch Distributed training example:
For example, here is a simple example to train a GPT-like model (inspired by Karpathy's `minGPT <https://github.com/karpathy/minGPT>`_) across 2 nodes with Distributed Data Parallel (DDP) in PyTorch.

.. code-block:: yaml
:emphasize-lines: 6-6,21-21,23-26
:emphasize-lines: 6,19,23-24,26
name: resnet-distributed-app
name: minGPT-ddp
resources:
accelerators: A100:8
resources:
accelerators: A100:8
num_nodes: 2
num_nodes: 2
setup: |
pip3 install --upgrade pip
git clone https://github.com/michaelzhiluo/pytorch-distributed-resnet
cd pytorch-distributed-resnet
# SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5).
pip3 install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
mkdir -p data && mkdir -p saved_models && cd data && \
wget -c --quiet https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
tar -xvzf cifar-10-python.tar.gz
setup: |
git clone --depth 1 https://github.com/pytorch/examples || true
cd examples
git filter-branch --prune-empty --subdirectory-filter distributed/minGPT-ddp
# SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5).
uv pip install -r requirements.txt "numpy<2" "torch==1.12.1+cu113" --extra-index-url https://download.pytorch.org/whl/cu113
run: |
cd pytorch-distributed-resnet
run: |
cd examples/mingpt
export LOGLEVEL=INFO
MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
echo "Starting distributed training, head node: $MASTER_ADDR"
MASTER_ADDR=`echo "$SKYPILOT_NODE_IPS" | head -n1`
torchrun \
torchrun \
--nnodes=$SKYPILOT_NUM_NODES \
--master_addr=$MASTER_ADDR \
--nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
--node_rank=$SKYPILOT_NODE_RANK \
--master_port=12375 \
resnet_ddp.py --num_epochs 20
--master_addr=$MASTER_ADDR \
--node_rank=${SKYPILOT_NODE_RANK} \
--master_port=8008 \
main.py
In the above,

@@ -55,6 +56,7 @@ In the above,

ulimit -n 65535

You can find more `distributed training examples <https://github.com/skypilot-org/skypilot/tree/master/examples/distributed-pytorch>`_ (including `using the rdzv backend for PyTorch <https://github.com/skypilot-org/skypilot/blob/master/examples/distributed-pytorch/train-rdzv.yaml>`_) in our `GitHub repository <https://github.com/skypilot-org/skypilot/tree/master/examples>`_.
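
The multi-node ``run`` section in the YAML derives every ``torchrun`` argument from SkyPilot environment variables. The Python sketch below mirrors that shell logic so the mapping is easier to follow. The helper name ``multi_node_torchrun_command`` and the example IP addresses are hypothetical, and the sketch assumes ``SKYPILOT_NODE_IPS`` holds one IP per line with the head node first, as the ``head -n1`` call in the YAML implies.

```python
def multi_node_torchrun_command(env, master_port=8008, script="main.py"):
    """Return (argv, world_size) for this node, mirroring the run section."""
    # Equivalent of: MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
    node_ips = env["SKYPILOT_NODE_IPS"].splitlines()
    master_addr = node_ips[0]

    # Environment values are strings, so convert the counts explicitly.
    num_nodes = int(env["SKYPILOT_NUM_NODES"])
    gpus_per_node = int(env["SKYPILOT_NUM_GPUS_PER_NODE"])
    node_rank = int(env["SKYPILOT_NODE_RANK"])

    # torchrun starts nproc_per_node workers on each of the nnodes nodes.
    world_size = num_nodes * gpus_per_node

    argv = [
        "torchrun",
        f"--nnodes={num_nodes}",
        f"--nproc_per_node={gpus_per_node}",
        f"--master_addr={master_addr}",
        f"--node_rank={node_rank}",
        f"--master_port={master_port}",
        script,
    ]
    return argv, world_size


# Hypothetical values for the second of two A100:8 nodes.
example_env = {
    "SKYPILOT_NODE_IPS": "10.0.0.1\n10.0.0.2",
    "SKYPILOT_NUM_NODES": "2",
    "SKYPILOT_NUM_GPUS_PER_NODE": "8",
    "SKYPILOT_NODE_RANK": "1",
}
argv, world_size = multi_node_torchrun_command(example_env)
print(" ".join(argv))
print("world size:", world_size)
```

Every node runs the same command and differs only in ``--node_rank``, which is why the YAML can use a single ``run`` section for the whole cluster.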

Environment variables
-----------------------------------------