Docs: multi-node clarifications, and ssh into workers. #1363
@@ -9,35 +9,40 @@ provisioning and distributed execution on many VMs.
For example, here is a simple PyTorch Distributed training example:

.. code-block:: yaml
   :emphasize-lines: 6-6
nice! didn't know about this directive
   name: resnet-distributed-app

   resources:
     accelerators: V100:4

   num_nodes: 2

   setup: |
     pip3 install --upgrade pip
     git clone https://github.com/michaelzhiluo/pytorch-distributed-resnet
     cd pytorch-distributed-resnet
     # SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5).
     pip3 install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
     mkdir -p data && mkdir -p saved_models && cd data && \
       wget -c --quiet https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
     tar -xvzf cifar-10-python.tar.gz

   run: |
     cd pytorch-distributed-resnet

     num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
     master_addr=`echo "$SKYPILOT_NODE_IPS" | head -n1`
     python3 -m torch.distributed.launch --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
       --nnodes=$num_nodes --node_rank=${SKYPILOT_NODE_RANK} --master_addr=$master_addr \
       --master_port=8008 resnet_ddp.py --num_epochs 20

In the above, :code:`num_nodes: 2` specifies that this task is to be run on 2
nodes, with each node having 4 V100s.
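For instance (an illustrative variant, not part of the original page), scaling the same job out only requires changing that one highlighted line:

.. code-block:: yaml

   num_nodes: 4   # 4 nodes x 4 V100s each = 16 GPUs in total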

Environment variables
-----------------------------------------

SkyPilot exposes these environment variables that can be accessed in a task's ``run`` commands:
@@ -54,3 +59,42 @@

  :code:`run` command with :code:`echo $SKYPILOT_NODE_IPS >> ~/sky_node_ips`.
- :code:`SKYPILOT_NUM_GPUS_PER_NODE`: number of GPUs reserved on each node to execute the
  task; the same as the count in ``accelerators: <name>:<count>`` (rounded up if a fraction).
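For instance, a minimal ``run`` section that only prints these variables on every node might look like this (a sketch, not part of the original page):

.. code-block:: yaml

   run: |
     echo "My node rank: ${SKYPILOT_NODE_RANK}"
     echo "GPUs reserved on this node: ${SKYPILOT_NUM_GPUS_PER_NODE}"
     echo "All node IPs (one per line):"
     echo "${SKYPILOT_NODE_IPS}"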

Launching a multi-node task (new cluster)
-----------------------------------------

When using ``sky launch`` to launch a multi-node task on **a new cluster**, the following happens in sequence:

1. Nodes are provisioned. (barrier)
2. Workdir/file_mounts are synced to all nodes. (barrier)
3. ``setup`` commands are executed on all nodes. (barrier)
4. ``run`` commands are executed on all nodes.
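As an illustration (a sketch; ``task.yaml`` and the cluster name ``mycluster`` are placeholders, not from the original page), launching the example above on a fresh 2-node cluster is a single command:

.. code-block:: console

   $ sky launch -c mycluster task.yaml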

Launching a multi-node task (existing cluster)
----------------------------------------------

When using ``sky launch`` to launch a multi-node task on **an existing cluster**, the cluster may have more nodes than the current task's ``num_nodes`` requirement.

The following happens in sequence:

1. SkyPilot checks that the runtime on all nodes is up-to-date. (barrier)
2. Workdir/file_mounts are synced to all nodes. (barrier)
3. ``setup`` commands are executed on **all nodes** of the cluster. (barrier)
4. ``run`` commands are executed on **the subset of nodes** scheduled to execute the task, which may be fewer than the cluster size.
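For example (again a sketch with the placeholder cluster name ``mycluster``), re-launching the same task against the cluster created above reuses its existing nodes instead of provisioning new ones:

.. code-block:: console

   $ sky launch -c mycluster task.yaml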
It is a bit scary to read that setup is run on all nodes ("what if my new setup conflicts with tasks already running?"). Should we add a tip box here reminding users that if you want to skip setup, you can use …

Good catch, done.

Executing a task on the head node only
-----------------------------------------
To execute a task on the head node only (a common scenario for tools like
``mpirun``), use the ``SKYPILOT_NODE_RANK`` environment variable as follows:

.. code-block:: yaml
   ...

   num_nodes: <n>

   run: |
     if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
       # Launch the head-only command here.
     fi
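As a concrete sketch of the same pattern (the hostfile path, the ``hostname`` command, and the ``mpirun`` flags below are illustrative assumptions, not from the original page), rank 0 can build a hostfile from ``SKYPILOT_NODE_IPS`` and drive all nodes from the head:

.. code-block:: yaml

   run: |
     if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
       # Write one IP per line for mpirun to consume.
       echo "$SKYPILOT_NODE_IPS" > ~/hostfile
       # Launch one process per reserved GPU on each node (flags are illustrative).
       mpirun --hostfile ~/hostfile --npernode $SKYPILOT_NUM_GPUS_PER_NODE hostname
     fi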

After reading this sentence, I'm not too clear what is being done. Either make this more specific (we create a new file and link it in ``~/.ssh/config``) or don't mention it at all (Sky is magic).

hmm, thinking more about this I think it's okay, since this is quickstart and details are not necessary. Curious folks can always ``cat ~/.ssh/config`` to see what's going on.

Yep, I think we shouldn't put too much detail here, but a pointer.