Docs: multi-node clarifications, and ssh into workers. #1363

Merged · merged 6 commits on Nov 11, 2022
19 changes: 17 additions & 2 deletions docs/source/getting-started/quickstart.rst
@@ -127,16 +127,31 @@ This may show multiple clusters, if you have created several:

SSH into clusters
=================
Simply run :code:`ssh <cluster_name>` to log into a cluster:

.. code-block:: console

$ ssh mycluster

:ref:`Multi-node clusters <dist-jobs>` work too:

.. code-block:: console

# Assuming 3 nodes.

# Head node.
$ ssh mycluster

# Worker nodes.
$ ssh mycluster-worker1
$ ssh mycluster-worker2

The above works because SkyPilot adds appropriate entries to ``~/.ssh/config``.
Collaborator: After reading this sentence, I'm not too clear what is being done. Either make this more specific (we create a new file and link it in ~/.ssh/config) or don't mention it at all (Sky is magic).

Collaborator: Hmm, thinking more about this, I think it's okay, since this is the quickstart and details are not necessary. Curious folks can always cat ~/.ssh/config to see what's going on.

Member Author: Yep, I think we shouldn't put too much detail here, but a pointer.
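
For the curious, the entries SkyPilot writes look roughly like the following sketch (the host alias matches the cluster name; the IP, user, and key path are illustrative assumptions, and the exact format may differ):

.. code-block:: text

   Host mycluster
     HostName 3.93.12.34          # head node's public IP (example value)
     User ubuntu                  # login user varies by cloud image
     IdentityFile ~/.ssh/sky-key  # assumed path to a SkyPilot-managed key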


Transfer files
===============

After a task's execution, use :code:`rsync` or :code:`scp` to download files (e.g., checkpoints):

.. code-block:: console

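# The command here is collapsed in this diff view; a typical invocation
# might look like this (paths are illustrative assumptions):
$ rsync -Pavz mycluster:/path/to/checkpoints ./checkpoints
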
86 changes: 65 additions & 21 deletions docs/source/running-jobs/distributed-jobs.rst
@@ -9,35 +9,40 @@ provisioning and distributed execution on many VMs.
For example, here is a simple PyTorch Distributed training example:

.. code-block:: yaml
:emphasize-lines: 6-6
Collaborator: nice! didn't know about this directive

name: resnet-distributed-app

resources:
  accelerators: V100:4

num_nodes: 2

setup: |
  pip3 install --upgrade pip
  git clone https://github.com/michaelzhiluo/pytorch-distributed-resnet
  cd pytorch-distributed-resnet
  # SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5).
  pip3 install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
  mkdir -p data && mkdir -p saved_models && cd data && \
    wget -c --quiet https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
  tar -xvzf cifar-10-python.tar.gz

run: |
  cd pytorch-distributed-resnet

  num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
  master_addr=`echo "$SKYPILOT_NODE_IPS" | head -n1`
  python3 -m torch.distributed.launch --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --nnodes=$num_nodes --node_rank=${SKYPILOT_NODE_RANK} --master_addr=$master_addr \
    --master_port=8008 resnet_ddp.py --num_epochs 20

In the above, :code:`num_nodes: 2` specifies that this task is to be run on 2
nodes, with each node having 4 V100s.


Environment variables
-----------------------------------------

SkyPilot exposes these environment variables that can be accessed in a task's ``run`` commands:

@@ -54,3 +59,42 @@

- :code:`SKYPILOT_NODE_IPS`: a string of IP addresses of the nodes reserved to execute
  the task, where each line contains one IP address; can be saved in the
  :code:`run` command with :code:`echo $SKYPILOT_NODE_IPS >> ~/sky_node_ips`.
- :code:`SKYPILOT_NUM_GPUS_PER_NODE`: number of GPUs reserved on each node to execute the
task; the same as the count in ``accelerators: <name>:<count>`` (rounded up if a fraction).
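
As a quick illustration, a hypothetical ``run`` section could read these variables as follows (the echo line is an assumed example, not part of the docs):

.. code-block:: yaml

   run: |
     # Derive cluster topology from SkyPilot-provided variables.
     num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
     head_ip=`echo "$SKYPILOT_NODE_IPS" | head -n1`
     echo "node ${SKYPILOT_NODE_RANK} of ${num_nodes}; head at ${head_ip}; ${SKYPILOT_NUM_GPUS_PER_NODE} GPUs/node"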


Launching a multi-node task (new cluster)
-----------------------------------------

When using ``sky launch`` to launch a multi-node task on **a new cluster**, the following happens in sequence:

1. Nodes are provisioned. (barrier)
2. Workdir/file_mounts are synced to all nodes. (barrier)
3. ``setup`` commands are executed on all nodes. (barrier)
4. ``run`` commands are executed on all nodes.
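
For example, if the YAML above were saved as ``resnet.yaml`` (an assumed file name), a fresh launch that walks through all four steps would be:

.. code-block:: console

   $ sky launch -c mycluster resnet.yaml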

Launching a multi-node task (existing cluster)
----------------------------------------------

When using ``sky launch`` to launch a multi-node task on **an existing cluster**, the cluster may have more nodes than the current task's ``num_nodes`` requirement.

The following happens in sequence:

1. SkyPilot checks that the runtime on all nodes is up-to-date. (barrier)
2. Workdir/file_mounts are synced to all nodes. (barrier)
3. ``setup`` commands are executed on **all nodes** of the cluster. (barrier)
4. ``run`` commands are executed on **the subset of nodes** scheduled to execute the task, which may be fewer than the cluster size.

Collaborator: It is a bit scary to read that setup is run on all nodes ("what if my new setup conflicts with tasks already running?"). Should we add a tip box here reminding users that if you want to skip setup, you can use sky exec to execute just the run command?

Member Author: Good catch, done.
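
As the review thread suggests, ``setup`` can be skipped entirely by using :code:`sky exec`, which runs only the ``run`` commands on an existing cluster (a sketch; task file name assumed):

.. code-block:: console

   $ sky exec mycluster resnet.yaml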

Executing a task on the head node only
-----------------------------------------
To execute a task on the head node only (a common scenario for tools like
``mpirun``), use the ``SKYPILOT_NODE_RANK`` environment variable as follows:

.. code-block:: yaml

...

num_nodes: <n>

run: |
if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
# Launch the head-only command here.
fi
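
For instance, a head-node-only MPI launch might look like the following sketch (the hostfile path and program name are illustrative assumptions):

.. code-block:: yaml

   run: |
     if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
       # Build a hostfile from the node IPs SkyPilot provides, then
       # launch one process per node from the head node only.
       echo "$SKYPILOT_NODE_IPS" > /tmp/hostfile
       num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
       mpirun --hostfile /tmp/hostfile -np $num_nodes ./my_mpi_program
     fi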