Docs: multi-node clarifications, and ssh into workers. #1363

Merged · merged 6 commits on Nov 11, 2022
19 changes: 17 additions & 2 deletions docs/source/getting-started/quickstart.rst
@@ -127,16 +127,31 @@ This may show multiple clusters, if you have created several:

SSH into clusters
=================
Simply run :code:`ssh <cluster_name>` to log into a cluster:

.. code-block:: console

$ ssh mycluster

:ref:`Multi-node clusters <dist-jobs>` work too:

.. code-block:: console

# Assuming 3 nodes.

# Head node.
$ ssh mycluster

# Worker nodes.
$ ssh mycluster-worker1
$ ssh mycluster-worker2

The above works because SkyPilot adds appropriate entries to ``~/.ssh/config``.
Collaborator: After reading this sentence, I'm not too clear what is being done. Either make this more specific (we create a new file and link it in ~/.ssh/config) or don't mention it at all (Sky is magic).

Collaborator: Hmm, thinking more about this, I think it's okay, since this is the quickstart and details are not necessary. Curious folks can always cat ~/.ssh/config to see what's going on.

Member Author: Yep, I think we shouldn't put too much detail here, but a pointer.
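
For the curious, the entries SkyPilot writes look roughly like the following sketch (the host alias matches the cluster name; the IP, user, and key path are illustrative assumptions, and the exact format may differ):

.. code-block:: text

   Host mycluster
     HostName 3.93.12.34          # head node's public IP (example value)
     User ubuntu                  # login user varies by cloud image
     IdentityFile ~/.ssh/sky-key  # assumed path to a SkyPilot-managed key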


Transfer files
===============

After a task's execution, use :code:`rsync` or :code:`scp` to download files (e.g., checkpoints):

.. code-block:: console

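# The command here is collapsed in this diff view; a typical invocation
# might look like this (paths are illustrative assumptions):
$ rsync -Pavz mycluster:/path/to/checkpoints ./checkpoints
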
86 changes: 65 additions & 21 deletions docs/source/running-jobs/distributed-jobs.rst
@@ -9,35 +9,40 @@ provisioning and distributed execution on many VMs.
For example, here is a simple PyTorch Distributed training example:

.. code-block:: yaml
:emphasize-lines: 6-6
Collaborator: nice! didn't know about this directive

name: resnet-distributed-app

resources:
  accelerators: V100:4

num_nodes: 2

setup: |
  pip3 install --upgrade pip
  git clone https://github.com/michaelzhiluo/pytorch-distributed-resnet
  cd pytorch-distributed-resnet
  # SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5).
  pip3 install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
  mkdir -p data && mkdir -p saved_models && cd data && \
    wget -c --quiet https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
  tar -xvzf cifar-10-python.tar.gz

run: |
  cd pytorch-distributed-resnet

  num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
  master_addr=`echo "$SKYPILOT_NODE_IPS" | head -n1`
  python3 -m torch.distributed.launch --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --nnodes=$num_nodes --node_rank=${SKYPILOT_NODE_RANK} --master_addr=$master_addr \
    --master_port=8008 resnet_ddp.py --num_epochs 20

In the above, :code:`num_nodes: 2` specifies that this task is to be run on 2
nodes, with each node having 4 V100s.


Environment variables
-----------------------------------------

SkyPilot exposes these environment variables that can be accessed in a task's ``run`` commands:

@@ -54,3 +59,42 @@

- :code:`SKYPILOT_NODE_IPS`: a string of IP addresses of the nodes reserved to execute
  the task, where each line contains one IP address; can be saved in the
  :code:`run` command with :code:`echo $SKYPILOT_NODE_IPS >> ~/sky_node_ips`.
- :code:`SKYPILOT_NUM_GPUS_PER_NODE`: number of GPUs reserved on each node to execute the
task; the same as the count in ``accelerators: <name>:<count>`` (rounded up if a fraction).
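
As a quick illustration, a hypothetical ``run`` section could read these variables as follows (the echo line is an assumed example, not part of the docs):

.. code-block:: yaml

   run: |
     # Derive cluster topology from SkyPilot-provided variables.
     num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
     head_ip=`echo "$SKYPILOT_NODE_IPS" | head -n1`
     echo "node ${SKYPILOT_NODE_RANK} of ${num_nodes}; head at ${head_ip}; ${SKYPILOT_NUM_GPUS_PER_NODE} GPUs/node"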


Launching a multi-node task (new cluster)
-----------------------------------------

When using ``sky launch`` to launch a multi-node task on **a new cluster**, the following happens in sequence:

1. Nodes are provisioned. (barrier)
2. Workdir/file_mounts are synced to all nodes. (barrier)
3. ``setup`` commands are executed on all nodes. (barrier)
4. ``run`` commands are executed on all nodes.
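
For example, if the YAML above were saved as ``resnet.yaml`` (an assumed file name), a fresh launch that walks through all four steps would be:

.. code-block:: console

   $ sky launch -c mycluster resnet.yaml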

Launching a multi-node task (existing cluster)
----------------------------------------------

When using ``sky launch`` to launch a multi-node task on **an existing cluster**, the cluster may have more nodes than the current task's ``num_nodes`` requirement.

The following happens in sequence:

1. SkyPilot checks that the runtime on all nodes is up-to-date. (barrier)
2. Workdir/file_mounts are synced to all nodes. (barrier)
3. ``setup`` commands are executed on **all nodes** of the cluster. (barrier)
4. ``run`` commands are executed on **the subset of nodes** scheduled to execute the task, which may be fewer than the cluster size.

Collaborator: It is a bit scary to read that setup is run on all nodes ("what if my new setup conflicts with tasks already running?"). Should we add a tip box here reminding users that if you want to skip setup, you can use sky exec to execute just the run command?

Member Author: Good catch, done.
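
As the review thread suggests, ``setup`` can be skipped entirely by using :code:`sky exec`, which runs only the ``run`` commands on an existing cluster (a sketch; task file name assumed):

.. code-block:: console

   $ sky exec mycluster resnet.yaml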

Executing a task on the head node only
-----------------------------------------
To execute a task on the head node only (a common scenario for tools like
``mpirun``), use the ``SKYPILOT_NODE_RANK`` environment variable as follows:

.. code-block:: yaml

...

num_nodes: <n>

run: |
if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
# Launch the head-only command here.
fi
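
For instance, a head-node-only MPI launch might look like the following sketch (the hostfile path and program name are illustrative assumptions):

.. code-block:: yaml

   run: |
     if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
       # Build a hostfile from the node IPs SkyPilot provides, then
       # launch one process per node from the head node only.
       echo "$SKYPILOT_NODE_IPS" > /tmp/hostfile
       num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
       mpirun --hostfile /tmp/hostfile -np $num_nodes ./my_mpi_program
     fi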