From 2e3bb48fba2b562049c1df5cfd0f4047ee45addd Mon Sep 17 00:00:00 2001
From: Zongheng Yang
Date: Fri, 31 Mar 2023 21:25:45 -0700
Subject: [PATCH] Update Managed Spot docs. (#1830)

---
 docs/source/examples/spot-jobs.rst | 40 ++++++++++++++++++++++++------
 1 file changed, 32 insertions(+), 8 deletions(-)

diff --git a/docs/source/examples/spot-jobs.rst b/docs/source/examples/spot-jobs.rst
index 9bad013ba27..2ddd47a1fad 100644
--- a/docs/source/examples/spot-jobs.rst
+++ b/docs/source/examples/spot-jobs.rst
@@ -9,14 +9,18 @@ This feature **saves significant cost** (e.g., up to 70\% for GPU VMs) by making
 SkyPilot automatically finds available spot resources across regions and clouds to maximize availability.
 Here is an example of a BERT training job failing over different regions across AWS and GCP.

+.. image:: https://i.imgur.com/Vteg3fK.gif
+  :width: 600
+  :alt: GIF for BERT training on Spot V100
+
 .. image:: ../images/spot-training.png
   :width: 600
-  :alt: BERT training on Spot V100
+  :alt: Static plot, BERT training on Spot V100

-Below are the requirements for using managed spot jobs:
+To use managed spot jobs, there are two requirements:

 #. **Task YAML**: Managed Spot requires a YAML to describe the job, tested with :code:`sky launch`.
-#. **Checkpointing and recovery** (optional): For job recovery with less progress resuming, application code can checkpoint periodically to a :ref:`SkyPilot Storage `-mounted cloud bucket. The program can reload the latest checkpoint when restarted.
+#. **Checkpointing** (optional): For job recovery due to preemptions, the user application code can checkpoint its progress periodically to a :ref:`SkyPilot Storage `-mounted cloud bucket. The program can reload the latest checkpoint when restarted.


 Task YAML
@@ -183,27 +187,47 @@ Useful CLIs

 Here are some commands for managed spot jobs. Check :code:`sky spot --help` for more details.

+See all spot jobs:
+
 .. code-block:: console

-   # Check the status of the spot jobs
    $ sky spot queue
+
+.. code-block:: console
+
    Fetching managed spot job statuses...
    Managed spot jobs:
    ID  NAME     RESOURCES    SUBMITTED  TOT. DURATION  JOB DURATION  #RECOVERIES  STATUS
    2   roberta  1x [A100:8]  2 hrs ago  2h 47m 18s     2h 36m 18s    0            RUNNING
    1   bert-qa  1x [V100:1]  4 hrs ago  4h 24m 26s     4h 17m 54s    0            RUNNING

-   # Stream the logs of a running spot job
-   $ sky spot logs -n bert-qa
+Stream the logs of a running spot job:
+
+.. code-block:: console
+
+   $ sky spot logs -n bert-qa  # by name
+   $ sky spot logs 2           # by job ID

-   # Cancel a spot job by name
-   $ sky spot cancel -n bert-qa
+Cancel a spot job:
+
+.. code-block:: console
+
+   $ sky spot cancel -n bert-qa  # by name
+   $ sky spot cancel 2           # by job ID

 .. note::
   If any failure happens for a spot job, you can check :code:`sky spot queue -a` for the brief reason of the failure. For more details, it would be helpful to check :code:`sky spot logs --controller `.

+Real-world examples
+-------------------------
+
+* `Vicuna `_ LLM chatbot: `instructions `_, `YAML `_
+* BERT (shown above): `YAML `_
+* PyTorch DDP, ResNet: `YAML `_
+* PyTorch Lightning DDP, CIFAR-10: `YAML `_
+
 Spot controller (Advanced)
 -------------------------------