Commit

Merge branch 'master' into yaml-schema-update
iojw committed Jun 3, 2023
2 parents 4b2f61f + 430ba74 commit b020963
Showing 106 changed files with 7,081 additions and 1,191 deletions.
10 changes: 5 additions & 5 deletions .github/workflows/format.yml
@@ -13,9 +13,7 @@ on:
- 'releases/**'
jobs:
format:
# Need to specify 20.04, because ubuntu-latest does not work with
# python 3.6: https://github.com/actions/setup-python/issues/355#issuecomment-1335042510
runs-on: ubuntu-20.04
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.8"]
@@ -36,9 +34,11 @@ jobs:
yapf --diff --recursive ./ --exclude 'sky/skylet/ray_patches/**' \
--exclude 'sky/skylet/providers/aws/**' \
--exclude 'sky/skylet/providers/gcp/**' \
--exclude 'sky/skylet/providers/azure/**'
--exclude 'sky/skylet/providers/azure/**' \
--exclude 'sky/skylet/providers/ibm/**'
- name: Running black
run: |
black --diff --check sky/skylet/providers/aws/ \
sky/skylet/providers/gcp/ \
sky/skylet/providers/azure/
sky/skylet/providers/azure/ \
sky/skylet/providers/ibm/
4 changes: 1 addition & 3 deletions .github/workflows/mypy-generic.yml
@@ -15,8 +15,6 @@ on:
- 'releases/**'
jobs:
mypy:
# Need to specify 20.04, because ubuntu-latest does not work with
# python 3.6: https://github.com/actions/setup-python/issues/355#issuecomment-1335042510
runs-on: ubuntu-20.04
runs-on: ubuntu-latest
steps:
- run: 'echo "No mypy to run"'
4 changes: 1 addition & 3 deletions .github/workflows/mypy.yml
@@ -13,9 +13,7 @@ on:
- 'releases/**'
jobs:
mypy:
# Need to specify 20.04, because ubuntu-latest does not work with
# python 3.6: https://github.com/actions/setup-python/issues/355#issuecomment-1335042510
runs-on: ubuntu-20.04
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.8"]
6 changes: 2 additions & 4 deletions .github/workflows/pylint.yml
@@ -14,12 +14,10 @@ on:

jobs:
pylint:
# Need to specify 20.04, because ubuntu-latest does not work with
# python 3.6: https://github.com/actions/setup-python/issues/355#issuecomment-1335042510
runs-on: ubuntu-20.04
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.6"]
python-version: ["3.8"]
steps:
- uses: actions/checkout@v2
- name: Set up Python ${{ matrix.python-version }}
4 changes: 1 addition & 3 deletions .github/workflows/pytest-generic.yml
@@ -14,8 +14,6 @@ on:
- 'releases/**'
jobs:
python-test:
# Need to specify 20.04, because ubuntu-latest does not work with
# python 3.6: https://github.com/actions/setup-python/issues/355#issuecomment-1335042510
runs-on: ubuntu-20.04
runs-on: ubuntu-latest
steps:
- run: 'echo "No tests to run"'
8 changes: 3 additions & 5 deletions .github/workflows/pytest.yml
@@ -14,7 +14,7 @@ jobs:
python-test:
strategy:
matrix:
python-version: [3.6]
python-version: [3.8]
test-path:
- tests/test_cli.py
- tests/test_config.py
@@ -27,9 +27,7 @@ jobs:
- tests/test_wheels.py
- tests/test_spot.py
- tests/test_yaml_parser.py
# Need to specify 20.04, because ubuntu-latest does not work with
# python 3.6: https://github.com/actions/setup-python/issues/355#issuecomment-1335042510
runs-on: ubuntu-20.04
runs-on: ubuntu-latest
steps:
- name: Checkout repository
uses: actions/checkout@v2
@@ -55,4 +53,4 @@ jobs:
pip install pytest pytest-xdist "pytest-env>=0.6"
- name: Run tests with pytest
run: SKYPILOT_DISABLE_USAGE_COLLECTION=1 SKYPILOT_SKIP_CLOUD_IDENTITY_CHECK=1 pytest ${{ matrix.test-path }}
run: SKYPILOT_DISABLE_USAGE_COLLECTION=1 SKYPILOT_SKIP_CLOUD_IDENTITY_CHECK=1 pytest -n 1 --dist no ${{ matrix.test-path }}
24 changes: 24 additions & 0 deletions .github/workflows/stale.yml
@@ -0,0 +1,24 @@
name: 'Close stale issues and PRs'
on:
schedule:
- cron: '30 1 * * *'

jobs:
stale:
runs-on: ubuntu-latest
steps:
- uses: actions/stale@v8
with:
stale-issue-message: 'This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.'
stale-pr-message: 'This PR is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.'
close-issue-message: 'This issue was closed because it has been stalled for 10 days with no activity.'
close-pr-message: 'This PR was closed because it has been stalled for 10 days with no activity.'
days-before-issue-stale: 120
days-before-pr-stale: 120
days-before-issue-close: 10
days-before-pr-close: 10
exempt-issue-labels: 'P0,P1'
exempt-pr-labels: 'P0,P1'
ascending: true
operations-per-run: 100

2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -16,7 +16,7 @@ We use GitHub to track issues and features. For new contributors, we recommend l

### Installing SkyPilot for development
```bash
# SkyPilot requires python >= 3.6.
# SkyPilot requires python >= 3.7.
# You can just install the dependencies for
# certain clouds, e.g., ".[aws,azure,gcp,lambda]"
pip install -e ".[all]"
13 changes: 7 additions & 6 deletions README.md
@@ -28,14 +28,14 @@
----
:fire: :dromedary_camel: *News* :dromedary_camel: :fire:
- [April, 2023] **[**SkyPilot YAMLs released**](./llm/vicuna/) for finetuning & serving the Vicuna model with a single command**!
- [March, 2023] **[Vicuna LLM chatbot](https://vicuna.lmsys.org/) trained** [**using SkyPilot**](./llm/vicuna/) **for $300 on spot instances!**
- [March, 2023] **[Vicuna LLM chatbot](https://lmsys.org/blog/2023-03-30-vicuna/) trained** [**using SkyPilot**](./llm/vicuna/) **for $300 on spot instances!**
- [March, 2023] *Serve* your own LLaMA LLM chatbot (not finetuned) on any cloud: [**example**](./llm/llama-chatbots/), [**repo**](https://github.com/skypilot-org/sky-llama)
----

SkyPilot is a framework for easily and cost effectively running ML workloads[^1] on any cloud.

SkyPilot abstracts away the cloud infra burden:
- Launch jobs & clusters on any cloud (AWS, Azure, GCP, Lambda Cloud)
- Launch jobs & clusters on any cloud (AWS, Azure, GCP, Lambda Cloud, IBM, Samsung)
- Find scarce resources across zones/regions/clouds
- Queue jobs & use cloud object stores

@@ -47,9 +47,9 @@ SkyPilot cuts your cloud costs:

SkyPilot supports your existing GPU, TPU, and CPU workloads, with no code changes.

Install with pip (choose your clouds) or [from source](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html):
Install with pip or [from source](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html):
```
pip install "skypilot[aws,gcp,azure,lambda]"
pip install "skypilot[aws,gcp,azure,lambda,ibm,scp]" # choose your clouds
```

## Getting Started
@@ -121,8 +121,9 @@ Refer to [Quickstart](https://skypilot.readthedocs.io/en/latest/getting-started/
- Framework examples: [PyTorch DDP](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_torch.yaml), [Distributed](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_tf_app.py) [TensorFlow](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_app_storage.yaml), [JAX/Flax on TPU](https://github.com/skypilot-org/skypilot/blob/master/examples/tpu/tpuvm_mnist.yaml), [Stable Diffusion](https://github.com/skypilot-org/skypilot/tree/master/examples/stable_diffusion), [Detectron2](https://github.com/skypilot-org/skypilot/blob/master/examples/detectron2_docker.yaml), [programmatic grid search](https://github.com/skypilot-org/skypilot/blob/master/examples/huggingface_glue_imdb_grid_search_app.py), [Docker](https://github.com/skypilot-org/skypilot/blob/master/examples/docker/echo_app.yaml), and [many more](./examples).

More information:
- [Project blog](https://blog.skypilot.co/)
- [Introductory blog post](https://blog.skypilot.co/introducing-skypilot/)
- [SkyPilot Blog](https://blog.skypilot.co/)
- [Introductory blog post](https://blog.skypilot.co/introducing-skypilot/)
- [NSDI 2023 paper & talk](https://www.usenix.org/conference/nsdi23/presentation/yang-zongheng)

## Issues, feature requests, and questions
We are excited to hear your feedback!
2 changes: 1 addition & 1 deletion docs/source/examples/spot-jobs.rst
@@ -223,7 +223,7 @@ Cancel a spot job:
Real-world examples
-------------------------

* `Vicuna <https://vicuna.lmsys.org/>`_ LLM chatbot: `instructions <https://github.com/skypilot-org/skypilot/tree/master/examples/vicuna-llm>`_, `YAML <https://github.com/lm-sys/FastChat/blob/main/scripts/train-alpaca.yaml>`_
* `Vicuna <https://vicuna.lmsys.org/>`_ LLM chatbot: `instructions <https://github.com/skypilot-org/skypilot/tree/master/llm/vicuna>`_, `YAML <https://github.com/skypilot-org/skypilot/blob/master/llm/vicuna/train.yaml>`_
* BERT (shown above): `YAML <https://github.com/skypilot-org/skypilot/blob/master/examples/spot/bert_qa.yaml>`_
* PyTorch DDP, ResNet: `YAML <https://github.com/skypilot-org/skypilot/blob/master/examples/spot/resnet.yaml>`_
* PyTorch Lightning DDP, CIFAR-10: `YAML <https://github.com/skypilot-org/skypilot/blob/master/examples/spot/lightning_cifar10.yaml>`_
48 changes: 44 additions & 4 deletions docs/source/getting-started/installation.rst
@@ -5,7 +5,7 @@ Install SkyPilot using pip:

.. code-block:: console
$ # SkyPilot requires python >= 3.6. For Apple Silicon, use >= 3.8.
$ # SkyPilot requires python >= 3.7. For Apple Silicon, use >= 3.8.
$ # Recommended: use a new conda env to avoid package conflicts.
$ conda create -y -n sky python=3.8
$ conda activate sky
@@ -17,9 +17,9 @@ Install SkyPilot using pip:
$ # pip install "skypilot[lambda]"
$ # pip install "skypilot[all]"
SkyPilot currently supports five cloud providers: AWS, GCP, Azure, Lambda Cloud and Cloudflare (R2).
SkyPilot currently supports seven cloud providers: AWS, GCP, Azure, Lambda Cloud, IBM, SCP, and Cloudflare (for R2 object store).
If you only have access to certain clouds, use any combination of
:code:`"[aws,azure,gcp,lambda,cloudflare]"` (e.g., :code:`"[aws,gcp]"`) to reduce the
:code:`"[aws,azure,gcp,lambda,cloudflare,scp]"` (e.g., :code:`"[aws,gcp]"`) to reduce the
dependencies installed.

You may also install SkyPilot from source.
@@ -107,6 +107,23 @@ Lambda Cloud
$ mkdir -p ~/.lambda_cloud
$ echo "api_key = <your_api_key_here>" > ~/.lambda_cloud/lambda_keys
IBM
~~~~~~~~~

To access IBM's services, store the following fields in ``~/.ibm/credentials.yaml``:

.. code-block:: text
iam_api_key: <user_personal_api_key>
resource_group_id: <resource_group_user_is_a_member_of>
- Create a new API key by following `this guide <https://www.ibm.com/docs/en/app-connect/container?topic=servers-creating-cloud-api-key>`_.
- Obtain a resource group's ID from the `web console <https://cloud.ibm.com/account/resource-groups>`_.

.. note::
Stock images do not currently provide ML tools out of the box.
Create private images with the necessary tools (e.g., CUDA) by following the IBM segment in `this documentation <https://github.com/skypilot-org/skypilot/blob/master/docs/source/reference/yaml-spec.rst>`_.

Cloudflare R2
~~~~~~~~~~~~~~~~~~

@@ -118,7 +135,7 @@ SkyPilot can download/upload data to R2 buckets and mount them as local filesyst
$ # Install boto
$ pip install boto3
$ # Configure your R2 credentials
$ aws configure --profile r2
$ AWS_SHARED_CREDENTIALS_FILE=~/.cloudflare/r2.credentials aws configure --profile r2
In the prompt, enter your R2 Access Key ID and Secret Access Key (see `instructions to generate R2 credentials <https://developers.cloudflare.com/r2/data-access/s3-api/tokens/>`_). Select :code:`auto` for the default region and :code:`json` for the default output format.

@@ -140,6 +157,28 @@ Next, get your `Account ID <https://developers.cloudflare.com/fundamentals/get-s

Support for R2 is in beta. Please report any issues on `GitHub <https://github.com/skypilot-org/skypilot/issues>`_ or reach out to us on `Slack <http://slack.skypilot.co/>`_.


SCP
~~~~~~~~~~~~~~~~~~

Samsung Cloud Platform (SCP) provides cloud services optimized for enterprise customers. You can learn more about SCP `here <https://cloud.samsungsds.com/>`__.

To configure SCP access, you need access keys and the ID of the project in which your tasks will run. Go to the `Access Key Management <https://cloud.samsungsds.com/console/#/common/access-key-manage/list?popup=true>`_ page on your SCP console to generate the access keys, and to the Project Overview page for the project ID. Then, add them to :code:`~/.scp/scp_credential` by running:

.. code-block:: console
$ # Create directory if required
$ mkdir -p ~/.scp
$ # Add the lines for "access_key", "secret_key", and "project_id" to scp_credential file
$ echo "access_key = <your_access_key>" >> ~/.scp/scp_credential
$ echo "secret_key = <your_secret_key>" >> ~/.scp/scp_credential
$ echo "project_id = <your_project_id>" >> ~/.scp/scp_credential
.. note::

Multi-node clusters are currently not supported on SCP.
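
The ``key = value`` layout written by the ``echo`` commands above is simple enough to sanity-check before running ``sky check``. The following is a minimal sketch, not SkyPilot's own loader: ``parse_scp_credential``, ``missing_keys``, and ``REQUIRED_KEYS`` are hypothetical names introduced here for illustration, and the parsing assumes exactly the ``key = value`` format shown in this section.

```python
from pathlib import Path

# Keys this section says the credential file must contain.
REQUIRED_KEYS = ("access_key", "secret_key", "project_id")


def parse_scp_credential(text: str) -> dict:
    """Parse `key = value` lines into a dict (hypothetical helper)."""
    creds = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        key, _, value = line.partition("=")
        creds[key.strip()] = value.strip()
    return creds


def missing_keys(creds: dict) -> list:
    """Return the required keys that are absent or empty."""
    return [k for k in REQUIRED_KEYS if not creds.get(k)]


if __name__ == "__main__":
    path = Path.home() / ".scp" / "scp_credential"
    text = path.read_text() if path.exists() else ""
    print("missing:", missing_keys(parse_scp_credential(text)))
```

Because the commands append with ``>>``, rerunning them leaves duplicate lines in the file; in this sketch the last occurrence of a key wins.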


.. _verify-cloud-access:

Verifying cloud access
@@ -160,6 +199,7 @@ This will produce a summary like:
GCP: enabled
Azure: enabled
Lambda: enabled
SCP: enabled
SkyPilot will use only the enabled clouds to run tasks. To change this, configure cloud credentials, and run :code:`sky check`.
3 changes: 2 additions & 1 deletion docs/source/index.rst
@@ -40,8 +40,9 @@ SkyPilot supports your existing GPU, TPU, and CPU workloads, with no code change

**More information**

* `Project blog <https://blog.skypilot.co/>`_
* `SkyPilot blog <https://blog.skypilot.co/>`_
* `Introductory blog post <https://blog.skypilot.co/introducing-skypilot/>`_
* `NSDI 2023 paper & talk <https://www.usenix.org/conference/nsdi23/presentation/yang-zongheng>`_
* `SkyPilot Tutorials <https://github.com/skypilot-org/skypilot-tutorial>`_
* Framework examples: `PyTorch DDP <https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_torch.yaml>`_, `Distributed <https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_tf_app.py>`_ `TensorFlow <https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_app_storage.yaml>`_, `JAX/Flax on TPU <https://github.com/skypilot-org/skypilot/blob/master/examples/tpu/tpuvm_mnist.yaml>`_, `Stable Diffusion <https://github.com/skypilot-org/skypilot/tree/master/examples/stable_diffusion>`_, `Detectron2 <https://github.com/skypilot-org/skypilot/blob/master/examples/detectron2_docker.yaml>`_, `programmatic grid search <https://github.com/skypilot-org/skypilot/blob/master/examples/huggingface_glue_imdb_grid_search_app.py>`_, `Docker <https://github.com/skypilot-org/skypilot/blob/master/examples/docker/echo_app.yaml>`_, and `many more <https://github.com/skypilot-org/skypilot/tree/master/examples>`_.

8 changes: 6 additions & 2 deletions docs/source/reference/cli.rst
@@ -97,8 +97,8 @@ Storage CLI
:prog: sky storage delete
:nested: full

Utils: ``show-gpus``, ``check``
---------------------------------------
Utils: ``show-gpus``/``check``/``cost-report``
-------------------------------------------------


.. click:: sky.cli:show_gpus
@@ -108,3 +108,7 @@ Utils: ``show-gpus``, ``check``
.. click:: sky.cli:check
:prog: sky check
:nested: full

.. click:: sky.cli:cost_report
:prog: sky cost-report
:nested: full
10 changes: 5 additions & 5 deletions docs/source/reference/local/setup.rst
@@ -5,24 +5,24 @@ Setting up Local Cluster

Prerequisites
-------------
To ensure sky nodes can communicate with each other, SkyPilot On-prem requires the system admin to open up all ports from :code:`10001` to :code:`19999`, inclusive, on all nodes. This is how SkyPilot differentiates input/output for multiple worker processes on a single node. In addition, SkyPilot requires port :code:`8265` for Ray Dashboard on all nodes.
To ensure sky nodes can communicate with each other, SkyPilot On-prem requires the system admin to open up all ports from :code:`10001` to :code:`19999`, inclusive, on all nodes. This is how SkyPilot differentiates input/output for multiple worker processes on a single node. In addition, SkyPilot requires port :code:`8266` for Ray Dashboard on all nodes.

For the head node, SkyPilot requires port :code:`6379` for the GCS server on Ray.
For the head node, SkyPilot requires port :code:`6380` for the GCS server on Ray.

For further reference, `here <https://docs.ray.io/en/latest/ray-core/configure.html#ports-configurations>`_ are the required ports directly from the Ray docs.

Installing SkyPilot dependencies
-----------------------------------

SkyPilot On-prem requires :code:`python3`, :code:`ray==2.0.1`, and :code:`sky` to be set up on all local nodes and globally available to all users.
SkyPilot On-prem requires :code:`python3`, :code:`ray==2.4.0`, and :code:`sky` to be set up on all local nodes and globally available to all users.

To install Ray and SkyPilot for all users, run the following commands on all local nodes:

.. code-block:: console
$ pip3 install ray[default]==2.0.1
$ pip3 install ray[default]==2.4.0
$ # SkyPilot requires python >= 3.6.
$ # SkyPilot requires python >= 3.7.
$ pip3 install skypilot
10 changes: 10 additions & 0 deletions docs/source/reference/yaml-spec.rst
@@ -121,6 +121,16 @@ Available fields:
# GCP
# To find GCP images: https://cloud.google.com/compute/docs/images
# image_id: projects/deeplearning-platform-release/global/images/family/tf2-ent-2-1-cpu-ubuntu-2004
#
# IBM
# Create a private VPC image and paste its ID in the following format:
# image_id: <unique_image_id>
# To create an image manually:
# https://cloud.ibm.com/docs/vpc?topic=vpc-creating-and-using-an-image-from-volume.
# To use an official VPC image creation tool:
# https://www.ibm.com/cloud/blog/use-ibm-packer-plugin-to-create-custom-images-on-ibm-cloud-vpc-infrastructure
# To use a more limited but easier to manage tool:
# https://github.com/IBM/vpc-img-inst
file_mounts:
# Uses rsync to sync local files/directories to all nodes of the cluster.
17 changes: 17 additions & 0 deletions docs/source/running-jobs/distributed-jobs.rst
@@ -104,3 +104,20 @@ To execute a task on the head node only (a common scenario for tools like
if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
# Launch the head-only command here.
fi
SSH into worker nodes
---------------------
In addition to the head node, the SSH configurations for the worker nodes of a multi-node cluster are also added to ``~/.ssh/config`` as ``<cluster_name>-worker<n>``.
This allows you to SSH directly into the worker nodes, if required.

.. code-block:: console
# Assuming 3 nodes in a cluster named mycluster
# Head node.
$ ssh mycluster
# Worker nodes.
$ ssh mycluster-worker1
$ ssh mycluster-worker2
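
The naming scheme above can be sketched in a few lines, for example to loop over every node of a cluster from a script. This is only an illustration: ``ssh_aliases`` is a hypothetical helper, not part of SkyPilot, and it assumes that workers are numbered contiguously from 1, as the three-node example suggests.

```python
def ssh_aliases(cluster_name: str, num_nodes: int) -> list:
    """Return SSH host aliases: head node first, then workers 1..num_nodes-1."""
    if num_nodes < 1:
        raise ValueError("a cluster has at least one node")
    # Assumption: worker aliases follow <cluster_name>-worker<n>, n starting at 1.
    workers = [f"{cluster_name}-worker{i}" for i in range(1, num_nodes)]
    return [cluster_name] + workers


if __name__ == "__main__":
    for host in ssh_aliases("mycluster", 3):
        print(f"ssh {host}")
```

For a three-node cluster named ``mycluster`` this yields the same three hosts as the commands above.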
23 changes: 23 additions & 0 deletions examples/autogluon.yaml
@@ -0,0 +1,23 @@
resources:
cloud: gcp

setup: |
git clone https://github.com/autogluon/autogluon.git
conda activate autogluon
if [ $? -eq 0 ]; then
echo 'conda env exists'
else
conda create -n autogluon python=3.8 -y
conda activate autogluon
pip install torch==1.13.1+cpu torchvision==0.14.1+cpu -f https://download.pytorch.org/whl/cpu/torch_stable.html
pip install autogluon
# Ray + Torch Dataloader failed with latest grpcio
# See: https://github.com/ray-project/ray/pull/33903
pip install grpcio==1.51.3
fi
run: |
conda activate autogluon
cd autogluon
python examples/automm/tabular_dl/example_tabular.py --mode single_hpo