DAG API Enhancements: Introducing Downstream Task Parsing and Explicit Flow Definition (#4067)

* provide an example, edited from pipeline.yml

* more focus on dependencies for user dag lib

* more powerful user interface

* load and dump new yaml format

* fix

* fix: reversed logic in add_edge

* [docs] Unroll k8s internal load balancer docs (#4083)

unroll load balancer docs

* rename

* refactor due to reviewer's comments

* generate task.name if not given

* [docs] `sky status --kubernetes` docs (#4064)

* observability docs

* comments

* [UX] Show log after failure and fix the color issue with narrow window (#4084)

* fix narrow window and show log path during exception

* format

* format

* [k8s] `sky status --k8s` refactor (#4079)

* refactor

* lint

* refactor, dataclass

* refactor, dataclass

* refactor

* lint

* add comments for add_edge

* add `print_exception_no_traceback` when raise

* make `Dag.tasks` a property

* print dependencies for `__repr__`

* move `get_unique_task_name` to common_utils

* [Performance] Use new GCP custom images (#4027)

* [Performance] Use new custom image to create GCP GPU VMs

* update image tags for both CPU and GPU

* always generate .sky/python_path

---------

Co-authored-by: Yika Luo <[email protected]>

* [GCP] Add H100 mega (#4099)

* Add H100 mega support on GCP

* fix for some other regions

* format

* fix resource type

* fix catalog fetching

* [GCP] Add gVNIC support (#4095)

* add gvnic support through config.yaml

* lint

* docs

* [Lambda] Lambda Cloud SkyPilot provisioner (#3865)

* feat: lambda cloud new provisioner

* feat: address cblmemo reviews and other reviews + make multi-node work again

* fix: quotes

* fix: address some reviews

* chore: rm unused option

* chore: update typedef

* feat: use lists directly

* fix: formatting

* chore: address reviews

* fix: formatting

* chore: rm query ports since default impl per review

* feat: add back query ports

* fix: formatting

* chore: add newline at eof

* feat: try removing query ports again

* [Docs] GKE Nvidia Driver installation instructions update (#4106)

* docs

* docs

* docs

* [Performance] Use new AWS custom images (#4091)

* rename methods to use downstream/edge terminology

* [Performance] Add Packer image generation scripts for GCP and AWS (#4068)

* [Performance] Add Packer image generation scripts for GCP and AWS

* Add docker install and tests

* solve nvidia container issue

* Install cuDNN

* [Performance] Scripts to copy/delete AWS images for all regions and add cloud deps (#4073)

* [Performance] Add AWS script to copy images for all regions

* script to delete all AWS images across regions

* Add cloud dependencies to image

---------

Co-authored-by: Yika Luo <[email protected]>

* Disable AWS images.csv refreshing (#4116)

* [Docs] .skyignore doc (#4114)

* [Docs] .skyignore doc

* Correct typos

Co-authored-by: Zongheng Yang <[email protected]>

---------

Co-authored-by: Zongheng Yang <[email protected]>

* [Core] Raise error for non-existent cluster when endpoint is called (#4117)

raise error for non-existent cluster

* Refresh local aws images.csv when image not found (#4127)

Refresh local aws images.csv by pulling from github catalog when image tag not found

* [Docs] News revamps. (#4126)

* News revamps.

updates

updates

updates

updates

updates

updates

updates

updates

* Apply suggestions from code review

Co-authored-by: Zhanghao Wu <[email protected]>

---------

Co-authored-by: Zhanghao Wu <[email protected]>

* [Serve] Support manually terminating a replica and with purge option (#4032)

* define replica id param in cli

* create endpoint on controller

* call controller endpoint to scale down replica

* add classmethod decorator

* add handler methods for readability in cli

* update docstr and error msg, and inline in cli

* update log and return err msg

* add docstr, catch and reraise err, add stopped and nonexistent message

* inline constant to avoid circular import

* fix error statement and return encoded str

* add purge feature

* add purge replica usage in docstr

* use .get to handle unexpected packages

* fix: diff terminate replica when failed/purging or not

* fix: stay up to date for `is_controller_accessible`

* revert

* up to date with current APIs

* error handling

* when purged remove record in the main loop

* refactor due to reviewer's suggestions

* combine functions

* fix: terminate the healthy replica even with purge option

* remove abbr

* Update sky/serve/core.py

Co-authored-by: Tian Xia <[email protected]>

* Update sky/serve/core.py

Co-authored-by: Tian Xia <[email protected]>

* Update sky/serve/controller.py

Co-authored-by: Tian Xia <[email protected]>

* Update sky/serve/controller.py

Co-authored-by: Tian Xia <[email protected]>

* Update sky/cli.py

Co-authored-by: Tian Xia <[email protected]>

* got services hint

* check if not yes in the outside if branch

* fix some output messages

* Update sky/serve/core.py

Co-authored-by: Tian Xia <[email protected]>

* set conflict status code for already scheduled termination

* combine purge and normal terminating down branch together

* bump version

* global exception handler to render a json response with error messages

* fix: use responses.JSONResponse for dict serialize

* error messages for old controller

* fix: check version mismatch in generated code

* revert mistakenly change update_service

* refine already in terminating message

* fix: branch code workaround in cls.build

* wording

Co-authored-by: Tian Xia <[email protected]>

* refactor due to reviewer's comments

* fix use ux_utils

Co-authored-by: Tian Xia <[email protected]>

* add changelog as comments

* fix messages

* edit the message for mismatch error

Co-authored-by: Tian Xia <[email protected]>

* no traceback when raising in `terminate_replica`

* messages decode

* Apply suggestions from code review

Co-authored-by: Tian Xia <[email protected]>

* format

* format

* Empty commit

---------

Co-authored-by: David Tran <[email protected]>
Co-authored-by: David Tran <[email protected]>
Co-authored-by: Tian Xia <[email protected]>
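
The bullets above describe the replica-termination flow added in this PR: a
replica-id parameter on `sky serve down` plus a `--purge` option, backed by a
`terminate_replica` handler on the controller. A hedged usage sketch (the flag
names follow the bullets above and may differ from the released CLI):

    # Hypothetical usage, assuming the flags described in the bullets above.
    # Terminate replica 3 of the service "my-service":
    $ sky serve down my-service --replica-id 3

    # Also clean up the record of a failed replica:
    $ sky serve down my-service --replica-id 3 --purge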

* [Provisioner] Support docker in Lambda Cloud and TPU (#4115)

* [Provisioner] Support docker in Lambda Cloud

* fix permission issue

* merge with check docker installed

* add tpu support & test

* patch lambda cloud

* add comment

* Apply suggestions from code review

Co-authored-by: Tian Xia <[email protected]>

* change wording all to up/downstream style

* Add unique suffix to task names, fallback to timestamp if unnamed

* Unify handling of single and multiple tasks without dependencies

* Refactor tasks initialization: use list comprehension and fail fast

* Fix remove task dependency description: upstream, not downstream

Co-authored-by: Tian Xia <[email protected]>

* Remove duplicated `self.edges`, use nx api instead

* [Serve] Add `ux_utils.print_exception_no_traceback()` for cleaner error output (#4111)

* add `ux_utils.print_exception_no_traceback()` for cleaner error output

* Empty commit

* remove unnecessary with block

* Partially revert: Remove unnecessary `ux_utils.print_exception_no_traceback()` wrappers (#4130)

fix unnecessary with block for returning

* Revert "Add unique suffix to task names, fallback to timestamp if unnamed"

Otherwise, users cannot refer to the task by name in the DAG.

This reverts commit 8486352.

* comment the checking used as upstream logic

* [examples] Deepspeed fixes + k8s support (#4124)

deepspeed kubernetes fixes

* Empty commit

* [OCI] Support more OS types in addition to ubuntu (#4080)

* Bug fix for sky config file path resolution.

* format

* [OCI] Bug fix for image_id in Task YAML

* [OCI]: Support more OS types (esp. oraclelinux) in addition to ubuntu.

* format

* Disable system firewall

* Bug fix for validation of the Marketplace images

* Update sky/clouds/oci.py

Co-authored-by: Zhanghao Wu <[email protected]>

* Update sky/clouds/oci.py

Co-authored-by: Zhanghao Wu <[email protected]>

* variable/function naming

* address review comments: do not change the service_catalog API; call oci_catalog directly to get the OS type for an image.

* Update sky/clouds/oci.py

Co-authored-by: Zhanghao Wu <[email protected]>

* Update sky/clouds/oci.py

Co-authored-by: Zhanghao Wu <[email protected]>

* Update sky/clouds/oci.py

Co-authored-by: Zhanghao Wu <[email protected]>

* address review comments

---------

Co-authored-by: Zhanghao Wu <[email protected]>

* Apply suggestions from code review

Co-authored-by: Tian Xia <[email protected]>

* fix: typing.cast

* add TODOs for future function migration

* remove dependencies wording to reduce ambiguity

* temporarily add github actions

---------

Co-authored-by: Romil Bhardwaj <[email protected]>
Co-authored-by: Zhanghao Wu <[email protected]>
Co-authored-by: yika-luo <[email protected]>
Co-authored-by: Yika Luo <[email protected]>
Co-authored-by: Kote Mushegiani <[email protected]>
Co-authored-by: Zongheng Yang <[email protected]>
Co-authored-by: David Tran <[email protected]>
Co-authored-by: David Tran <[email protected]>
Co-authored-by: Tian Xia <[email protected]>
Co-authored-by: Hysun He <[email protected]>
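
The DAG-related bullets above (explicit flow definition, `add_edge`,
downstream/edge terminology, a `tasks` property, generated task names, and
using the networkx API instead of a duplicated `self.edges`) outline the new
user interface. A minimal, hypothetical Python sketch of that idea follows;
the class and method names mirror the bullets, and the actual SkyPilot
implementation may differ:

    # Hypothetical sketch only; real SkyPilot classes/signatures may differ.
    import time

    import networkx as nx


    class Task:
        def __init__(self, name=None):
            # "generate task.name if not given": fall back to a timestamp.
            self.name = name or f'task-{int(time.time())}'


    class Dag:
        def __init__(self):
            # networkx is the single source of truth for nodes and edges
            # (no duplicated self.edges list).
            self._graph = nx.DiGraph()

        @property
        def tasks(self):
            return list(self._graph.nodes)

        def add(self, task):
            self._graph.add_node(task)

        def add_edge(self, upstream, downstream):
            # The upstream task must finish before the downstream task starts.
            self._graph.add_edge(upstream, downstream)

        def downstream(self, task):
            return list(self._graph.successors(task))

        def __repr__(self):
            edges = ', '.join(f'{u.name} -> {v.name}'
                              for u, v in self._graph.edges)
            return f'Dag(tasks={[t.name for t in self.tasks]}, edges=[{edges}])'


    train, evaluate = Task('train'), Task('eval')
    dag = Dag()
    dag.add(train)
    dag.add(evaluate)
    dag.add_edge(train, evaluate)  # 'eval' runs after 'train'.
    print(dag)  # Dag(tasks=['train', 'eval'], edges=[train -> eval])
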
11 people authored Oct 21, 2024
1 parent 340f384 commit 7d93b75
Showing 83 changed files with 2,327 additions and 860 deletions.
2 changes: 2 additions & 0 deletions .github/workflows/format.yml
@@ -7,10 +7,12 @@ on:
branches:
- master
- 'releases/**'
- advanced-dag
pull_request:
branches:
- master
- 'releases/**'
- advanced-dag
merge_group:

jobs:
2 changes: 2 additions & 0 deletions .github/workflows/mypy-generic.yml
@@ -9,10 +9,12 @@ on:
branches:
- master
- 'releases/**'
- advanced-dag
pull_request:
branches:
- master
- 'releases/**'
- advanced-dag
merge_group:

jobs:
2 changes: 2 additions & 0 deletions .github/workflows/mypy.yml
@@ -7,10 +7,12 @@ on:
branches:
- master
- 'releases/**'
- advanced-dag
pull_request:
branches:
- master
- 'releases/**'
- advanced-dag
jobs:
mypy:
runs-on: ubuntu-latest
2 changes: 2 additions & 0 deletions .github/workflows/pylint.yml
@@ -7,10 +7,12 @@ on:
branches:
- master
- 'releases/**'
- advanced-dag
pull_request:
branches:
- master
- 'releases/**'
- advanced-dag
merge_group:

jobs:
2 changes: 2 additions & 0 deletions .github/workflows/pytest-generic.yml
@@ -8,10 +8,12 @@ on:
branches:
- master
- 'releases/**'
- advanced-dag
pull_request:
branches:
- master
- 'releases/**'
- advanced-dag
merge_group:

jobs:
2 changes: 2 additions & 0 deletions .github/workflows/pytest.yml
@@ -6,10 +6,12 @@ on:
branches:
- master
- 'releases/**'
- advanced-dag
pull_request:
branches:
- master
- 'releases/**'
- advanced-dag
merge_group:

jobs:
2 changes: 2 additions & 0 deletions .github/workflows/test-doc-build.yml
@@ -7,10 +7,12 @@ on:
branches:
- master
- 'releases/**'
- 'advanced-dag/**'
pull_request:
branches:
- master
- 'releases/**'
- 'advanced-dag/**'
merge_group:

jobs:
2 changes: 2 additions & 0 deletions .github/workflows/test-poetry-build.yml
@@ -6,10 +6,12 @@ on:
branches:
- master
- 'releases/**'
- 'advanced-dag/**'
pull_request:
branches:
- master
- 'releases/**'
- 'advanced-dag/**'
merge_group:

jobs:
42 changes: 22 additions & 20 deletions README.md
@@ -26,30 +26,32 @@

----
:fire: *News* :fire:
- [Sep, 2024] Point, Launch and Serve **Llama 3.2** on Kubernetes or Any Cloud: [**example**](./llm/llama-3_2/)
- [Sep, 2024] Run and deploy [**Pixtral**](./llm/pixtral), the first open-source multimodal model from Mistral AI.
- [Jul, 2024] [**Finetune**](./llm/llama-3_1-finetuning/) and [**serve**](./llm/llama-3_1/) **Llama 3.1** on your infra
- [Jun, 2024] Reproduce **GPT** with [llm.c](https://github.com/karpathy/llm.c/discussions/481) on any cloud: [**guide**](./llm/gpt-2/)
- [Apr, 2024] Serve **Qwen-110B** on your infra: [**example**](./llm/qwen/)
- [Apr, 2024] Using **Ollama** to deploy quantized LLMs on CPUs and GPUs: [**example**](./llm/ollama/)
- [Feb, 2024] Deploying and scaling **Gemma** with SkyServe: [**example**](./llm/gemma/)
- [Feb, 2024] Serving **Code Llama 70B** with vLLM and SkyServe: [**example**](./llm/codellama/)
- [Dec, 2023] **Mixtral 8x7B**, a high quality sparse mixture-of-experts model, was released by Mistral AI! Deploy via SkyPilot on any cloud: [**example**](./llm/mixtral/)
- [Nov, 2023] Using **Axolotl** to finetune Mistral 7B on the cloud (on-demand and spot): [**example**](./llm/axolotl/)
- [Oct 2024] :tada: **SkyPilot crossed 1M+ downloads** :tada:: Thank you to our community! [**Twitter/X**](https://x.com/skypilot_org/status/1844770841718067638)
- [Sep 2024] Point, Launch and Serve **Llama 3.2** on Kubernetes or Any Cloud: [**example**](./llm/llama-3_2/)
- [Sep 2024] Run and deploy [**Pixtral**](./llm/pixtral), the first open-source multimodal model from Mistral AI.
- [Jun 2024] Reproduce **GPT** with [llm.c](https://github.com/karpathy/llm.c/discussions/481) on any cloud: [**guide**](./llm/gpt-2/)
- [Apr 2024] Serve [**Qwen-110B**](https://qwenlm.github.io/blog/qwen1.5-110b/) on your infra: [**example**](./llm/qwen/)
- [Apr 2024] Using [**Ollama**](https://github.com/ollama/ollama) to deploy quantized LLMs on CPUs and GPUs: [**example**](./llm/ollama/)
- [Feb 2024] Deploying and scaling [**Gemma**](https://blog.google/technology/developers/gemma-open-models/) with SkyServe: [**example**](./llm/gemma/)
- [Feb 2024] Serving [**Code Llama 70B**](https://ai.meta.com/blog/code-llama-large-language-model-coding/) with vLLM and SkyServe: [**example**](./llm/codellama/)
- [Dec 2023] [**Mixtral 8x7B**](https://mistral.ai/news/mixtral-of-experts/), a high quality sparse mixture-of-experts model, was released by Mistral AI! Deploy via SkyPilot on any cloud: [**example**](./llm/mixtral/)
- [Nov 2023] Using [**Axolotl**](https://github.com/OpenAccess-AI-Collective/axolotl) to finetune Mistral 7B on the cloud (on-demand and spot): [**example**](./llm/axolotl/)

**LLM Finetuning Cookbooks**: Finetuning Llama 2 / Llama 3.1 in your own cloud environment, privately: Llama 2 [**example**](./llm/vicuna-llama-2/) and [**blog**](https://blog.skypilot.co/finetuning-llama2-operational-guide/); Llama 3.1 [**example**](./llm/llama-3_1-finetuning/) and [**blog**](https://blog.skypilot.co/finetune-llama-3_1-on-your-infra/)

<details>
<summary>Archived</summary>

- [Apr, 2024] Serve and finetune [**Llama 3**](https://skypilot.readthedocs.io/en/latest/gallery/llms/llama-3.html) on any cloud or Kubernetes: [**example**](./llm/llama-3/)
- [Mar, 2024] Serve and deploy [**Databricks DBRX**](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm) on your infra: [**example**](./llm/dbrx/)
- [Feb, 2024] Speed up your LLM deployments with [**SGLang**](https://github.com/sgl-project/sglang) for 5x throughput on SkyServe: [**example**](./llm/sglang/)
- [Dec, 2023] Using [**LoRAX**](https://github.com/predibase/lorax) to serve 1000s of finetuned LLMs on a single instance in the cloud: [**example**](./llm/lorax/)
- [Sep, 2023] [**Mistral 7B**](https://mistral.ai/news/announcing-mistral-7b/), a high-quality open LLM, was released! Deploy via SkyPilot on any cloud: [**Mistral docs**](https://docs.mistral.ai/self-deployment/skypilot)
- [Sep, 2023] Case study: [**Covariant**](https://covariant.ai/) transformed AI development on the cloud using SkyPilot, delivering models 4x faster cost-effectively: [**read the case study**](https://blog.skypilot.co/covariant/)
- [Aug, 2023] **Finetuning Cookbook**: Finetuning Llama 2 in your own cloud environment, privately: [**example**](./llm/vicuna-llama-2/), [**blog post**](https://blog.skypilot.co/finetuning-llama2-operational-guide/)
- [July, 2023] Self-Hosted **Llama-2 Chatbot** on Any Cloud: [**example**](./llm/llama-2/)
- [June, 2023] Serving LLM 24x Faster On the Cloud [**with vLLM**](https://vllm.ai/) and SkyPilot: [**example**](./llm/vllm/), [**blog post**](https://blog.skypilot.co/serving-llm-24x-faster-on-the-cloud-with-vllm-and-skypilot/)
- [April, 2023] [SkyPilot YAMLs](./llm/vicuna/) for finetuning & serving the [Vicuna LLM](https://lmsys.org/blog/2023-03-30-vicuna/) with a single command!
- [Jul 2024] [**Finetune**](./llm/llama-3_1-finetuning/) and [**serve**](./llm/llama-3_1/) **Llama 3.1** on your infra
- [Apr 2024] Serve and finetune [**Llama 3**](https://skypilot.readthedocs.io/en/latest/gallery/llms/llama-3.html) on any cloud or Kubernetes: [**example**](./llm/llama-3/)
- [Mar 2024] Serve and deploy [**Databricks DBRX**](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm) on your infra: [**example**](./llm/dbrx/)
- [Feb 2024] Speed up your LLM deployments with [**SGLang**](https://github.com/sgl-project/sglang) for 5x throughput on SkyServe: [**example**](./llm/sglang/)
- [Dec 2023] Using [**LoRAX**](https://github.com/predibase/lorax) to serve 1000s of finetuned LLMs on a single instance in the cloud: [**example**](./llm/lorax/)
- [Sep 2023] [**Mistral 7B**](https://mistral.ai/news/announcing-mistral-7b/), a high-quality open LLM, was released! Deploy via SkyPilot on any cloud: [**Mistral docs**](https://docs.mistral.ai/self-deployment/skypilot)
- [Sep 2023] Case study: [**Covariant**](https://covariant.ai/) transformed AI development on the cloud using SkyPilot, delivering models 4x faster cost-effectively: [**read the case study**](https://blog.skypilot.co/covariant/)
- [Jul 2023] Self-Hosted **Llama-2 Chatbot** on Any Cloud: [**example**](./llm/llama-2/)
- [Jun 2023] Serving LLM 24x Faster On the Cloud [**with vLLM**](https://vllm.ai/) and SkyPilot: [**example**](./llm/vllm/), [**blog post**](https://blog.skypilot.co/serving-llm-24x-faster-on-the-cloud-with-vllm-and-skypilot/)
- [Apr 2023] [SkyPilot YAMLs](./llm/vicuna/) for finetuning & serving the [Vicuna LLM](https://lmsys.org/blog/2023-03-30-vicuna/) with a single command!

</details>

53 changes: 28 additions & 25 deletions docs/source/examples/syncing-code-artifacts.rst
@@ -46,31 +46,7 @@ VMs. The task is invoked under that working directory (so that it can call
scripts, access checkpoints, etc.).

.. note::

**Exclude files from syncing**

For large, multi-gigabyte workdirs, uploading may be slow because they
are synced to the remote VM(s). To exclude large files in
your workdir from being uploaded, add them to a :code:`.skyignore` file
under your workdir. :code:`.skyignore` follows RSYNC filter rules.

Example :code:`.skyignore` file:

.. code-block::

# Files that match pattern under ONLY CURRENT directory
/hello.py
/*.txt
/dir
# Files that match pattern under ALL directories
*.txt
hello.py
# Files that match pattern under a directory ./dir/
/dir/*.txt

Do NOT use ``.`` to indicate local directory (e.g. ``./hello.py``).

To exclude large files from being uploaded, see :ref:`exclude-uploading-files`.

.. note::

@@ -140,6 +116,33 @@ file_mount may be slow because they are processed by ``rsync``. Use
:ref:`SkyPilot bucket mounting <sky-storage>` to efficiently handle
large files.

.. _exclude-uploading-files:

Exclude uploading files
--------------------------------------
By default, SkyPilot uses your existing :code:`.gitignore` and :code:`.git/info/exclude` to exclude files from syncing.

Alternatively, you can use :code:`.skyignore` if you want to separate SkyPilot's syncing behavior from Git's.
If you use a :code:`.skyignore` file, SkyPilot will only exclude files based on that file without using the default Git files.

Any :code:`.skyignore` file under either your workdir or source paths of file_mounts is respected.

:code:`.skyignore` follows RSYNC filter rules, e.g.

.. code-block::

# Files that match pattern under CURRENT directory
/file.txt
/dir
/*.jar
/dir/*.jar
# Files that match pattern under ALL directories
*.jar
file.txt

Do _not_ use ``.`` to indicate local directory (e.g., instead of ``./file``, write ``/file``).

.. _downloading-files-and-artifacts:

Downloading files and artifacts
9 changes: 9 additions & 0 deletions docs/source/reference/config.rst
@@ -419,6 +419,15 @@ Available fields and semantics:
# Default: 'LOCAL_CREDENTIALS'.
remote_identity: LOCAL_CREDENTIALS
# Enable gVNIC (optional).
#
# Set to true to use gVNIC on GCP instances. gVNIC offers higher performance
# for multi-node clusters, but costs more.
# Reference: https://cloud.google.com/compute/docs/networking/using-gvnic
#
# Default: false.
enable_gvnic: false
# Advanced Azure configurations (optional).
# Apply to all new instances but not existing ones.
azure:
9 changes: 5 additions & 4 deletions docs/source/reference/kubernetes/kubernetes-deployment.rst
@@ -114,9 +114,9 @@ Deploying on Google Cloud GKE
# Example:
# gcloud container clusters get-credentials testcluster --region us-central1-c

3. [If using GPUs] If your GKE nodes have GPUs, you may need to
`manually install <https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/>`_
nvidia drivers. You can do so by deploying the daemonset

3. [If using GPUs] For GKE versions newer than 1.30.1-gke.115600, NVIDIA drivers are pre-installed and no additional setup is required. If you are using an older GKE version, you may need to
`manually install <https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers>`_
NVIDIA drivers for GPU support. You can do so by deploying the daemonset
depending on the GPU and OS on your nodes:

.. code-block:: console
@@ -133,7 +133,8 @@ Deploying on Google Cloud GKE
# For Ubuntu based nodes with L4 GPUs:
$ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded-R525.yaml

To verify if GPU drivers are set up, run ``kubectl describe nodes`` and verify that ``nvidia.com/gpu`` is listed under the ``Capacity`` section.

.. tip::

To verify if GPU drivers are set up, run ``kubectl describe nodes`` and verify that the ``nvidia.com/gpu`` resource is listed under the ``Capacity`` section.

4. Verify your kubernetes cluster is correctly set up for SkyPilot by running :code:`sky check`:

51 changes: 51 additions & 0 deletions docs/source/reference/kubernetes/kubernetes-getting-started.rst
@@ -119,6 +119,57 @@ Once your cluster administrator has :ref:`setup a Kubernetes cluster <kubernetes
$ kubectl config set-context --current --namespace=mynamespace

Viewing cluster status
----------------------

To view the status of all SkyPilot resources in the Kubernetes cluster, run :code:`sky status --k8s`.

Unlike :code:`sky status` which lists only the SkyPilot resources launched by the current user,
:code:`sky status --k8s` lists all SkyPilot resources in the Kubernetes cluster across all users.

.. code-block:: console
$ sky status --k8s
Kubernetes cluster state (context: mycluster)
SkyPilot clusters
USER NAME LAUNCHED RESOURCES STATUS
alice infer-svc-1 23 hrs ago 1x Kubernetes(cpus=1, mem=1, {'L4': 1}) UP
alice sky-jobs-controller-80b50983 2 days ago 1x Kubernetes(cpus=4, mem=4) UP
alice sky-serve-controller-80b50983 23 hrs ago 1x Kubernetes(cpus=4, mem=4) UP
bob dev 1 day ago 1x Kubernetes(cpus=2, mem=8, {'H100': 1}) UP
bob multinode-dev 1 day ago 2x Kubernetes(cpus=2, mem=2) UP
bob sky-jobs-controller-2ea485ea 2 days ago 1x Kubernetes(cpus=4, mem=4) UP
Managed jobs
In progress tasks: 1 STARTING
USER ID TASK NAME RESOURCES SUBMITTED TOT. DURATION JOB DURATION #RECOVERIES STATUS
alice 1 - eval 1x[CPU:1+] 2 days ago 49s 8s 0 SUCCEEDED
bob 4 - pretrain 1x[H100:4] 1 day ago 1h 1m 11s 1h 14s 0 SUCCEEDED
bob 3 - bigjob 1x[CPU:16] 1 day ago 1d 21h 11m 4s - 0 STARTING
bob 2 - failjob 1x[CPU:1+] 1 day ago 54s 9s 0 FAILED
bob 1 - shortjob 1x[CPU:1+] 2 days ago 1h 1m 19s 1h 16s 0 SUCCEEDED

You can also inspect the real-time GPU usage on the cluster with :code:`sky show-gpus --cloud kubernetes`.

.. code-block:: console
$ sky show-gpus --cloud kubernetes
Kubernetes GPUs
GPU QTY_PER_NODE TOTAL_GPUS TOTAL_FREE_GPUS
L4 1, 2, 4 12 12
H100 1, 2, 4, 8 16 16
Kubernetes per node GPU availability
NODE_NAME GPU_NAME TOTAL_GPUS FREE_GPUS
my-cluster-0 L4 4 4
my-cluster-1 L4 4 4
my-cluster-2 L4 2 2
my-cluster-3 L4 2 2
my-cluster-4 H100 8 8
my-cluster-5 H100 8 8

.. _kubernetes-custom-images:

Using Custom Images
44 changes: 11 additions & 33 deletions docs/source/reference/kubernetes/kubernetes-ports.rst
@@ -59,40 +59,18 @@ To restrict your services to be accessible only within the cluster, you can set

Depending on your cloud, set the appropriate annotation in the SkyPilot config file (``~/.sky/config.yaml``):

.. tab-set::

.. tab-item:: GCP
:sync: internal-lb-gke

.. code-block:: yaml
# ~/.sky/config.yaml
kubernetes:
custom_metadata:
annotations:
networking.gke.io/load-balancer-type: "Internal"
.. tab-item:: AWS
:sync: internal-lb-aws

.. code-block:: yaml
# ~/.sky/config.yaml
kubernetes:
custom_metadata:
annotations:
service.beta.kubernetes.io/aws-load-balancer-internal: "true"
.. tab-item:: Azure
:sync: internal-lb-azure

.. code-block:: yaml
.. code-block:: yaml
# ~/.sky/config.yaml
kubernetes:
custom_metadata:
annotations:
service.beta.kubernetes.io/azure-load-balancer-internal: "true"
# ~/.sky/config.yaml
kubernetes:
custom_metadata:
annotations:
# For GCP/GKE
networking.gke.io/load-balancer-type: "Internal"
# For AWS/EKS
service.beta.kubernetes.io/aws-load-balancer-internal: "true"
# For Azure/AKS
service.beta.kubernetes.io/azure-load-balancer-internal: "true"

.. _kubernetes-ingress: