
[Docs] Add docs for Sky Serve #2794

Merged: 19 commits, Dec 11, 2023
4 changes: 3 additions & 1 deletion docs/source/examples/service-yaml-spec.rst
@@ -33,7 +33,9 @@ Available fields:
# and the default initial delay, you can use the following syntax:
readiness_probe: /v1/models

# Replica autoscaling policy (required). This describe how SkyServe autoscale
# One of the two following fields (replica_policy or replicas) is required.

# Replica autoscaling policy. This describes how SkyServe autoscales
# your service based on the QPS (queries per second) of your service.
replica_policy:
# Minimum number of replicas (required).
144 changes: 77 additions & 67 deletions docs/source/examples/sky-serve.rst
@@ -3,20 +3,44 @@
Sky Serve
=========

GPU availability has become a critical bottleneck for many AI services. With Sky Serve, we offer a lightweight control plane that simplifies deployment across many cloud providers. By consolidating availability and pricing data across clouds, we ensure **timely execution at optimal costs**, addressing the complexities of managing resources in a multi-cloud environment.
Sky Serve is SkyPilot's serving library. Sky Serve takes an existing serving
framework and deploys it across one or more regions or clouds.

.. * Serve on scarce resources (e.g., A100; spot) with **reduced costs and increased availability**

Why Sky Serve?

* Allocate scarce resources (e.g., A100) **across regions and clouds**
* Autoscale your endpoint deployment with load balancing
* Manage your multi-cloud resources with a single control plane
* **Bring any serving framework** (vLLM, TGI, FastAPI, ...) and scale it across regions/clouds
* **Reduce costs and increase availability** of service replicas by leveraging multiple/cheaper locations and hardware
* **Out-of-the-box load-balancing and autoscaling** of service replicas
* Manage multi-cloud, multi-region deployments with a single control plane
* Everything is launched inside your cloud accounts and VPCs

.. * Allocate scarce resources (e.g., A100) **across regions and clouds**
.. * Autoscale your endpoint deployment with load balancing
.. * Manage your multi-cloud resources with a single control plane

How it works

- Each service gets an endpoint that automatically redirects requests to its underlying replicas.
- The replicas of the same service can run in different regions and clouds — reducing cloud costs and increasing availability.
- Sky Serve transparently handles the load balancing, failover, and autoscaling of the serving replicas.

Sky Serve provides a simple CLI interface to deploy and manage your services. It features a simple YAML spec to describe your service (We'll refer to such YAML as a **Service YAML** in the following), and a centralized controller to manage the deployment.
.. GPU availability has become a critical bottleneck for many AI services. With Sky
.. Serve, we offer a lightweight control plane that simplifies deployment across
.. many cloud providers. By consolidating availability and pricing data across
.. clouds, we ensure **timely execution at optimal costs**, addressing the
.. complexities of managing resources in a multi-cloud environment.

Service YAML
------------

To spin up a service, you can simply reuse your task YAML with the two following requirements:
Sky Serve provides a simple CLI interface to deploy and manage your services. It
features a simple YAML spec to describe your services (referred to as a *service
YAML* in the following) and a centralized controller to manage the deployments.

Hello, Sky Serve!
-----------------

Here we will walk through an example of deploying a simple HTTP server with Sky Serve. To spin up a service, you can simply reuse your task YAML with the following two requirements:

#. An HTTP endpoint and the port on which it listens;
#. An extra :code:`service` section in your task YAML to describe the service configuration.
@@ -56,9 +80,9 @@ Notice that task YAML already have a running HTTP endpoint at 8080, and exposed

.. code-block:: yaml

# http-server.yaml
# hello-sky-serve.yaml
service:
readiness_probe: /health
readiness_probe: /
replicas: 2

resources:
Expand All @@ -69,41 +93,29 @@ Notice that task YAML already have a running HTTP endpoint at 8080, and exposed

run: python -m http.server 8080

You can found more configurations in :ref:`here <service-yaml-spec>`. This example will spin up two replicas of the service, each listening on port 8080. The service is considered ready when it responds to :code:`GET /health` with a 200 status code. You can customize the readiness probe by specifying a different path in the :code:`readiness_probe` field. By calling:
You can find more configuration options :ref:`here <service-yaml-spec>`. This example will spin up two replicas of the service, each listening on port 8080. The service is considered ready when it responds to :code:`GET /` with a 200 status code. You can customize the readiness probe by specifying a different path in the :code:`readiness_probe` field. By calling:
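If you need more control than the shorthand path syntax, the :code:`readiness_probe` field also accepts an expanded mapping form. The field names below follow the :ref:`service YAML spec <service-yaml-spec>`; treat this as a sketch and consult that page for the authoritative syntax and defaults:

```yaml
service:
  readiness_probe:
    # Path the load balancer probes with GET requests.
    path: /health
    # Seconds to wait before the first probe (illustrative value;
    # see the service YAML spec for the actual default).
    initial_delay_seconds: 1200
  replicas: 2
```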

.. code-block:: console

$ sky serve up http-server.yaml
$ sky serve up hello-sky-serve.yaml

Sky Serve will start a centralized controller/load balancer and deploy the service to the cloud with the best price/performance ratio. It will also monitor the service status and re-launch a new replica if one of them fails.

Under the hood, :code:`sky serve up`:

#. Launches a controller which handles autoscaling, monitoring and load balancing;
#. Returns an Service Endpoint which will be used to accept traffic;
#. Returns a Service Endpoint which will be used to accept traffic;
#. Meanwhile, the controller provisions replica VMs which later run the services;
#. Once any replica is ready, requests sent to the Service Endpoint will be **HTTP-redirected** to one of the replica endpoints.

After the controller is provisioned, you'll see:
After the controller is provisioned, you'll see the following in :code:`sky serve status` output:

.. code-block:: console
.. image:: ../images/sky-serve-status-output-provisioning.png
:width: 800
:align: center
:alt: sky-serve-status-output-provisioning

Service name: sky-service-e4fb
Endpoint URL: <endpoint-url>
To see detailed info: sky serve status sky-service-e4fb [--endpoint]
To teardown the service: sky serve down sky-service-e4fb

To see logs of a replica: sky serve logs sky-service-e4fb [REPLICA_ID]
To see logs of load balancer: sky serve logs --load-balancer sky-service-e4fb
To see logs of controller: sky serve logs --controller sky-service-e4fb

To monitor replica status: watch -n10 sky serve status sky-service-e4fb
To send a test request: curl -L <endpoint-url>

SkyServe is spinning up your service now.
The replicas should be ready within a short time.

Once any of the replicas becomes ready to serve, you can start sending requests to :code:`<endpoint-url>`. You can use :code:`watch -n10 sky serve status sky-service-e4fb` to monitor the latest status of the service. Once its status becomes :code:`READY`, you can start sending requests to :code:`<endpoint-url>`:
Once any of the replicas becomes ready to serve, you can start sending requests to :code:`<endpoint-url>`. You can use :code:`watch -n10 sky serve status sky-service-b0a0` to monitor the latest status of the service. Once its status becomes :code:`READY`, you can start sending requests to :code:`<endpoint-url>`:

.. code-block:: console

@@ -121,6 +133,23 @@ Once any of the replicas becomes ready to serve

By default, :code:`curl` does not follow redirects, so it won't print the content of the redirected page. Since we are using HTTP redirects, you need to use :code:`curl -L <endpoint-url>`.
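The redirect behavior described above can be reproduced locally with a short Python sketch. The toy server and the :code:`/replica` path here are stand-ins for illustration, not part of Sky Serve:

```python
import http.server
import threading
import urllib.request

class Handler(http.server.BaseHTTPRequestHandler):
    """Mimics the service endpoint: '/' redirects to a 'replica' path."""
    def do_GET(self):
        if self.path == "/replica":
            body = b"hello from a replica"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            # HTTP redirect, like the Sky Serve load balancer.
            self.send_response(302)
            self.send_header("Location", "/replica")
            self.end_headers()

    def log_message(self, *args):  # silence request logging
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_address[1]}/"
# urllib follows the 302 automatically -- the equivalent of `curl -L`.
content = urllib.request.urlopen(url).read().decode()
print(content)
server.shutdown()
```

A plain :code:`curl` against the same URL would print nothing, since the 302 response body is empty.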

Member:

How about adding a section here, e.g., "Example: Text Generation Inference (TGI)", which can consist of the two snippets in https://docs.google.com/document/d/1vVmzLF-EkG3Moj-q47DQBGvFipK4PNfkz0V6LyaPstE/edit#heading=h.gntyowdq9a18 or https://docs.google.com/document/d/1vVmzLF-EkG3Moj-q47DQBGvFipK4PNfkz0V6LyaPstE/edit#heading=h.gr15nxiws63p

The value is it's much shorter --> easier to adapt. It also quickly shows the idea of one endpoint being backed by multiple regions/clouds' replicas.

Can discuss whether to put it in the opening section of this page, like User Docs does.

Collaborator (Author):

If we are adding this, do you think we still need the vicuna example? Not sure if it is a little bit redundant if we include TGI...

Member:

Some redundancy is fine. The main motivation is to make the very first impression about real, useful AI serving. Currently it's HTTP server.

How about we add a "Quickstart: TGI service" section (or change it to vLLM/FastChat etc.), but keep the user doc's concise formatting -- 1 snippet showing YAML, 1 snippet showing service status, then add 1 snippet showing how to CURL it correctly. With minimal text throughout.

Sky Serve Architecture
----------------------

.. image:: ../images/sky-serve-architecture.png
:width: 800
:align: center
:alt: Sky Serve Architecture

Sky Serve has a centralized controller VM that manages the deployment of your service. Each service will have a process group to manage its replicas and route traffic to them.

It is composed of the following components:

#. **Controller**: The controller will monitor the status of the replicas and re-launch a new replica if one of them fails. It also autoscales the number of replicas if autoscaling config is set (see :ref:`Service YAML spec <service-yaml-spec>` for more information).
#. **Load Balancer**: The load balancer will route the traffic to all ready replicas. It is a lightweight HTTP server that listens on the service endpoint and **HTTP-redirects** the requests to one of the replicas.

All of the process groups share a single controller VM. The controller VM is launched in the cloud with the best price/performance ratio. You can also :ref:`customize the controller resources <customizing-sky-serve-controller-resources>` based on your needs.

An end-to-end LLM example
-------------------------

@@ -147,7 +176,7 @@ Below we show an end-to-end example of deploying a LLM model with Sky Serve. We'

run: |
conda activate chatbot

echo 'Starting controller...'
python -u -m fastchat.serve.controller > ~/controller.log 2>&1 &
sleep 10
Expand All @@ -163,47 +192,38 @@ Below we show an end-to-end example of deploying a LLM model with Sky Serve. We'
python -u -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8080 | tee ~/openai_api_server.log

envs:
MODEL_SIZE: 13
MODEL_SIZE: 7

By adding a :code:`service` section to the YAML:
The above SkyPilot task YAML launches an OpenAI-compatible API endpoint serving a 7B Vicuna model. This YAML can be used with :code:`sky launch` to launch a single replica of the service. By adding a :code:`service` section to the YAML, we can scale it to multiple replicas across multiple regions/clouds:

.. code-block:: yaml

# vicuna.yaml
service:
readiness_probe: /v1/models
replicas: 2
replicas: 3

resources:
ports: 8080
# Here goes other resources config

# Here goes other task config

Now you have an Service YAML that can be used with Sky Serve! Simply run :code:`sky serve up vicuna.yaml -n vicuna` to deploy the service (use :code:`-n` to give your service a name!). After a while, you'll see:
Now you have a Service YAML that can be used with Sky Serve! Simply run :code:`sky serve up vicuna.yaml -n vicuna` to deploy the service (use :code:`-n` to give your service a name!). After a while, there will be an OpenAI Compatible API endpoint ready to accept traffic (:code:`44.201.113.28:30001` in the following example):

.. code-block:: console
.. image:: ../images/sky-serve-status-vicuna-ready.png
:width: 800
:align: center
:alt: sky-serve-status-vicuna-ready

Member:

We should add a CURL based command like the user doc. Does something like this work

Then it’s ready to accept traffic!

$ curl -L Y.Y.Y.Y:8082/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
{"generated_text":"\n nobody knows"}

Service name: vicuna
Endpoint URL: <vicuna-url>
To see detailed info: sky serve status vicuna [--endpoint]
To teardown the service: sky serve down vicuna

To see logs of a replica: sky serve logs vicuna [REPLICA_ID]
To see logs of load balancer: sky serve logs --load-balancer vicuna
To see logs of controller: sky serve logs --controller vicuna

To monitor replica status: watch -n10 sky serve status vicuna
To send a test request: curl -L <vicuna-url>

After a while, there will be an OpenAI Compatible API endpoint ready to serve at :code:`<vicuna-url>`. Try out by the following simple chatbot Python script:
Try it out with the following simple chatbot Python script:

.. code-block:: python

import openai

stream = True
model = 'vicuna-13b-v1.3' # This is aligned with the MODEL_SIZE env in the YAML
model = 'vicuna-7b-v1.3' # This is aligned with the MODEL_SIZE env in the YAML
init_prompt = 'You are a helpful assistant.'
history = [{'role': 'system', 'content': init_prompt}]
endpoint = input('Endpoint: ')
@@ -241,22 +261,10 @@ See all running services:

$ sky serve status

.. code-block:: console

Services
NAME UPTIME STATUS REPLICAS ENDPOINT
llama2-spot 2h 29m 36s READY 1/2 34.238.42.4:30001
vicuna 3h 5m 56s READY 2/2 34.238.42.4:30003
http-server 3h 20m 50s READY 2/2 34.238.42.4:30002

Service Replicas
SERVICE_NAME ID IP LAUNCHED RESOURCES STATUS REGION
llama2-spot 1 34.90.186.40 2 hrs ago 1x GCP([Spot]{'A100': 1})) READY europe-west4
llama2-spot 2 34.147.124.113 2 hrs ago 1x GCP([Spot]{'A100': 1})) READY europe-west4
vicuna 1 35.247.122.252 3 hrs ago 1x GCP({'A100': 1})) READY us-west1
vicuna 2 34.141.221.32 3 hrs ago 1x GCP({'A100': 1})) READY europe-west4
http-server 1 3.95.5.141 3 hrs ago 1x AWS(vCPU=2) READY us-east-1
http-server 2 54.175.170.174 3 hrs ago 1x AWS(vCPU=2) READY us-east-1
.. image:: ../images/sky-serve-status-full.png
:width: 800
:align: center
:alt: sky-serve-status-full

Stream the logs of a service:

@@ -287,6 +295,8 @@ Thus, **no user action is needed** to manage its lifecycle.

You can see the controller with :code:`sky status` and refresh its status by using the :code:`-r/--refresh` flag.

.. _customizing-sky-serve-controller-resources:

Customizing sky serve controller resources
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Binary file modified docs/source/images/sky-serve-architecture.png
Binary file added docs/source/images/sky-serve-status-full.png
2 changes: 1 addition & 1 deletion docs/source/index.rst
@@ -112,7 +112,7 @@ Documentation

.. toctree::
:maxdepth: 1
:caption: Spin up Services
:caption: Multi-Cloud Serving

examples/sky-serve
examples/service-yaml-spec