
[Docs] Add docs for Sky Serve #2794

Merged: 19 commits, Dec 11, 2023
4 changes: 3 additions & 1 deletion docs/source/examples/service-yaml-spec.rst
@@ -33,7 +33,9 @@ Available fields:
# and the default initial delay, you can use the following syntax:
readiness_probe: /v1/models

# Replica autoscaling policy (required). This describe how SkyServe autoscale
# One of the two following fields (replica_policy or replicas) is required.

# Replica autoscaling policy. This describes how SkyServe autoscales
# your service based on the QPS (queries per second) of your service.
replica_policy:
# Minimum number of replicas (required).
144 changes: 77 additions & 67 deletions docs/source/examples/sky-serve.rst
@@ -3,20 +3,44 @@
Sky Serve
=========

GPU availability has become a critical bottleneck for many AI services. With Sky Serve, we offer a lightweight control plane that simplifies deployment across many cloud providers. By consolidating availability and pricing data across clouds, we ensure **timely execution at optimal costs**, addressing the complexities of managing resources in a multi-cloud environment.
Sky Serve is SkyPilot's serving library. Sky Serve takes an existing serving
framework and deploys it across one or more regions or clouds.

.. * Serve on scarce resources (e.g., A100; spot) with **reduced costs and increased availability**

Why Sky Serve?

* Allocate scarce resources (e.g., A100) **across regions and clouds**
* Autoscale your endpoint deployment with load balancing
* Manage your multi-cloud resources with a single control plane
* **Bring any serving framework** (vLLM, TGI, FastAPI, ...) and scale it across regions/clouds
* **Reduce costs and increase availability** of service replicas by leveraging multiple/cheaper locations and hardware
* **Out-of-the-box load-balancing and autoscaling** of service replicas
* Manage multi-cloud, multi-region deployments with a single control plane
* Everything is launched inside your cloud accounts and VPCs

.. * Allocate scarce resources (e.g., A100) **across regions and clouds**
.. * Autoscale your endpoint deployment with load balancing
.. * Manage your multi-cloud resources with a single control plane

How it works

- Each service gets an endpoint that automatically redirects requests to its underlying replicas.
- The replicas of the same service can run in different regions and clouds — reducing cloud costs and increasing availability.
- Sky Serve transparently handles the load balancing, failover, and autoscaling of the serving replicas.

Sky Serve provides a simple CLI interface to deploy and manage your services. It features a simple YAML spec to describe your service (We'll refer to such YAML as a **Service YAML** in the following), and a centralized controller to manage the deployment.
.. GPU availability has become a critical bottleneck for many AI services. With Sky
.. Serve, we offer a lightweight control plane that simplifies deployment across
.. many cloud providers. By consolidating availability and pricing data across
.. clouds, we ensure **timely execution at optimal costs**, addressing the
.. complexities of managing resources in a multi-cloud environment.

Service YAML
------------

To spin up a service, you can simply reuse your task YAML with the two following requirements:
Sky Serve provides a simple CLI interface to deploy and manage your services. It
features a simple YAML spec to describe your services (referred to as a *service
YAML* in the following) and a centralized controller to manage the deployments.

Hello, Sky Serve!
-----------------

Here we will walk through an example of deploying a simple HTTP server with Sky Serve. To spin up a service, you can simply reuse your task YAML with the following two requirements:

#. An HTTP endpoint and the port on which it listens;
#. An extra :code:`service` section in your task YAML to describe the service configuration.
@@ -56,9 +80,9 @@ Notice that task YAML already have a running HTTP endpoint at 8080, and exposed

.. code-block:: yaml

# http-server.yaml
# hello-sky-serve.yaml
service:
readiness_probe: /health
readiness_probe: /
replicas: 2

resources:
Expand All @@ -69,41 +93,29 @@ Notice that task YAML already have a running HTTP endpoint at 8080, and exposed

run: python -m http.server 8080

You can found more configurations in :ref:`here <service-yaml-spec>`. This example will spin up two replicas of the service, each listening on port 8080. The service is considered ready when it responds to :code:`GET /health` with a 200 status code. You can customize the readiness probe by specifying a different path in the :code:`readiness_probe` field. By calling:
You can find more configuration options :ref:`here <service-yaml-spec>`. This example will spin up two replicas of the service, each listening on port 8080. The service is considered ready when it responds to :code:`GET /` with a 200 status code. You can customize the readiness probe by specifying a different path in the :code:`readiness_probe` field. By calling:
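If you need more control than the shorthand path syntax, the :code:`readiness_probe` field also accepts an expanded mapping form. The field names below follow the :ref:`service YAML spec <service-yaml-spec>`; treat this as a sketch and consult that page for the authoritative syntax and defaults:

```yaml
service:
  readiness_probe:
    # Path the load balancer probes with GET requests.
    path: /health
    # Seconds to wait before the first probe (illustrative value;
    # see the service YAML spec for the actual default).
    initial_delay_seconds: 1200
  replicas: 2
```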

.. code-block:: console

$ sky serve up http-server.yaml
$ sky serve up hello-sky-serve.yaml

Sky Serve will start a centralized controller/load balancer and deploy the service to the cloud with the best price/performance ratio. It will also monitor the service status and re-launch a new replica if one of them fails.

Under the hood, :code:`sky serve up`:

#. Launches a controller which handles autoscaling, monitoring and load balancing;
#. Returns an Service Endpoint which will be used to accept traffic;
#. Returns a Service Endpoint which will be used to accept traffic;
#. Meanwhile, the controller provisions replica VMs which later run the services;
#. Once any replica is ready, requests sent to the Service Endpoint will be **HTTP-redirected** to one of the replica endpoints.

After the controller is provisioned, you'll see:
After the controller is provisioned, you'll see the following in :code:`sky serve status` output:

.. code-block:: console
.. image:: ../images/sky-serve-status-output-provisioning.png
:width: 800
:align: center
:alt: sky-serve-status-output-provisioning

Service name: sky-service-e4fb
Endpoint URL: <endpoint-url>
To see detailed info: sky serve status sky-service-e4fb [--endpoint]
To teardown the service: sky serve down sky-service-e4fb

To see logs of a replica: sky serve logs sky-service-e4fb [REPLICA_ID]
To see logs of load balancer: sky serve logs --load-balancer sky-service-e4fb
To see logs of controller: sky serve logs --controller sky-service-e4fb

To monitor replica status: watch -n10 sky serve status sky-service-e4fb
To send a test request: curl -L <endpoint-url>

SkyServe is spinning up your service now.
The replicas should be ready within a short time.

Once any of the replicas becomes ready to serve, you can start sending requests to :code:`<endpoint-url>`. You can use :code:`watch -n10 sky serve status sky-service-e4fb` to monitor the latest status of the service. Once its status becomes :code:`READY`, you can start sending requests to :code:`<endpoint-url>`:
Once any of the replicas becomes ready to serve, you can start sending requests to :code:`<endpoint-url>`. You can use :code:`watch -n10 sky serve status sky-service-b0a0` to monitor the latest status of the service. Once its status becomes :code:`READY`, you can start sending requests to :code:`<endpoint-url>`:

.. code-block:: console

@@ -121,6 +133,23 @@ Once any of the replicas becomes ready to serve

By default, :code:`curl` does not follow redirects, so it won't print the content of the redirected page. Since we are using HTTP redirects, you need to use :code:`curl -L <endpoint-url>`.
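The redirect behavior described above can be reproduced locally with a short Python sketch. The toy server and the :code:`/replica` path here are stand-ins for illustration, not part of Sky Serve:

```python
import http.server
import threading
import urllib.request

class Handler(http.server.BaseHTTPRequestHandler):
    """Mimics the service endpoint: '/' redirects to a 'replica' path."""
    def do_GET(self):
        if self.path == "/replica":
            body = b"hello from a replica"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            # HTTP redirect, like the Sky Serve load balancer.
            self.send_response(302)
            self.send_header("Location", "/replica")
            self.end_headers()

    def log_message(self, *args):  # silence request logging
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_address[1]}/"
# urllib follows the 302 automatically -- the equivalent of `curl -L`.
content = urllib.request.urlopen(url).read().decode()
print(content)
server.shutdown()
```

A plain :code:`curl` against the same URL would print nothing, since the 302 response body is empty.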

Member:

How about adding a section here, e.g., "Example: Text Generation Inference (TGI)", which can consist of the two snippets in https://docs.google.com/document/d/1vVmzLF-EkG3Moj-q47DQBGvFipK4PNfkz0V6LyaPstE/edit#heading=h.gntyowdq9a18 or https://docs.google.com/document/d/1vVmzLF-EkG3Moj-q47DQBGvFipK4PNfkz0V6LyaPstE/edit#heading=h.gr15nxiws63p

The value is it's much shorter --> easier to adapt. It also quickly shows the idea of one endpoint being backed by multiple regions/clouds' replicas.

Can discuss whether to put it in the opening section of this page, like User Docs does.

Collaborator (Author):

If we are adding this, do you think we still need the vicuna example? Not sure if it is a little bit redundant if we include TGI...

Member:

Some redundancy is fine. The main motivation is to make the very first impression about real, useful AI serving. Currently it's HTTP server.

How about we add a "Quickstart: TGI service" section (or change it to vLLM/FastChat etc.), but keep the user doc's concise formatting -- 1 snippet showing YAML, 1 snippet showing service status, then add 1 snippet showing how to CURL it correctly. With minimal text throughout.

Sky Serve Architecture
----------------------

.. image:: ../images/sky-serve-architecture.png
:width: 800
:align: center
:alt: Sky Serve Architecture

Sky Serve has a centralized controller VM that manages the deployment of your service. Each service will have a process group to manage its replicas and route traffic to them.

It is composed of the following components:

#. **Controller**: The controller will monitor the status of the replicas and re-launch a new replica if one of them fails. It also autoscales the number of replicas if autoscaling config is set (see :ref:`Service YAML spec <service-yaml-spec>` for more information).
#. **Load Balancer**: The load balancer will route the traffic to all ready replicas. It is a lightweight HTTP server that listens on the service endpoint and **HTTP-redirects** the requests to one of the replicas.

All of the process groups share a single controller VM. The controller VM is launched in the cloud with the best price/performance ratio. You can also :ref:`customize the controller resources <customizing-sky-serve-controller-resources>` based on your needs.

An end-to-end LLM example
-------------------------

@@ -147,7 +176,7 @@ Below we show an end-to-end example of deploying a LLM model with Sky Serve. We'

run: |
conda activate chatbot

echo 'Starting controller...'
python -u -m fastchat.serve.controller > ~/controller.log 2>&1 &
sleep 10
Expand All @@ -163,47 +192,38 @@ Below we show an end-to-end example of deploying a LLM model with Sky Serve. We'
python -u -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8080 | tee ~/openai_api_server.log

envs:
MODEL_SIZE: 13
MODEL_SIZE: 7

By adding a :code:`service` section to the YAML:
The above SkyPilot task YAML launches an OpenAI-compatible API endpoint serving a 7B Vicuna model. This YAML can be used with :code:`sky launch` to launch a single replica of the service. By adding a :code:`service` section to the YAML, we can scale it to multiple replicas across multiple regions/clouds:

.. code-block:: yaml

# vicuna.yaml
service:
readiness_probe: /v1/models
replicas: 2
replicas: 3

resources:
ports: 8080
# Here goes other resources config

# Here goes other task config

Now you have an Service YAML that can be used with Sky Serve! Simply run :code:`sky serve up vicuna.yaml -n vicuna` to deploy the service (use :code:`-n` to give your service a name!). After a while, you'll see:
Now you have a Service YAML that can be used with Sky Serve! Simply run :code:`sky serve up vicuna.yaml -n vicuna` to deploy the service (use :code:`-n` to give your service a name!). After a while, there will be an OpenAI Compatible API endpoint ready to accept traffic (:code:`44.201.113.28:30001` in the following example):

.. code-block:: console
.. image:: ../images/sky-serve-status-vicuna-ready.png
:width: 800
:align: center
:alt: sky-serve-status-vicuna-ready

Member:

We should add a CURL based command like the user doc. Does something like this work

Then it’s ready to accept traffic!

$ curl -L Y.Y.Y.Y:8082/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
{"generated_text":"\n nobody knows"}

Service name: vicuna
Endpoint URL: <vicuna-url>
To see detailed info: sky serve status vicuna [--endpoint]
To teardown the service: sky serve down vicuna

To see logs of a replica: sky serve logs vicuna [REPLICA_ID]
To see logs of load balancer: sky serve logs --load-balancer vicuna
To see logs of controller: sky serve logs --controller vicuna

To monitor replica status: watch -n10 sky serve status vicuna
To send a test request: curl -L <vicuna-url>

After a while, there will be an OpenAI Compatible API endpoint ready to serve at :code:`<vicuna-url>`. Try out by the following simple chatbot Python script:
Try it out with the following simple chatbot Python script:

.. code-block:: python

import openai

stream = True
model = 'vicuna-13b-v1.3' # This is aligned with the MODEL_SIZE env in the YAML
model = 'vicuna-7b-v1.3' # This is aligned with the MODEL_SIZE env in the YAML
init_prompt = 'You are a helpful assistant.'
history = [{'role': 'system', 'content': init_prompt}]
endpoint = input('Endpoint: ')
@@ -241,22 +261,10 @@ See all running services:

$ sky serve status

.. code-block:: console

Services
NAME UPTIME STATUS REPLICAS ENDPOINT
llama2-spot 2h 29m 36s READY 1/2 34.238.42.4:30001
vicuna 3h 5m 56s READY 2/2 34.238.42.4:30003
http-server 3h 20m 50s READY 2/2 34.238.42.4:30002

Service Replicas
SERVICE_NAME ID IP LAUNCHED RESOURCES STATUS REGION
llama2-spot 1 34.90.186.40 2 hrs ago 1x GCP([Spot]{'A100': 1})) READY europe-west4
llama2-spot 2 34.147.124.113 2 hrs ago 1x GCP([Spot]{'A100': 1})) READY europe-west4
vicuna 1 35.247.122.252 3 hrs ago 1x GCP({'A100': 1})) READY us-west1
vicuna 2 34.141.221.32 3 hrs ago 1x GCP({'A100': 1})) READY europe-west4
http-server 1 3.95.5.141 3 hrs ago 1x AWS(vCPU=2) READY us-east-1
http-server 2 54.175.170.174 3 hrs ago 1x AWS(vCPU=2) READY us-east-1
.. image:: ../images/sky-serve-status-full.png
:width: 800
:align: center
:alt: sky-serve-status-full

Stream the logs of a service:

@@ -287,6 +295,8 @@ Thus, **no user action is needed** to manage its lifecycle.

You can see the controller with :code:`sky status` and refresh its status by using the :code:`-r/--refresh` flag.

.. _customizing-sky-serve-controller-resources:

Customizing sky serve controller resources
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Binary file modified docs/source/images/sky-serve-architecture.png
Binary file added docs/source/images/sky-serve-status-full.png
2 changes: 1 addition & 1 deletion docs/source/index.rst
@@ -112,7 +112,7 @@ Documentation

.. toctree::
:maxdepth: 1
:caption: Spin up Services
:caption: Multi-Cloud Serving

examples/sky-serve
examples/service-yaml-spec