diff --git a/README.md b/README.md
index 4e80073f55f..8053c60042b 100644
--- a/README.md
+++ b/README.md
@@ -27,6 +27,7 @@
 ----
 :fire: *News* :fire:
+- [Dec, 2023] [**Mixtral 8x7B**](https://mistral.ai/news/mixtral-of-experts/), a high-quality sparse mixture-of-experts model, was released by Mistral AI! Deploy via SkyPilot on any cloud: [**single replica**](https://docs.mistral.ai/self-deployment/skypilot/); [**multiple replicas**](./llm/mixtral/).
 - [Nov, 2023] Example: Using [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) to finetune Mistral 7B on the cloud (on-demand and spot): [**example**](./llm/axolotl/)
 - [Sep, 2023] [**Mistral 7B**](https://mistral.ai/news/announcing-mistral-7b/), a high-quality open LLM, was released! Deploy via SkyPilot on any cloud: [**Mistral docs**](https://docs.mistral.ai/self-deployment/skypilot)
 - [Sep, 2023] Case study: [**Covariant**](https://covariant.ai/) transformed AI development on the cloud using SkyPilot, delivering models 4x faster cost-effectively: [**read the case study**](https://blog.skypilot.co/covariant/)
@@ -136,7 +137,8 @@ To learn more, see our [Documentation](https://skypilot.readthedocs.io/en/latest
 Runnable examples:
 - LLMs on SkyPilot
-  - [Mistral 7B](https://docs.mistral.ai/self-deployment/skypilot) (from official Mistral team)
+  - [Mixtral 8x7B](./llm/mixtral/)
+  - [Mistral 7B](https://docs.mistral.ai/self-deployment/skypilot/) (from official Mistral team)
   - [vLLM: Serving LLM 24x Faster On the Cloud](./llm/vllm/) (from official vLLM team)
   - [Vicuna chatbots: Training & Serving](./llm/vicuna/) (from official Vicuna team)
   - [Train your own Vicuna on Llama-2](./llm/vicuna-llama-2/)
diff --git a/llm/mixtral/README.md b/llm/mixtral/README.md
new file mode 100644
index 00000000000..9faf923725a
--- /dev/null
+++ b/llm/mixtral/README.md
@@ -0,0 +1,90 @@
# Serving Mixtral from Mistral.ai

Mistral AI released Mixtral 8x7B, a high-quality sparse mixture-of-experts (SMoE) model with open weights. Mixtral outperforms Llama 2 70B on most benchmarks with 6x faster inference. Mistral.ai uses SkyPilot as [the default way](https://docs.mistral.ai/self-deployment/skypilot) to distribute their new model. This folder contains the code to serve Mixtral on any cloud with SkyPilot.

There are three ways to serve the model:

## 1. Serve with a single instance

SkyPilot can help you serve Mixtral by automatically finding available resources on any cloud, provisioning the VM, opening the ports, and serving the model. To serve Mixtral on a single instance, run the following command:

```bash
sky launch -c mixtral ./serve.yaml
```

Note that we specify the following resources, so that SkyPilot automatically [fails over](https://skypilot.readthedocs.io/en/latest/examples/auto-failover.html) through the candidate GPU configurations (in order of price) until an available one is found:

```yaml
resources:
  accelerators: {A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
```

### Accessing the model

We can now access the model through the OpenAI API with the IP and port:

```bash
IP=$(sky status --ip mixtral)

curl -L http://$IP:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
    "prompt": "My favourite condiment is",
    "max_tokens": 25
  }'
```
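The same request can also be made from Python. Below is a minimal sketch (not part of the example's YAML) that assumes the `openai` Python package (v1 or later) is installed and that the placeholder IP is replaced with the output of `sky status --ip mixtral`; vLLM's OpenAI-compatible server accepts a dummy API key:

```python
# Minimal sketch: call the vLLM OpenAI-compatible server running on the cluster.
# Assumes `pip install "openai>=1.0"`; the IP below is a hypothetical placeholder.
from openai import OpenAI

IP = "1.2.3.4"  # replace with the output of `sky status --ip mixtral`

client = OpenAI(base_url=f"http://{IP}:8000/v1", api_key="EMPTY")
completion = client.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    prompt="My favourite condiment is",
    max_tokens=25,
)
print(completion.choices[0].text)
```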
## 2. Serve with multiple instances

When you need to scale up, [SkyServe](https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html), the serving library built on top of SkyPilot, can serve the model with multiple instances while still providing a single endpoint. To serve Mixtral with multiple instances, run the following command:

```bash
sky serve up -n mixtral ./serve.yaml
```

The additional `service` section in the YAML specifies how the healthiness of the service is checked and how replicas are automatically restarted when unexpected failures happen:
```yaml
service:
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: mistralai/Mixtral-8x7B-Instruct-v0.1
      messages:
        - role: user
          content: Hello! What is your name?
    initial_delay_seconds: 1200
  replica_policy:
    min_replicas: 1
    auto_restart: true
```

Optional: To further reduce costs by 3-4x, we can use spot instances as the replicas; SkyServe automatically manages the spot instances, monitors prices and preemptions, and restarts replicas when needed.
To do so, add `use_spot: true` to the `resources` field, i.e.:
```yaml
resources:
  use_spot: true
  accelerators: {A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
```

### Accessing the model

After the `sky serve up` command finishes, there will be a single endpoint for the service. We can access the model through the OpenAI API with that endpoint:

```bash
ENDPOINT=$(sky serve status --endpoint mixtral)

curl -L http://$ENDPOINT/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
    "prompt": "My favourite condiment is",
    "max_tokens": 25
  }'
```

## 3. Official guide from Mistral AI

Mistral.ai also includes a guide for launching the Mixtral 8x7B model with SkyPilot in their official docs. Please refer to [this link](https://docs.mistral.ai/self-deployment/skypilot) for more details.

> Note: the Docker image referenced in the official doc may not have been updated yet, which can cause a failure where vLLM complains about missing support for the model. Feel free to build a new Docker image with the setup commands in our [serve.yaml](./serve.yaml) instead.
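Either endpoint speaks the same OpenAI-compatible API that the readiness probe above exercises. As a rough illustration (assuming the `requests` Python package and a hypothetical endpoint value obtained from `sky serve status --endpoint mixtral`), the probe's `post_data` corresponds to a client request like:

```python
# Rough sketch of the chat-completions request behind the readiness probe.
# Assumes `pip install requests`; ENDPOINT is a hypothetical placeholder.
import requests

ENDPOINT = "1.2.3.4:30001"  # replace with `sky serve status --endpoint mixtral`

resp = requests.post(
    f"http://{ENDPOINT}/v1/chat/completions",
    json={
        "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
        "messages": [{"role": "user", "content": "Hello! What is your name?"}],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```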
diff --git a/llm/mixtral/serve.yaml b/llm/mixtral/serve.yaml
new file mode 100644
index 00000000000..e3592e51d83
--- /dev/null
+++ b/llm/mixtral/serve.yaml
@@ -0,0 +1,44 @@
# An example YAML for serving the Mixtral model from Mistral.ai with an OpenAI-compatible API.
# Usage:
#   1. Launch on a single instance: `sky launch -c mixtral ./serve.yaml`
#   2. Scale up to multiple instances with a single endpoint:
#      `sky serve up -n mixtral ./serve.yaml`
service:
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: mistralai/Mixtral-8x7B-Instruct-v0.1
      messages:
        - role: user
          content: Hello! What is your name?
    initial_delay_seconds: 1200
  replica_policy:
    min_replicas: 2
    auto_restart: true

resources:
  accelerators: {A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
  ports: 8000
  disk_tier: high

setup: |
  conda activate mixtral
  if [ $? -ne 0 ]; then
    conda create -n mixtral -y python=3.10
    conda activate mixtral
  fi
  # We have to manually install Torch first, otherwise apex & xformers won't build.
  pip list | grep torch || pip install "torch>=2.0.0"

  pip list | grep vllm || pip install "git+https://github.com/vllm-project/vllm.git"
  pip install git+https://github.com/huggingface/transformers
  pip list | grep megablocks || pip install megablocks

run: |
  conda activate mixtral
  export PATH=$PATH:/sbin
  # Shard the model across all GPUs on the instance.
  python -u -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE | tee ~/openai_api_server.log
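Once the task is up, the vLLM server started by the `run` section speaks the standard OpenAI API, including streaming. Below is a minimal streaming sketch, assuming the `openai` Python package (v1 or later) is installed and that the placeholder address points at the server (for example, the IP from `sky status --ip mixtral` on port 8000):

```python
# Minimal sketch: stream a chat completion from the vLLM OpenAI-compatible
# server launched by the `run` section. Assumes `pip install "openai>=1.0"`.
from openai import OpenAI

IP = "1.2.3.4"  # hypothetical placeholder, e.g. the output of `sky status --ip mixtral`

client = OpenAI(base_url=f"http://{IP}:8000/v1", api_key="EMPTY")
stream = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    messages=[{"role": "user", "content": "Briefly introduce yourself."}],
    max_tokens=64,
    stream=True,
)
for chunk in stream:
    # The first chunk may only carry the role, so its content can be None.
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()
```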