diff --git a/examples/notebooks/beam-ml/run_inference_vllm.ipynb b/examples/notebooks/beam-ml/run_inference_vllm.ipynb
new file mode 100644
index 000000000000..008c4262d5ce
--- /dev/null
+++ b/examples/notebooks/beam-ml/run_inference_vllm.ipynb
@@ -0,0 +1,614 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": [],
+      "gpuType": "T4",
+      "toc_visible": true
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    },
+    "accelerator": "GPU"
+  },
+  "cells": [
+    {
+      "cell_type": "code",
+      "source": [
+        "# @title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License"
+      ],
+      "metadata": {
+        "id": "OsFaZscKSPvo"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Run ML inference by using vLLM on GPUs\n",
+        "\n",
+        "Run in Google Colab\n",
+        "\n",
+        "View source on GitHub"
" + ], + "metadata": { + "id": "NrHRIznKp3nS" + } + }, + { + "cell_type": "markdown", + "source": [ + "[vLLM](https://github.com/vllm-project/vllm) is a fast and user-frienly library for LLM inference and serving. vLLM optimizes LLM inference with mechanisms like PagedAttention for memory management and continuous batching for increasing throughput. For popular models, vLLM has been shown to increase throughput by a multiple of 2 to 4. With Apache Beam, you can serve models with vLLM and scale that serving with just a few lines of code.\n", + "\n", + "This notebook demonstrates how to run machine learning inference by using vLLM and GPUs in three ways:\n", + "\n", + "* locally without Apache Beam\n", + "* locally with the Apache Beam local runner\n", + "* remotely with the Dataflow runner\n", + "\n", + "It also shows how to swap in a different model without modifying your pipeline structure by changing the configuration." + ], + "metadata": { + "id": "H0ZFs9rDvtJm" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Requirements\n", + "\n", + "This notebook assumes that a GPU is enabled in Colab. If this setting isn't enabled, the locally executed sections of this notebook might not work. To enable a GPU, in the Colab menu, click **Runtime** > **Change runtime type**. For **Hardware accelerator**, choose a GPU accelerator. If you can't access a GPU in Colab, you can run the Dataflow section of this notebook.\n", + "\n", + "To run the Dataflow section, you need access to the following resources:\n", + "\n", + "- a computer with Docker installed\n", + "- a [Google Cloud](https://cloud.google.com/) account" + ], + "metadata": { + "id": "6x41tnbTvQM1" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Install dependencies\n", + "\n", + "Before creating your pipeline, download and install the dependencies required to develop with Apache Beam and vLLM. vLLM is supported in Apache Beam versions 2.60.0 and later." + ], + "metadata": { + "id": "8PSjyDIavRcn" + } + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "irCKNe42p22r" + }, + "outputs": [], + "source": [ + "!pip install openai>=1.52.2\n", + "!pip install vllm>=0.6.3\n", + "!pip install apache-beam[gcp]==2.60.0\n", + "!pip check" + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Run locally without Apache Beam\n", + "\n", + "In this section, you run a vLLM server without using Apache Beam. Use the `facebook/opt-125m` model. This model is small enough to fit in Colab memory and doesn't require any extra authentication.\n", + "\n", + "First, start the vLLM server. This step might take a minute or two, because the model needs to download before vLLM starts running inference." + ], + "metadata": { + "id": "3xz8zuA7vcS4" + } + }, + { + "cell_type": "code", + "source": [ + "! python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m" + ], + "metadata": { + "id": "GbJGzINNt5sG" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Next, while the vLLM server is running, open a separate terminal to communicate with the vLLM serving process. To open a terminal in Colab, in the sidebar, click **Terminal**. 
+        "\n",
+        "```\n",
+        "pip install openai\n",
+        "python\n",
+        "\n",
+        "from openai import OpenAI\n",
+        "\n",
+        "# Modify OpenAI's API key and API base to use vLLM's API server.\n",
+        "openai_api_key = \"EMPTY\"\n",
+        "openai_api_base = \"http://localhost:8000/v1\"\n",
+        "client = OpenAI(\n",
+        "    api_key=openai_api_key,\n",
+        "    base_url=openai_api_base,\n",
+        ")\n",
+        "completion = client.completions.create(model=\"facebook/opt-125m\",\n",
+        "                                        prompt=\"San Francisco is a\")\n",
+        "print(\"Completion result:\", completion)\n",
+        "```\n",
+        "\n",
+        "This code runs against the server started in the previous cell. You can experiment with different prompts."
+      ],
+      "metadata": {
+        "id": "n35LXTS3uzIC"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Run locally with Apache Beam\n",
+        "\n",
+        "In this section, you set up an Apache Beam pipeline to run a job with an embedded vLLM instance.\n",
+        "\n",
+        "First, define the `VllmCompletionsModelHandler` object. This configuration object gives Apache Beam the information that it needs to create a dedicated vLLM process in the middle of the pipeline. Apache Beam then handles passing prompts to that process and collecting the responses. No additional code is needed."
+      ],
+      "metadata": {
+        "id": "Hbxi83BfwbBa"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "from apache_beam.ml.inference.base import RunInference\n",
+        "from apache_beam.ml.inference.vllm_inference import VLLMCompletionsModelHandler\n",
+        "from apache_beam.ml.inference.base import PredictionResult\n",
+        "import apache_beam as beam\n",
+        "\n",
+        "model_handler = VLLMCompletionsModelHandler('facebook/opt-125m')"
+      ],
+      "metadata": {
+        "id": "sUqjOzw3wpI4"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Next, define examples to run inference against, and define a helper function to print out the inference results."
+      ],
+      "metadata": {
+        "id": "N06lXRKRxCz5"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "class FormatOutput(beam.DoFn):\n",
+        "  def process(self, element, *args, **kwargs):\n",
+        "    yield \"Input: {input}, Output: {output}\".format(input=element.example, output=element.inference)\n",
+        "\n",
+        "prompts = [\n",
+        "    \"Hello, my name is\",\n",
+        "    \"The president of the United States is\",\n",
+        "    \"The capital of France is\",\n",
+        "    \"The future of AI is\",\n",
+        "    \"Emperor penguins are\",\n",
+        "]"
+      ],
+      "metadata": {
+        "id": "3a1PznmtxNR_"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Finally, run the pipeline.\n",
+        "\n",
+        "This step might take a minute or two, because the model needs to download before Apache Beam can start running inference."
+      ],
+      "metadata": {
+        "id": "Njl0QfrLxQ0m"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "with beam.Pipeline() as p:\n",
+        "  _ = (p | beam.Create(prompts)          # Create a PCollection of the prompts.\n",
+        "         | RunInference(model_handler)   # Send the prompts to the model and get responses.\n",
+        "         | beam.ParDo(FormatOutput())    # Format the output.\n",
+        "         | beam.Map(print)               # Print the formatted output.\n",
+        "       )"
+      ],
+      "metadata": {
+        "id": "9yXbzV0ZmZcJ"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Run remotely on Dataflow\n",
+        "\n",
+        "After you validate that the pipeline can run against vLLM locally, you can productionize the workflow on a remote runner. This notebook runs the pipeline on the Dataflow runner."
+      ],
+      "metadata": {
+        "id": "Jv7be6Pk9Hlx"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "### Build a Docker image\n",
+        "\n",
+        "To run a pipeline with vLLM on Dataflow, you must create a Docker image that contains your dependencies and is compatible with a GPU runtime. For more information about building GPU-compatible Dataflow containers, see [Build a custom container image](https://cloud.google.com/dataflow/docs/gpu/use-gpus#custom-container) in the Dataflow documentation.\n",
+        "\n",
+        "First, define and save your Dockerfile. This file uses an Nvidia GPU-compatible base image. In the Dockerfile, install the Python dependencies needed to run the job.\n",
+        "\n",
+        "Before proceeding, make sure that your configuration meets the following requirements:\n",
+        "\n",
+        "- The Python version in the following cell matches the Python version defined in the Dockerfile.\n",
+        "- The Apache Beam version defined in your dependencies matches the Apache Beam version defined in the Dockerfile."
+      ],
+      "metadata": {
+        "id": "J1LMrl1Yy6QB"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "!python --version"
+      ],
+      "metadata": {
+        "id": "jCQ6-D55gqfl"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "cell_str='''\n",
+        "FROM nvidia/cuda:12.4.1-devel-ubuntu22.04\n",
+        "\n",
+        "RUN apt update\n",
+        "RUN apt install software-properties-common -y\n",
+        "RUN add-apt-repository ppa:deadsnakes/ppa\n",
+        "RUN apt update\n",
+        "RUN apt-get update\n",
+        "\n",
+        "ARG DEBIAN_FRONTEND=noninteractive\n",
+        "\n",
+        "RUN apt install python3.10-full -y\n",
+        "# RUN apt install python3.10-venv -y\n",
+        "# RUN apt install python3.10-dev -y\n",
+        "RUN rm /usr/bin/python3\n",
+        "RUN ln -s python3.10 /usr/bin/python3\n",
+        "RUN python3 --version\n",
+        "RUN apt-get install -y curl\n",
+        "RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.10 && pip install --upgrade pip\n",
+        "\n",
+        "# Copy the Apache Beam worker dependencies from the Beam Python 3.10 SDK image.\n",
+        "COPY --from=apache/beam_python3.10_sdk:2.60.0 /opt/apache/beam /opt/apache/beam\n",
+        "\n",
+        "RUN pip install --no-cache-dir -vvv apache-beam[gcp]==2.60.0\n",
+        "RUN pip install openai>=1.52.2 vllm>=0.6.3\n",
+        "\n",
+        "RUN apt install libcairo2-dev pkg-config python3-dev -y\n",
+        "RUN pip install pycairo\n",
+        "\n",
+        "# Set the entrypoint to Apache Beam SDK worker launcher.\n",
+        "ENTRYPOINT [ \"/opt/apache/beam/boot\" ]\n",
+        "'''\n",
+        "\n",
+        "with open('VllmDockerfile', 'w') as f:\n",
+        "  f.write(cell_str)"
+      ],
+      "metadata": {
+        "id": "7QyNq_gygHLO"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "After you save the Dockerfile, build and push your Docker image. Because Docker is not accessible from Colab, you need to complete this step in a separate environment.\n",
+        "\n",
+        "1. In the sidebar, click **Files** to open the **Files** pane.\n",
+        "2. In an environment with Docker installed, download the **VllmDockerfile** file to an empty folder.\n",
+        "3. Run the following commands. Replace `<REPOSITORY_LOCATION>` with a valid [Artifact Registry](https://cloud.google.com/artifact-registry/docs/overview) repository.\n",
+        "\n",
+        "   ```\n",
+        "   docker build -t \"<REPOSITORY_LOCATION>:latest\" -f VllmDockerfile ./\n",
+        "   docker image push \"<REPOSITORY_LOCATION>:latest\"\n",
+        "   ```"
+      ],
+      "metadata": {
+        "id": "zWma0YetiEn5"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "### Define and run the pipeline\n",
+        "\n",
+        "When you have a working Docker image, define and run your pipeline.\n",
+        "\n",
+        "First, define the pipeline options that you want to use to launch the Dataflow job. Before running the next cell, replace the following variables:\n",
+        "\n",
+        "- `<BUCKET_NAME>`: the name of a valid [Google Cloud Storage](https://cloud.google.com/storage) bucket. Don't include a `gs://` prefix or trailing slashes.\n",
+        "- `<CONTAINER_LOCATION>`: the name of the Google Artifact Registry repository that you used in the previous step. Don't include the `latest` tag, because this tag is appended automatically as part of the cell.\n",
+        "- `<PROJECT_NAME>`: the name of the Google Cloud project that you created your bucket and Artifact Registry repository in.\n",
+        "\n",
+        "This workflow uses the following Dataflow service option: `worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver:5xx`. When you use this service option, Dataflow installs a T4 GPU that uses a `5xx` series Nvidia driver on each worker machine. The 5xx driver is required to run vLLM jobs."
+      ],
+      "metadata": {
+        "id": "NjZyRjte0g0Q"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "\n",
+        "from apache_beam.options.pipeline_options import GoogleCloudOptions\n",
+        "from apache_beam.options.pipeline_options import PipelineOptions\n",
+        "from apache_beam.options.pipeline_options import SetupOptions\n",
+        "from apache_beam.options.pipeline_options import StandardOptions\n",
+        "from apache_beam.options.pipeline_options import WorkerOptions\n",
+        "\n",
+        "\n",
+        "options = PipelineOptions()\n",
+        "\n",
+        "BUCKET_NAME = ''  # Replace with your bucket name.\n",
+        "CONTAINER_LOCATION = ''  # Replace with your container location (<REPOSITORY_LOCATION> from the previous step).\n",
+        "PROJECT_NAME = ''  # Replace with your Google Cloud project.\n",
+        "\n",
+        "options.view_as(GoogleCloudOptions).project = PROJECT_NAME\n",
+        "\n",
+        "# Provide required pipeline options for the Dataflow Runner.\n",
+        "options.view_as(StandardOptions).runner = \"DataflowRunner\"\n",
+        "\n",
+        "# Set the Google Cloud region that you want to run Dataflow in.\n",
+        "options.view_as(GoogleCloudOptions).region = 'us-central1'\n",
+        "\n",
+        "# IMPORTANT: Replace BUCKET_NAME with the name of your Cloud Storage bucket.\n",
+        "dataflow_gcs_location = \"gs://%s/dataflow\" % BUCKET_NAME\n",
+        "\n",
+        "# The Dataflow staging location. This location is used to stage the Dataflow pipeline and the SDK binary.\n",
+        "options.view_as(GoogleCloudOptions).staging_location = '%s/staging' % dataflow_gcs_location\n",
+        "\n",
+        "# The Dataflow temp location. This location is used to store temporary files or intermediate results before outputting to the sink.\n",
+        "options.view_as(GoogleCloudOptions).temp_location = '%s/temp' % dataflow_gcs_location\n",
+        "\n",
+        "# Enable the GPU runtime. Make sure to install a 5xx series driver, because vLLM works only with 5xx drivers, not 4xx.\n",
+        "options.view_as(GoogleCloudOptions).dataflow_service_options = [\"worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver:5xx\"]\n",
+        "\n",
+        "options.view_as(SetupOptions).save_main_session = True\n",
+        "\n",
+        "# Choose a machine type compatible with the GPU type.\n",
+        "options.view_as(WorkerOptions).machine_type = \"n1-standard-4\"\n",
+        "\n",
+        "options.view_as(WorkerOptions).worker_harness_container_image = '%s:latest' % CONTAINER_LOCATION"
+      ],
+      "metadata": {
+        "id": "kXy9FRYVCSjq"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
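+    {
+      "cell_type": "markdown",
+      "source": [
+        "Optionally, before you launch a billable Dataflow job, run a quick sanity check on the values that you set in the previous cell. The following cell is a minimal sketch, not part of the core workflow; it only verifies that the `BUCKET_NAME`, `CONTAINER_LOCATION`, and `PROJECT_NAME` placeholders were replaced."
+      ],
+      "metadata": {}
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Optional sanity check (a sketch, not required by the pipeline): confirm that the\n",
+        "# placeholder variables in the previous cell were filled in before launching a job.\n",
+        "for name, value in [('BUCKET_NAME', BUCKET_NAME),\n",
+        "                    ('CONTAINER_LOCATION', CONTAINER_LOCATION),\n",
+        "                    ('PROJECT_NAME', PROJECT_NAME)]:\n",
+        "  assert value, 'Set {} before running the pipeline.'.format(name)\n",
+        "print('Dataflow options configured for project:', PROJECT_NAME)"
+      ],
+      "metadata": {},
+      "execution_count": null,
+      "outputs": []
+    },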
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Next, authenticate Colab so that it can submit a job on your behalf."
+      ],
+      "metadata": {
+        "id": "xPhe597P1-QJ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "def auth_to_colab():\n",
+        "  from google.colab import auth\n",
+        "  auth.authenticate_user()\n",
+        "\n",
+        "auth_to_colab()"
+      ],
+      "metadata": {
+        "id": "Xkf6yIVlFB8-"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Finally, run the pipeline on Dataflow. The pipeline definition is almost exactly the same as the definition used for local execution. The pipeline options are the only change.\n",
+        "\n",
+        "The following code creates a Dataflow job in your project. You can view the results in Colab or in the Google Cloud console. Creating a Dataflow job and downloading the model might take a few minutes. After the job starts performing inference, it quickly runs through the inputs."
+      ],
+      "metadata": {
+        "id": "MJtEI6Ux2eza"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "import logging\n",
+        "from apache_beam.ml.inference.base import RunInference\n",
+        "from apache_beam.ml.inference.vllm_inference import VLLMCompletionsModelHandler\n",
+        "from apache_beam.ml.inference.base import PredictionResult\n",
+        "import apache_beam as beam\n",
+        "\n",
+        "class FormatOutput(beam.DoFn):\n",
+        "  def process(self, element, *args, **kwargs):\n",
+        "    yield \"Input: {input}, Output: {output}\".format(input=element.example, output=element.inference)\n",
+        "\n",
+        "prompts = [\n",
+        "    \"Hello, my name is\",\n",
+        "    \"The president of the United States is\",\n",
+        "    \"The capital of France is\",\n",
+        "    \"The future of AI is\",\n",
+        "    \"John Cena is\",\n",
+        "]\n",
+        "\n",
+        "# Specify the model handler, providing the model name.\n",
+        "model_handler = VLLMCompletionsModelHandler('facebook/opt-125m')\n",
+        "\n",
+        "with beam.Pipeline(options=options) as p:\n",
+        "  _ = (p | beam.Create(prompts)          # Create a PCollection of the prompts.\n",
+        "         | RunInference(model_handler)   # Send the prompts to the model and get responses.\n",
+        "         | beam.ParDo(FormatOutput())    # Format the output.\n",
+        "         | beam.Map(logging.info)        # Log the formatted output.\n",
+        "       )"
+      ],
+      "metadata": {
+        "id": "8gjDdru_9Dii"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Run vLLM with a Gemma model\n",
+        "\n",
+        "After you configure your pipeline, switching the model used by the pipeline is relatively straightforward. You can run the same pipeline, but switch the model name defined in the model handler. This example runs the pipeline created previously but uses a [Gemma](https://ai.google.dev/gemma) model.\n",
+        "\n",
+        "Before you start, sign in to HuggingFace and make sure that you can access the Gemma models. To access the Gemma models, you must accept the terms and conditions.\n",
+        "\n",
+        "1. Navigate to the [Gemma Model Card](https://huggingface.co/google/gemma-2b).\n",
+        "2. Sign in, or sign up for a free HuggingFace account.\n",
+        "3. Follow the prompts to agree to the conditions.\n",
+        "\n",
+        "When you complete these steps, the following message appears on the model card page: `You have been granted access to this model`.\n",
+        "\n",
+        "Next, sign in to your account from this notebook by running the following code and then following the prompts."
+      ],
+      "metadata": {
+        "id": "22cEHPCc28fH"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "! huggingface-cli login"
+      ],
+      "metadata": {
+        "id": "JHwIsFI9kd9j"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Verify that the notebook can now access the Gemma model. Run the following code, which starts a vLLM server to serve the Gemma 2b model. Because the default T4 Colab runtime doesn't support the full data type precision needed to run Gemma models, the `--dtype=half` parameter is required.\n",
+        "\n",
+        "When successful, the following cell runs indefinitely, so you can shut it down after the server process starts. A running server process means that the Gemma 2b model downloaded successfully and that the server is ready to serve traffic."
+      ],
+      "metadata": {
+        "id": "IjX2If8rnCol"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "! python -m vllm.entrypoints.openai.api_server --model google/gemma-2b --dtype=half"
+      ],
+      "metadata": {
+        "id": "LH_oCFWMiwFs"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "To run the pipeline in Apache Beam, run the following code. Update the `VLLMCompletionsModelHandler` object with the new parameters, which match the command from the previous cell. Reuse all of the pipeline logic from the previous pipelines."
+      ],
+      "metadata": {
+        "id": "31BmdDUAn-SW"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "model_handler = VLLMCompletionsModelHandler('google/gemma-2b', vllm_server_kwargs={'dtype': 'half'})\n",
+        "\n",
+        "with beam.Pipeline() as p:\n",
+        "  _ = (p | beam.Create(prompts)          # Create a PCollection of the prompts.\n",
+        "         | RunInference(model_handler)   # Send the prompts to the model and get responses.\n",
+        "         | beam.ParDo(FormatOutput())    # Format the output.\n",
+        "         | beam.Map(print)               # Print the formatted output.\n",
+        "       )"
+      ],
+      "metadata": {
+        "id": "DyC2ikXg237p"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "### Run Gemma on Dataflow\n",
+        "\n",
+        "As a next step, run this pipeline on Dataflow. Follow the same steps described in the \"Run remotely on Dataflow\" section of this page:\n",
+        "\n",
+        "1. Construct a Dockerfile and push a new Docker image. You can use the same Dockerfile that you created previously, but you need to add a step to set your HuggingFace authentication key. In your Dockerfile, add the following line before the entrypoint:\n",
+        "\n",
+        "   ```\n",
+        "   RUN python3 -c 'from huggingface_hub import HfFolder; HfFolder.save_token(\"<YOUR_HF_TOKEN>\")'\n",
+        "   ```\n",
+        "\n",
+        "2. Set pipeline options. You can reuse the options defined in this notebook. Replace the Docker image location with your new Docker image.\n",
+        "3. Run the pipeline. Copy the pipeline that you ran on Dataflow, and replace the pipeline options with the pipeline options that you just defined. A minimal sketch of this step follows this cell.\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "C6OYfub6ovFK"
+      }
+    },
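+    {
+      "cell_type": "markdown",
+      "source": [
+        "The following cell is a minimal sketch of step 3, not part of the original walkthrough. It assumes that the cells from the \"Run remotely on Dataflow\" section already ran in this session (so `options`, `prompts`, `FormatOutput`, and the imports are defined), and it uses `<YOUR_NEW_IMAGE_LOCATION>` as a placeholder for the Artifact Registry path of the image that contains your HuggingFace token."
+      ],
+      "metadata": {}
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# A sketch of step 3: reuse the Dataflow options defined earlier, point them at the\n",
+        "# new image that contains your HuggingFace token, and swap in the Gemma model handler.\n",
+        "# <YOUR_NEW_IMAGE_LOCATION> is a placeholder for your new Artifact Registry path.\n",
+        "options.view_as(WorkerOptions).worker_harness_container_image = '<YOUR_NEW_IMAGE_LOCATION>:latest'\n",
+        "\n",
+        "gemma_model_handler = VLLMCompletionsModelHandler('google/gemma-2b', vllm_server_kwargs={'dtype': 'half'})\n",
+        "\n",
+        "with beam.Pipeline(options=options) as p:\n",
+        "  _ = (p | beam.Create(prompts)               # Create a PCollection of the prompts.\n",
+        "         | RunInference(gemma_model_handler)  # Send the prompts to Gemma and get responses.\n",
+        "         | beam.ParDo(FormatOutput())         # Format the output.\n",
+        "         | beam.Map(logging.info)             # Log the formatted output.\n",
+        "       )"
+      ],
+      "metadata": {},
+      "execution_count": null,
+      "outputs": []
+    }
+  ]
+}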