From cd22094166ed98c56e200c8dfafc89f58546194e Mon Sep 17 00:00:00 2001
From: tanmayv25
Date: Thu, 7 Dec 2023 02:25:00 -0800
Subject: [PATCH] Update README and versions for 23.12 branch

---
 README.md | 180 +-----------------------------------------------------
 1 file changed, 1 insertion(+), 179 deletions(-)

diff --git a/README.md b/README.md
index 3bf38476..4c7ba05f 100644
--- a/README.md
+++ b/README.md
@@ -28,182 +28,4 @@
 [![License](https://img.shields.io/badge/License-BSD3-lightgrey.svg)](https://opensource.org/licenses/BSD-3-Clause)
 
-**LATEST RELEASE: You are currently on the main branch, which tracks
-under-development progress towards the next release. The current release branch
-is [r23.10](https://github.com/triton-inference-server/vllm_backend/tree/r23.10),
-which corresponds to the 23.10 container release on
-[NVIDIA GPU Cloud (NGC)](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver).**
-
-# vLLM Backend
-
-The Triton backend for [vLLM](https://github.com/vllm-project/vllm)
-is designed to run
-[supported models](https://vllm.readthedocs.io/en/latest/models/supported_models.html)
-on a
-[vLLM engine](https://github.com/vllm-project/vllm/blob/main/vllm/engine/async_llm_engine.py).
-You can learn more about Triton backends in the
-[backend repo](https://github.com/triton-inference-server/backend).
-
-This is a [Python-based backend](https://github.com/triton-inference-server/backend/blob/main/docs/python_based_backends.md#python-based-backends).
-When using this backend, all requests are placed on the
-vLLM AsyncEngine as soon as they are received. Inflight batching and paged attention are handled
-by the vLLM engine.
-
-Where can I ask general questions about Triton and Triton backends?
-Be sure to read all the information below as well as the
-[general Triton documentation](https://github.com/triton-inference-server/server#triton-inference-server)
-available in the main [server](https://github.com/triton-inference-server/server)
-repo. If you don't find your answer there, you can ask questions on the
-main Triton [issues page](https://github.com/triton-inference-server/server/issues).
-
-## Installing the vLLM Backend
-
-There are several ways to install and deploy the vLLM backend.
-
-### Option 1. Use the Pre-Built Docker Container
-
-Pull a `tritonserver:<xx.yy>-vllm-python-py3` container with the vLLM backend from the
-[NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver)
-registry. `<xx.yy>` is the version of Triton that you want to use. Please note
-that Triton's vLLM container was first introduced in the 23.10 release.
-
-```
-docker pull nvcr.io/nvidia/tritonserver:<xx.yy>-vllm-python-py3
-```
-
-### Option 2. Build a Custom Container From Source
-
-You can follow the steps described in the
-[Building With Docker](https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/build.md#building-with-docker)
-guide and use the
-[build.py](https://github.com/triton-inference-server/server/blob/main/build.py)
-script.
-
-A sample command to build a Triton Server container with all options enabled is shown below. Feel free to customize the flags according to your needs.
-
-```
-./build.py -v --enable-logging
-    --enable-stats
-    --enable-tracing
-    --enable-metrics
-    --enable-gpu-metrics
-    --enable-cpu-metrics
-    --enable-gpu
-    --filesystem=gcs
-    --filesystem=s3
-    --filesystem=azure_storage
-    --endpoint=http
-    --endpoint=grpc
-    --endpoint=sagemaker
-    --endpoint=vertex-ai
-    --upstream-container-version=23.10
-    --backend=python:r23.10
-    --backend=vllm:r23.10
-```
-
-### Option 3. Add the vLLM Backend to the Default Triton Container
-
-You can install the vLLM backend directly into the NGC Triton container.
-In this case, please install vLLM first. You can do so by running
-`pip install vllm==<vllm_version>`. Then, set up the vLLM backend in the
-container with the following commands:
-
-```
-mkdir -p /opt/tritonserver/backends/vllm
-wget -P /opt/tritonserver/backends/vllm https://raw.githubusercontent.com/triton-inference-server/vllm_backend/main/src/model.py
-```
-
-## Using the vLLM Backend
-
-You can see an example
-[model_repository](samples/model_repository)
-in the [samples](samples) folder.
-You can use this as is and change the model by changing the `model` value in `model.json`.
-`model.json` represents a key-value dictionary that is fed to vLLM's AsyncLLMEngine when initializing the model.
-You can see the supported arguments in vLLM's
-[arg_utils.py](https://github.com/vllm-project/vllm/blob/main/vllm/engine/arg_utils.py),
-specifically
-[here](https://github.com/vllm-project/vllm/blob/ee8217e5bee5860469204ee57077a91138c9af02/vllm/engine/arg_utils.py#L11)
-and
-[here](https://github.com/vllm-project/vllm/blob/ee8217e5bee5860469204ee57077a91138c9af02/vllm/engine/arg_utils.py#L201).
-
-For multi-GPU support, EngineArgs like `tensor_parallel_size` can be specified in
-[model.json](samples/model_repository/vllm_model/1/model.json).
-
-Note: vLLM greedily consumes up to 90% of the GPU's memory under default settings.
-The sample model updates this behavior by setting `gpu_memory_utilization` to 50%.
-You can tweak this behavior using fields like `gpu_memory_utilization` and other settings in
-[model.json](samples/model_repository/vllm_model/1/model.json).
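
For illustration, a minimal `model.json` along these lines would load a model, cap GPU
memory usage at 50%, and shard the model across two GPUs. This is only a sketch, not the
exact contents of the shipped sample; the model name and values are placeholders to
replace with your own:

```
{
    "model": "facebook/opt-125m",
    "gpu_memory_utilization": 0.5,
    "tensor_parallel_size": 2
}
```

Each key corresponds to an engine argument from vLLM's `arg_utils.py`, so any other
supported argument can be added to the dictionary in the same way.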
-
-### Launching Triton Inference Server
-
-Once you have the model repository set up, it is time to launch the Triton server.
-We will use the [pre-built Triton container with the vLLM backend](#option-1-use-the-pre-built-docker-container) from
-[NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver) in this example.
-
-```
-docker run --gpus all -it --net=host --rm -p 8001:8001 --shm-size=1G --ulimit memlock=-1 --ulimit stack=67108864 -v ${PWD}:/work -w /work nvcr.io/nvidia/tritonserver:<xx.yy>-vllm-python-py3 tritonserver --model-repository ./model_repository
-```
-
-Replace `<xx.yy>` with the version of Triton that you want to use.
-Note that Triton's vLLM container was first published with the 23.10 release.
-
-After you start Triton you will see output on the console showing
-the server starting up and loading the model. When you see output
-like the following, Triton is ready to accept inference requests.
-
-```
-I1030 22:33:28.291908 1 grpc_server.cc:2513] Started GRPCInferenceService at 0.0.0.0:8001
-I1030 22:33:28.292879 1 http_server.cc:4497] Started HTTPService at 0.0.0.0:8000
-I1030 22:33:28.335154 1 http_server.cc:270] Started Metrics Service at 0.0.0.0:8002
-```
-
-### Sending Your First Inference
-
-After you
-[start Triton](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/getting_started/quickstart.html)
-with the
-[sample model_repository](samples/model_repository),
-you can quickly run your first inference request with the
-[generate endpoint](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_generate.md).
-
-Try out the command below.
-
-```
-$ curl -X POST localhost:8000/v2/models/vllm_model/generate -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": false, "temperature": 0}}'
-```
-
-Upon success, you should see a response from the server like this one:
-```
-{"model_name":"vllm_model","model_version":"1","text_output":"What is Triton Inference Server?\n\nTriton Inference Server is a server that is used by many"}
-```
-
-In the [samples](samples) folder, you can also find a sample client,
-[client.py](samples/client.py), which uses Triton's
-[asyncio gRPC client library](https://github.com/triton-inference-server/client#python-asyncio-support-beta-1)
-to run inference on Triton.
-
-### Running the Latest vLLM Version
-
-You can check the vLLM version included in Triton Inference Server in the
-[Framework Containers Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html).
-*Note:* The vLLM Triton Inference Server container was introduced in the 23.10 release.
-
-You can use `pip install ...` within the container to upgrade the vLLM version.
-
-## Running Multiple Instances of Triton Server
-
-If you are running multiple instances of Triton server with a Python-based backend,
-you need to specify a different `shm-region-prefix-name` for each server. See
-[here](https://github.com/triton-inference-server/python_backend#running-multiple-instances-of-triton-server)
-for more information.
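
For example, two servers sharing a host could be launched along the following lines, each
with its own shared-memory region prefix. This is only a sketch: the model repository
path, prefixes, and port numbers are placeholders, and the `--backend-config` syntax is
the one described in the python_backend documentation linked above.

```
# Instance 1: default ports, its own shared-memory prefix
tritonserver --model-repository=/models \
             --backend-config=python,shm-region-prefix-name=prefix0

# Instance 2: a different prefix, plus non-default ports so the two servers do not collide
tritonserver --model-repository=/models \
             --backend-config=python,shm-region-prefix-name=prefix1 \
             --http-port=8003 --grpc-port=8004 --metrics-port=8005
```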
-
-## Referencing the Tutorial
-
-You can read further in the
-[vLLM Quick Deploy guide](https://github.com/triton-inference-server/tutorials/tree/main/Quick_Deploy/vLLM)
-in the
-[tutorials](https://github.com/triton-inference-server/tutorials/) repository.
\ No newline at end of file
+NOTE: You are currently on the r23.12 branch which tracks stabilization towards the next release. This branch is not usable during stabilization.