From cd22094166ed98c56e200c8dfafc89f58546194e Mon Sep 17 00:00:00 2001
From: tanmayv25
Date: Thu, 7 Dec 2023 02:25:00 -0800
Subject: [PATCH] Update README and versions for 23.12 branch

---
 README.md | 180 +-----------------------------------------------------
 1 file changed, 1 insertion(+), 179 deletions(-)

diff --git a/README.md b/README.md
index 3bf38476..4c7ba05f 100644
--- a/README.md
+++ b/README.md
@@ -28,182 +28,4 @@
 [![License](https://img.shields.io/badge/License-BSD3-lightgrey.svg)](https://opensource.org/licenses/BSD-3-Clause)
 
-**LATEST RELEASE: You are currently on the main branch, which tracks
-under-development progress towards the next release. The current release branch
-is [r23.10](https://github.com/triton-inference-server/vllm_backend/tree/r23.10),
-which corresponds to the 23.10 container release on
-[NVIDIA GPU Cloud (NGC)](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver).**
-
-# vLLM Backend
-
-The Triton backend for [vLLM](https://github.com/vllm-project/vllm)
-is designed to run
-[supported models](https://vllm.readthedocs.io/en/latest/models/supported_models.html)
-on a
-[vLLM engine](https://github.com/vllm-project/vllm/blob/main/vllm/engine/async_llm_engine.py).
-You can learn more about Triton backends in the
-[backend repo](https://github.com/triton-inference-server/backend).
-
-This is a [Python-based backend](https://github.com/triton-inference-server/backend/blob/main/docs/python_based_backends.md#python-based-backends).
-When using this backend, all requests are placed on the
-vLLM AsyncEngine as soon as they are received. Inflight batching and paged attention are handled
-by the vLLM engine.
-
-Where can I ask general questions about Triton and Triton backends?
-Be sure to read all the information below as well as the
-[general Triton documentation](https://github.com/triton-inference-server/server#triton-inference-server)
-available in the main [server](https://github.com/triton-inference-server/server)
-repo. If you don't find your answer there, you can ask questions on the
-main Triton [issues page](https://github.com/triton-inference-server/server/issues).
-
-## Installing the vLLM Backend
-
-There are several ways to install and deploy the vLLM backend.
-
-### Option 1. Use the Pre-Built Docker Container
-
-Pull a `tritonserver:<xx.yy>-vllm-python-py3` container with the vLLM backend from the
-[NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver)
-registry. `<xx.yy>` is the version of Triton that you want to use. Please note
-that Triton's vLLM container was first introduced in the 23.10 release.
-
-```
-docker pull nvcr.io/nvidia/tritonserver:<xx.yy>-vllm-python-py3
-```
-
-### Option 2. Build a Custom Container From Source
-
-You can follow the steps described in the
-[Building With Docker](https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/build.md#building-with-docker)
-guide and use the
-[build.py](https://github.com/triton-inference-server/server/blob/main/build.py)
-script.
-
-A sample command to build a Triton Server container with all options enabled is shown below. Feel free to customize the flags according to your needs.
-
-```
-./build.py -v --enable-logging
-    --enable-stats
-    --enable-tracing
-    --enable-metrics
-    --enable-gpu-metrics
-    --enable-cpu-metrics
-    --enable-gpu
-    --filesystem=gcs
-    --filesystem=s3
-    --filesystem=azure_storage
-    --endpoint=http
-    --endpoint=grpc
-    --endpoint=sagemaker
-    --endpoint=vertex-ai
-    --upstream-container-version=23.10
-    --backend=python:r23.10
-    --backend=vllm:r23.10
-```
-
-### Option 3. Add the vLLM Backend to the Default Triton Container
-
-You can install the vLLM backend directly into the NGC Triton container.
-In this case, please install vLLM first. You can do so by running
-`pip install vllm==<vllm_version>`. Then, set up the vLLM backend in the
-container with the following commands:
-
-```
-mkdir -p /opt/tritonserver/backends/vllm
-wget -P /opt/tritonserver/backends/vllm https://raw.githubusercontent.com/triton-inference-server/vllm_backend/main/src/model.py
-```
-
-## Using the vLLM Backend
-
-You can see an example
-[model_repository](samples/model_repository)
-in the [samples](samples) folder.
-You can use this as is and change the model by changing the `model` value in `model.json`.
-`model.json` represents a key-value dictionary that is fed to vLLM's AsyncLLMEngine when initializing the model.
-You can see the supported arguments in vLLM's
-[arg_utils.py](https://github.com/vllm-project/vllm/blob/main/vllm/engine/arg_utils.py),
-specifically
-[here](https://github.com/vllm-project/vllm/blob/ee8217e5bee5860469204ee57077a91138c9af02/vllm/engine/arg_utils.py#L11)
-and
-[here](https://github.com/vllm-project/vllm/blob/ee8217e5bee5860469204ee57077a91138c9af02/vllm/engine/arg_utils.py#L201).
-
-For multi-GPU support, EngineArgs like `tensor_parallel_size` can be specified in
-[model.json](samples/model_repository/vllm_model/1/model.json).
-
-Note: vLLM greedily consumes up to 90% of the GPU's memory under default settings.
-The sample model updates this behavior by setting `gpu_memory_utilization` to 50%.
-You can tweak this behavior using fields like `gpu_memory_utilization` and other settings in
-[model.json](samples/model_repository/vllm_model/1/model.json).
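
For illustration, a minimal `model.json` along these lines would load a model, cap GPU
memory usage at 50%, and shard the model across two GPUs. This is only a sketch, not the
exact contents of the shipped sample; the model name and values are placeholders to
replace with your own:

```
{
    "model": "facebook/opt-125m",
    "gpu_memory_utilization": 0.5,
    "tensor_parallel_size": 2
}
```

Each key corresponds to an engine argument from vLLM's `arg_utils.py`, so any other
supported argument can be added to the dictionary in the same way.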
-
-### Launching Triton Inference Server
-
-Once you have the model repository set up, it is time to launch the Triton server.
-We will use the [pre-built Triton container with the vLLM backend](#option-1-use-the-pre-built-docker-container) from
-[NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver) in this example.
-
-```
-docker run --gpus all -it --net=host --rm -p 8001:8001 --shm-size=1G --ulimit memlock=-1 --ulimit stack=67108864 -v ${PWD}:/work -w /work nvcr.io/nvidia/tritonserver:<xx.yy>-vllm-python-py3 tritonserver --model-repository ./model_repository
-```
-
-Replace `<xx.yy>` with the version of Triton that you want to use.
-Note that Triton's vLLM container was first published with the 23.10 release.
-
-After you start Triton you will see output on the console showing
-the server starting up and loading the model. When you see output
-like the following, Triton is ready to accept inference requests.
-
-```
-I1030 22:33:28.291908 1 grpc_server.cc:2513] Started GRPCInferenceService at 0.0.0.0:8001
-I1030 22:33:28.292879 1 http_server.cc:4497] Started HTTPService at 0.0.0.0:8000
-I1030 22:33:28.335154 1 http_server.cc:270] Started Metrics Service at 0.0.0.0:8002
-```
-
-### Sending Your First Inference
-
-After you
-[start Triton](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/getting_started/quickstart.html)
-with the
-[sample model_repository](samples/model_repository),
-you can quickly run your first inference request with the
-[generate endpoint](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_generate.md).
-
-Try out the command below.
-
-```
-$ curl -X POST localhost:8000/v2/models/vllm_model/generate -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": false, "temperature": 0}}'
-```
-
-Upon success, you should see a response from the server like this one:
-```
-{"model_name":"vllm_model","model_version":"1","text_output":"What is Triton Inference Server?\n\nTriton Inference Server is a server that is used by many"}
-```
-
-In the [samples](samples) folder, you can also find a sample client,
-[client.py](samples/client.py), which uses Triton's
-[asyncio gRPC client library](https://github.com/triton-inference-server/client#python-asyncio-support-beta-1)
-to run inference on Triton.
-
-### Running the Latest vLLM Version
-
-You can check the vLLM version included in Triton Inference Server in the
-[Framework Containers Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html).
-*Note:* The vLLM Triton Inference Server container was introduced in the 23.10 release.
-
-You can use `pip install ...` within the container to upgrade the vLLM version.
-
-## Running Multiple Instances of Triton Server
-
-If you are running multiple instances of Triton server with a Python-based backend,
-you need to specify a different `shm-region-prefix-name` for each server. See
-[here](https://github.com/triton-inference-server/python_backend#running-multiple-instances-of-triton-server)
-for more information.
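
For example, two servers sharing a host could be launched along the following lines, each
with its own shared-memory region prefix. This is only a sketch: the model repository
path, prefixes, and port numbers are placeholders, and the `--backend-config` syntax is
the one described in the python_backend documentation linked above.

```
# Instance 1: default ports, its own shared-memory prefix
tritonserver --model-repository=/models \
             --backend-config=python,shm-region-prefix-name=prefix0

# Instance 2: a different prefix, plus non-default ports so the two servers do not collide
tritonserver --model-repository=/models \
             --backend-config=python,shm-region-prefix-name=prefix1 \
             --http-port=8003 --grpc-port=8004 --metrics-port=8005
```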
-
-## Referencing the Tutorial
-
-You can read further in the
-[vLLM Quick Deploy guide](https://github.com/triton-inference-server/tutorials/tree/main/Quick_Deploy/vLLM)
-in the
-[tutorials](https://github.com/triton-inference-server/tutorials/) repository.
\ No newline at end of file
+NOTE: You are currently on the r23.12 branch which tracks stabilization towards the next release. This branch is not usable during stabilization.