Update code/readme/UT for Ray Serve and VLLM (#298)
* make vllm fully runnable

Signed-off-by: Xinyao Wang <[email protected]>

* add ut for vllm

Signed-off-by: Xinyao Wang <[email protected]>

* update readme for ray serve

Signed-off-by: Xinyao Wang <[email protected]>

* fix bugs in ray serve

Signed-off-by: Xinyao Wang <[email protected]>

* refine code

Signed-off-by: Xinyao Wang <[email protected]>

* add ut for ray serve

Signed-off-by: Xinyao Wang <[email protected]>

* refine parameters for vllm

Signed-off-by: Xinyao Wang <[email protected]>

* fix bug in ut for ray serve

Signed-off-by: Xinyao Wang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Xinyao Wang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
XinyaoWa and pre-commit-ci[bot] authored Jul 15, 2024
1 parent 2d67724 commit dd939c5
Showing 16 changed files with 404 additions and 49 deletions.
68 changes: 50 additions & 18 deletions comps/llms/text-generation/ray_serve/README.md
@@ -2,45 +2,77 @@

[Ray](https://docs.ray.io/en/latest/serve/index.html) is an LLM serving solution that makes it easy to deploy and manage a variety of open source LLMs. Built on [Ray Serve](https://docs.ray.io/en/latest/serve/index.html), it has native support for autoscaling and multi-node deployments and is easy to use for LLM inference serving on Intel Gaudi2 accelerators. The Intel Gaudi2 accelerator supports both training and inference for deep learning models, in particular LLMs. Please visit [Habana AI products](https://habana.ai/products) for more details.

## Getting Started
## Set up environment variables

### Launch Ray Gaudi Service
```bash
export HUGGINGFACEHUB_API_TOKEN=<token>
export RAY_Serve_ENDPOINT="http://${your_ip}:8008"
export LLM_MODEL="meta-llama/Llama-2-7b-chat-hf"
```

For gated models such as `LLAMA-2`, you will have to pass the environment variable `HUGGINGFACEHUB_API_TOKEN`. Please follow this link [huggingface token](https://huggingface.co/docs/hub/security-tokens) to get the access token and export the `HUGGINGFACEHUB_API_TOKEN` environment variable with the token.
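
As a quick, optional sanity check, you can confirm the token is valid before launching anything; this sketch assumes the `huggingface_hub` CLI is installed:

```bash
# Optional: verify the exported token against the Hugging Face Hub
# (requires `pip install -U huggingface_hub`).
huggingface-cli login --token "$HUGGINGFACEHUB_API_TOKEN"
huggingface-cli whoami
```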

## Set up Ray Serve Service

### Build docker

```bash
bash ./launch_ray_service.sh
bash build_docker_rayserve.sh
```
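
After the build finishes, you can confirm the image is available locally. The image tag below is an assumption based on the `ray_serve:habana` reference in `launch_ray_service.sh`:

```bash
# List locally built ray_serve images; expect a `ray_serve:habana` entry
# if the build script uses that tag.
docker images ray_serve
```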

For gated models such as `LLAMA-2`, you need to set the environment variable `HF_TOKEN=<token>` to access the Hugging Face Hub.
### Launch Ray Serve service

```bash
bash launch_ray_service.sh
```

The `launch_ray_service.sh` script accepts five parameters:

- port_number: The port number assigned to the Ray Gaudi endpoint, with the default being 8008.
- model_name: The model name utilized for LLM, with the default set to meta-llama/Llama-2-7b-chat-hf.
- chat_processor: The chat processor for handling the prompts, with the default set to 'ChatModelNoFormat'; the optional selections are 'ChatModelLlama', 'ChatModelGptJ', and 'ChatModelGemma'.
- num_cpus_per_worker: The number of CPUs per worker process.
- num_hpus_per_worker: The number of HPUs per worker process.

Please follow this link [huggingface token](https://huggingface.co/docs/hub/security-tokens) to get the access token and export the `HF_TOKEN` environment variable with the token.
If you want to customize the port or model name, you can run:

```bash
export HF_TOKEN=<token>
bash ./launch_ray_service.sh ${port_number} ${model_name} ${chat_processor} ${num_cpus_per_worker} ${num_hpus_per_worker}
```
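
For example, a hypothetical invocation that keeps the default model but pins the remaining parameters explicitly might look like this (the resource values are illustrative, not recommendations):

```bash
# Illustrative only: port 8008, default Llama-2 chat model, Llama chat
# processor, 8 CPUs and 1 HPU per worker process.
bash ./launch_ray_service.sh 8008 meta-llama/Llama-2-7b-chat-hf ChatModelLlama 8 1
```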

### Query the service

You can then make requests with the OpenAI-compatible APIs as shown below to check the service status:

```bash
curl http://172.17.0.1:8008/v1/chat/completions \
curl http://${your_ip}:8008/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": <model_name>, "messages": [{"role": "user", "content": "How are you?"}], "max_tokens": 32 }'
-d '{"model": "Llama-2-7b-chat-hf", "messages": [{"role": "user", "content": "What is Deep Learning?"}], "max_tokens": 32 }'
```

For more information about the OpenAI APIs, you can check the [OpenAI official documentation](https://platform.openai.com/docs/api-reference/).
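
If `jq` is installed, you can extract just the generated text from the response; this assumes the endpoint returns standard OpenAI-style chat completion JSON:

```bash
# Pretty-print only the assistant message from the chat completion response.
curl -s http://${your_ip}:8008/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Llama-2-7b-chat-hf", "messages": [{"role": "user", "content": "What is Deep Learning?"}], "max_tokens": 32 }' \
  | jq -r '.choices[0].message.content'
```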

#### Customize Ray Gaudi Service
## Set up OPEA microservice

The ./serving/ray/launch_ray_service.sh script accepts five parameters:
Then we wrap the Ray Serve service into an OPEA microservice.

- port_number: The port number assigned to the Ray Gaudi endpoint, with the default being 8080.
- model_name: The model name utilized for LLM, with the default set to "meta-llama/Llama-2-7b-chat-hf".
- chat_processor: The chat processor for handling the prompts, with the default set to "ChatModelNoFormat"; the optional selections are "ChatModelLlama", "ChatModelGptJ", and "ChatModelGemma".
- num_cpus_per_worker: The number of CPUs per worker process.
- num_hpus_per_worker: The number of HPUs per worker process.
### Build docker

```bash
bash build_docker_microservice.sh
```
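
Once the build completes, the image should show up as `opea/llm-ray:latest` (the tag used in `build_docker_microservice.sh`); a quick check:

```bash
# Confirm the microservice image was built and tagged as expected.
docker images opea/llm-ray
```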

### Launch the microservice

```bash
bash launch_microservice.sh
```
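
The container is named `llm-ray-server` in `launch_microservice.sh`, so you can follow its logs to confirm it started cleanly:

```bash
# Tail the microservice logs; stop with Ctrl+C.
docker logs -f llm-ray-server
```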

You have the flexibility to customize five parameters according to your specific needs. Additionally, you can set the Ray Gaudi endpoint by exporting the environment variable `RAY_Serve_ENDPOINT`:
### Query the microservice

```bash
export RAY_Serve_ENDPOINT="http://xxx.xxx.xxx.xxx:8008"
export LLM_MODEL=<model_name> # example: export LLM_MODEL="meta-llama/Llama-2-7b-chat-hf"
curl http://${your_ip}:9000/v1/chat/completions \
-X POST \
-d '{"query":"What is Deep Learning?","max_new_tokens":17,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":false}' \
-H 'Content-Type: application/json'
```
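
The request body also accepts `"streaming":true`; a hypothetical streaming variant of the same query is shown below (the exact streamed output format depends on the microservice implementation):

```bash
# Same query with streaming enabled; -N disables curl buffering so chunks
# are printed as they arrive.
curl -N http://${your_ip}:9000/v1/chat/completions \
  -X POST \
  -d '{"query":"What is Deep Learning?","max_new_tokens":17,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":true}' \
  -H 'Content-Type: application/json'
```
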
4 changes: 3 additions & 1 deletion comps/llms/text-generation/ray_serve/api_server_openai.py
@@ -121,7 +121,9 @@ def main(argv=None):
).bind(infer_conf, infer_conf["max_num_seqs"], infer_conf["max_batch_size"])
deployment = edict(deployment)
openai_serve_run(deployment, host, route_prefix, port, infer_conf["max_concurrent_queries"])
input("Service is deployed successfully.")
# input("Service is deployed successfully.")
while 1:
pass


if __name__ == "__main__":
@@ -0,0 +1,9 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

cd ../../../../
docker build \
-t opea/llm-ray:latest \
--build-arg https_proxy=$https_proxy \
--build-arg http_proxy=$http_proxy \
-f comps/llms/text-generation/ray_serve/docker/Dockerfile.microservice .
@@ -1,7 +1,8 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

FROM vault.habana.ai/gaudi-docker/1.15.1/ubuntu22.04/habanalabs/pytorch-installer-2.2.0:latest
# FROM vault.habana.ai/gaudi-docker/1.15.1/ubuntu22.04/habanalabs/pytorch-installer-2.2.0:latest
FROM vault.habana.ai/gaudi-docker/1.16.0/ubuntu22.04/habanalabs/pytorch-installer-2.2.2:latest

ENV LANG=en_US.UTF-8

13 changes: 13 additions & 0 deletions comps/llms/text-generation/ray_serve/launch_microservice.sh
@@ -0,0 +1,13 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

docker run -d --rm \
--name="llm-ray-server" \
-p 9000:9000 \
--ipc=host \
-e http_proxy=$http_proxy \
-e https_proxy=$https_proxy \
-e RAY_Serve_ENDPOINT=$RAY_Serve_ENDPOINT \
-e HUGGINGFACEHUB_API_TOKEN=$HUGGINGFACEHUB_API_TOKEN \
-e LLM_MODEL=$LLM_MODEL \
opea/llm-ray:latest
14 changes: 13 additions & 1 deletion comps/llms/text-generation/ray_serve/launch_ray_service.sh
@@ -31,4 +31,16 @@ if [ "$#" -lt 0 ] || [ "$#" -gt 5 ]; then
fi

# Build the Docker run command based on the number of cards
docker run -it --runtime=habana --name="ray-service" -v $PWD/data:/data -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host -p $port_number:80 -e HF_TOKEN=$HUGGINGFACEHUB_API_TOKEN -e TRUST_REMOTE_CODE=True ray_serve:habana /bin/bash -c "ray start --head && python api_server_openai.py --port_number 80 --model_id_or_path $model_name --chat_processor $chat_processor --num_cpus_per_worker $num_cpus_per_worker --num_hpus_per_worker $num_hpus_per_worker"
docker run -d --rm \
--runtime=habana \
--name="ray-service" \
-v $PWD/data:/data \
-e HABANA_VISIBLE_DEVICES=all \
-e OMPI_MCA_btl_vader_single_copy_mechanism=none \
--cap-add=sys_nice \
--ipc=host \
-p $port_number:80 \
-e HF_TOKEN=$HUGGINGFACEHUB_API_TOKEN \
-e TRUST_REMOTE_CODE=True \
ray_serve:habana \
/bin/bash -c "ray start --head && python api_server_openai.py --port_number 80 --model_id_or_path $model_name --chat_processor $chat_processor --num_cpus_per_worker $num_cpus_per_worker --num_hpus_per_worker $num_hpus_per_worker"
2 changes: 2 additions & 0 deletions comps/llms/text-generation/ray_serve/llm.py
@@ -44,6 +44,8 @@ def post_process_text(text: str):
def llm_generate(input: LLMParamsDoc):
llm_endpoint = os.getenv("RAY_Serve_ENDPOINT", "http://localhost:8080")
llm_model = os.getenv("LLM_MODEL", "Llama-2-7b-chat-hf")
if "/" in llm_model:
llm_model = llm_model.split("/")[-1]
llm = ChatOpenAI(
openai_api_base=llm_endpoint + "/v1",
model_name=llm_model,
82 changes: 59 additions & 23 deletions comps/llms/text-generation/vllm/README.md
@@ -2,56 +2,55 @@

[vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference and serving. It delivers state-of-the-art serving throughput with a set of advanced features such as PagedAttention and continuous batching. Besides GPUs, vLLM already supports [Intel CPUs](https://www.intel.com/content/www/us/en/products/overview.html) and [Gaudi accelerators](https://habana.ai/products). This guide provides an example of how to launch a vLLM serving endpoint on CPU and Gaudi accelerators.

## vLLM on CPU
## Set up environment variables

```bash
export HUGGINGFACEHUB_API_TOKEN=<token>
export vLLM_ENDPOINT="http://${your_ip}:8008"
export LLM_MODEL="meta-llama/Meta-Llama-3-8B-Instruct"
```

For gated models such as `LLAMA-2`, you will have to pass the environment variable `HUGGINGFACEHUB_API_TOKEN`. Please follow this link [huggingface token](https://huggingface.co/docs/hub/security-tokens) to get the access token and export the `HUGGINGFACEHUB_API_TOKEN` environment variable with the token.

## Set up vLLM Service

### vLLM on CPU

First, let's enable vLLM on CPU.

### Build docker
#### Build docker

```bash
bash ./build_docker_vllm.sh
```

The `build_docker_vllm.sh` script accepts one parameter, `hw_mode`, to specify the hardware mode of the service, with the default being `cpu`; the optional selection is `hpu`.
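
After a CPU build you can check for the image locally; the `vllm:cpu` tag is an assumption based on the image name used in `launch_vllm_service.sh` (an HPU build would similarly produce `vllm:hpu`):

```bash
# Confirm the vLLM serving image exists after building.
docker images vllm
```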

### Launch vLLM service
#### Launch vLLM service

```bash
bash ./launch_vllm_service.sh
```

The `launch_vllm_service.sh` script accepts four parameters:

- port_number: The port number assigned to the vLLM CPU endpoint, with the default being 8008.
- model_name: The model name utilized for LLM, with the default set to 'meta-llama/Meta-Llama-3-8B-Instruct'.
- hw_mode: The hardware mode utilized for LLM, with the default set to "cpu", and the optional selection can be "hpu".
- parallel_number: the number of parallel nodes for 'hpu' mode

If you want to customize the port or model name, you can run:

```bash
bash ./launch_vllm_service.sh ${port_number} ${model_name}
```
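
For example, a hypothetical CPU launch that pins the port and model explicitly might be:

```bash
# Illustrative only: port 8008 with the default Meta-Llama-3-8B-Instruct model on CPU.
bash ./launch_vllm_service.sh 8008 meta-llama/Meta-Llama-3-8B-Instruct
```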

For gated models such as `LLAMA-2`, you will have to pass the environment variable `HUGGINGFACEHUB_API_TOKEN`. Please follow this link [huggingface token](https://huggingface.co/docs/hub/security-tokens) to get the access token and export the `HUGGINGFACEHUB_API_TOKEN` environment variable with the token.

```bash
export HUGGINGFACEHUB_API_TOKEN=<token>
```

## vLLM on Gaudi
### vLLM on Gaudi

Then we show how to enable vLLM on Gaudi.

### Build docker
#### Build docker

```bash
bash ./build_docker_vllm.sh hpu
```

Set `hw_mode` to `hpu`.

### Launch vLLM service on single node
#### Launch vLLM service on single node

For small models, we can just use a single node.

@@ -61,7 +60,19 @@ bash ./launch_vllm_service.sh ${port_number} ${model_name} hpu 1

Set `hw_mode` to `hpu` and `parallel_number` to 1.

### Launch vLLM service on multiple nodes
The `launch_vllm_service.sh` script accepts seven parameters:

- port_number: The port number assigned to the vLLM CPU endpoint, with the default being 8008.
- model_name: The model name utilized for LLM, with the default set to 'meta-llama/Meta-Llama-3-8B-Instruct'.
- hw_mode: The hardware mode utilized for LLM, with the default set to "cpu", and the optional selection can be "hpu".
- parallel_number: the number of parallel nodes for 'hpu' mode
- block_size: default set to 128 for better performance on HPU
- max_num_seqs: default set to 256 for better performance on HPU
- max_seq_len_to_capture: default set to 2048 for better performance on HPU

For more performance tuning tips, refer to [Performance tuning](https://github.com/HabanaAI/vllm-fork/blob/habana_main/README_GAUDI.md#performance-tips).
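
Putting the parameters together, a hypothetical single-card HPU launch that spells out every argument with the documented defaults could look like this:

```bash
# Illustrative only: port, model, hw_mode, parallel_number, block_size,
# max_num_seqs, max_seq_len_to_capture, in that order.
bash ./launch_vllm_service.sh 8008 meta-llama/Meta-Llama-3-8B-Instruct hpu 1 128 256 2048
```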

#### Launch vLLM service on multiple nodes

For large models such as `meta-llama/Meta-Llama-3-70b`, we need to launch on multiple nodes.

@@ -75,17 +86,42 @@ For example, if we run `meta-llama/Meta-Llama-3-70b` with 8 cards, we can use fo
bash ./launch_vllm_service.sh 8008 meta-llama/Meta-Llama-3-70b hpu 8
```

## Query the service
### Query the service

You can then make requests as shown below to check the service status:

```bash
curl http://127.0.0.1:8008/v1/completions \
curl http://${your_ip}:8008/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": <model_name>,
"model": "meta-llama/Meta-Llama-3-8B-Instruct",
"prompt": "What is Deep Learning?",
"max_tokens": 32,
"temperature": 0
}'
```
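
The OpenAI-compatible server also exposes a model listing route, which doubles as a quick liveness check (assuming the standard OpenAI route layout):

```bash
# List the models served by the endpoint; a valid JSON response indicates
# the service is up.
curl http://${your_ip}:8008/v1/models
```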

## Set up OPEA microservice

Then we wrap the vLLM service into an OPEA microservice.

### Build docker

```bash
bash build_docker_microservice.sh
```

### Launch the microservice

```bash
bash launch_microservice.sh
```
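
Before querying, you can confirm the container is running; the `llm-vllm-server` name comes from `launch_microservice.sh`:

```bash
# Check that the microservice container is up and port 9000 is published.
docker ps --filter "name=llm-vllm-server"
```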

### Query the microservice

```bash
curl http://${your_ip}:9000/v1/chat/completions \
-X POST \
-d '{"query":"What is Deep Learning?","max_new_tokens":17,"top_p":0.95,"temperature":0.01,"streaming":false}' \
  -H 'Content-Type: application/json'
```
9 changes: 9 additions & 0 deletions comps/llms/text-generation/vllm/build_docker_microservice.sh
@@ -0,0 +1,9 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

cd ../../../../
docker build \
-t opea/llm-vllm:latest \
--build-arg https_proxy=$https_proxy \
--build-arg http_proxy=$http_proxy \
-f comps/llms/text-generation/vllm/docker/Dockerfile.microservice .
1 change: 1 addition & 0 deletions comps/llms/text-generation/vllm/docker/Dockerfile.hpu
@@ -1,3 +1,4 @@
# FROM vault.habana.ai/gaudi-docker/1.15.1/ubuntu22.04/habanalabs/pytorch-installer-2.2.0:latest
FROM vault.habana.ai/gaudi-docker/1.16.0/ubuntu22.04/habanalabs/pytorch-installer-2.2.2:latest

ENV LANG=en_US.UTF-8
13 changes: 13 additions & 0 deletions comps/llms/text-generation/vllm/launch_microservice.sh
@@ -0,0 +1,13 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

docker run -d --rm \
--name="llm-vllm-server" \
-p 9000:9000 \
--ipc=host \
-e http_proxy=$http_proxy \
-e https_proxy=$https_proxy \
-e vLLM_ENDPOINT=$vLLM_ENDPOINT \
-e HUGGINGFACEHUB_API_TOKEN=$HUGGINGFACEHUB_API_TOKEN \
-e LLM_MODEL=$LLM_MODEL \
opea/llm-vllm:latest
13 changes: 11 additions & 2 deletions comps/llms/text-generation/vllm/launch_vllm_service.sh
@@ -4,15 +4,21 @@

# Set default values
default_port=8008
default_model="meta-llama/Meta-Llama-3-8B-Instruct"
default_model=$LLM_MODEL
default_hw_mode="cpu"
default_parallel_number=1
default_block_size=128
default_max_num_seqs=256
default_max_seq_len_to_capture=2048

# Assign arguments to variables
port_number=${1:-$default_port}
model_name=${2:-$default_model}
hw_mode=${3:-$default_hw_mode}
parallel_number=${4:-$default_parallel_number}
block_size=${5:-$default_block_size}
max_num_seqs=${6:-$default_max_num_seqs}
max_seq_len_to_capture=${7:-$default_max_seq_len_to_capture}

# Check if all required arguments are provided
if [ "$#" -lt 0 ] || [ "$#" -gt 4 ]; then
@@ -21,6 +27,9 @@ if [ "$#" -lt 0 ] || [ "$#" -gt 4 ]; then
echo "model_name: The model name utilized for LLM, with the default set to 'meta-llama/Meta-Llama-3-8B-Instruct'."
echo "hw_mode: The hardware mode utilized for LLM, with the default set to 'cpu', and the optional selection can be 'hpu'"
echo "parallel_number: parallel nodes number for 'hpu' mode"
echo "block_size: default set to 128 for better performance on HPU"
echo "max_num_seqs: default set to 256 for better performance on HPU"
echo "max_seq_len_to_capture: default set to 2048 for better performance on HPU"
exit 1
fi

@@ -29,7 +38,7 @@ volume=$PWD/data

# Build the Docker run command based on hardware mode
if [ "$hw_mode" = "hpu" ]; then
docker run -d --rm--runtime=habana --rm --name="vllm-service" -p $port_number:80 -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host -e HTTPS_PROXY=$https_proxy -e HTTP_PROXY=$https_proxy -e HF_TOKEN=${HUGGINGFACEHUB_API_TOKEN} vllm:hpu /bin/bash -c "export VLLM_CPU_KVCACHE_SPACE=40 && python3 -m vllm.entrypoints.openai.api_server --enforce-eager --model $model_name --tensor-parallel-size $parallel_number --host 0.0.0.0 --port 80"
docker run -d --rm --runtime=habana --name="vllm-service" -p $port_number:80 -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host -e HTTPS_PROXY=$https_proxy -e HTTP_PROXY=$https_proxy -e HF_TOKEN=${HUGGINGFACEHUB_API_TOKEN} vllm:hpu /bin/bash -c "export VLLM_CPU_KVCACHE_SPACE=40 && python3 -m vllm.entrypoints.openai.api_server --enforce-eager --model $model_name --tensor-parallel-size $parallel_number --host 0.0.0.0 --port 80 --block-size $block_size --max-num-seqs $max_num_seqs --max-seq-len-to-capture $max_seq_len_to_capture "
else
docker run -d --rm --name="vllm-service" -p $port_number:80 --network=host -v $volume:/data -e HTTPS_PROXY=$https_proxy -e HTTP_PROXY=$https_proxy -e HF_TOKEN=${HUGGINGFACEHUB_API_TOKEN} vllm:cpu /bin/bash -c "cd / && export VLLM_CPU_KVCACHE_SPACE=40 && python3 -m vllm.entrypoints.openai.api_server --model $model_name --host 0.0.0.0 --port 80"
fi
5 changes: 2 additions & 3 deletions comps/llms/text-generation/vllm/llm.py
@@ -31,16 +31,15 @@ def post_process_text(text: str):
)
@traceable(run_type="llm")
def llm_generate(input: LLMParamsDoc):
llm_endpoint = os.getenv("vLLM_LLM_ENDPOINT", "http://localhost:8008")
model_name = os.getenv("LLM_MODEL_ID", "meta-llama/Meta-Llama-3-8B-Instruct")
llm_endpoint = os.getenv("vLLM_ENDPOINT", "http://localhost:8008")
model_name = os.getenv("LLM_MODEL", "meta-llama/Meta-Llama-3-8B-Instruct")
llm = VLLMOpenAI(
openai_api_key="EMPTY",
openai_api_base=llm_endpoint + "/v1",
max_tokens=input.max_new_tokens,
model_name=model_name,
top_p=input.top_p,
temperature=input.temperature,
presence_penalty=input.repetition_penalty,
streaming=input.streaming,
)

