Update code/readme/UT for Ray Serve and VLLM (#298)
* make vllm fully runnable

Signed-off-by: Xinyao Wang <[email protected]>

* add ut for vllm

Signed-off-by: Xinyao Wang <[email protected]>

* update readme for ray serve

Signed-off-by: Xinyao Wang <[email protected]>

* fix bugs in ray serve

Signed-off-by: Xinyao Wang <[email protected]>

* refine code

Signed-off-by: Xinyao Wang <[email protected]>

* add ut for ray serve

Signed-off-by: Xinyao Wang <[email protected]>

* refine parameters for vllm

Signed-off-by: Xinyao Wang <[email protected]>

* fix bug in ut for ray serve

Signed-off-by: Xinyao Wang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Xinyao Wang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
XinyaoWa and pre-commit-ci[bot] authored Jul 15, 2024
1 parent 2d67724 commit dd939c5
Showing 16 changed files with 404 additions and 49 deletions.
68 changes: 50 additions & 18 deletions comps/llms/text-generation/ray_serve/README.md
@@ -2,45 +2,77 @@

[Ray](https://docs.ray.io/en/latest/serve/index.html) is an LLM serving solution that makes it easy to deploy and manage a variety of open source LLMs. Built on [Ray Serve](https://docs.ray.io/en/latest/serve/index.html), it has native support for autoscaling and multi-node deployments and is easy to use for LLM inference serving on Intel Gaudi2 accelerators. The Intel Gaudi2 accelerator supports both training and inference for deep learning models, in particular LLMs. Please visit [Habana AI products](https://habana.ai/products) for more details.

## Getting Started
## Set up environment variables

### Launch Ray Gaudi Service
```bash
export HUGGINGFACEHUB_API_TOKEN=<token>
export RAY_Serve_ENDPOINT="http://${your_ip}:8008"
export LLM_MODEL="meta-llama/Llama-2-7b-chat-hf"
```

For gated models such as `LLAMA-2`, you will have to pass the environment variable `HUGGINGFACEHUB_API_TOKEN`. Please follow this link [huggingface token](https://huggingface.co/docs/hub/security-tokens) to get the access token and export the `HUGGINGFACEHUB_API_TOKEN` environment variable with the token.
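
As a quick, optional sanity check, you can confirm the token is valid before launching anything; this sketch assumes the `huggingface_hub` CLI is installed:

```bash
# Optional: verify the exported token against the Hugging Face Hub
# (requires `pip install -U huggingface_hub`).
huggingface-cli login --token "$HUGGINGFACEHUB_API_TOKEN"
huggingface-cli whoami
```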

## Set up Ray Serve Service

### Build docker

```bash
bash ./launch_ray_service.sh
bash build_docker_rayserve.sh
```
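
After the build finishes, you can confirm the image is available locally. The image tag below is an assumption based on the `ray_serve:habana` reference in `launch_ray_service.sh`:

```bash
# List locally built ray_serve images; expect a `ray_serve:habana` entry
# if the build script uses that tag.
docker images ray_serve
```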

For gated models such as `LLAMA-2`, you need to set the environment variable `HF_TOKEN=<token>` to access the Hugging Face Hub.
### Launch Ray Serve service

```bash
bash launch_ray_service.sh
```

The `launch_ray_service.sh` script accepts five parameters:

- port_number: The port number assigned to the Ray Gaudi endpoint, with the default being 8008.
- model_name: The model name utilized for LLM, with the default set to meta-llama/Llama-2-7b-chat-hf.
- chat_processor: The chat processor for handling the prompts, with the default set to 'ChatModelNoFormat'; the optional selections are 'ChatModelLlama', 'ChatModelGptJ', and 'ChatModelGemma'.
- num_cpus_per_worker: The number of CPUs per worker process.
- num_hpus_per_worker: The number of HPUs per worker process.

Please follow this link [huggingface token](https://huggingface.co/docs/hub/security-tokens) to get the access token and export the `HF_TOKEN` environment variable with the token.
If you want to customize the port or model name, you can run:

```bash
export HF_TOKEN=<token>
bash ./launch_ray_service.sh ${port_number} ${model_name} ${chat_processor} ${num_cpus_per_worker} ${num_hpus_per_worker}
```
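
For example, a hypothetical invocation that keeps the default model but pins the remaining parameters explicitly might look like this (the resource values are illustrative, not recommendations):

```bash
# Illustrative only: port 8008, default Llama-2 chat model, Llama chat
# processor, 8 CPUs and 1 HPU per worker process.
bash ./launch_ray_service.sh 8008 meta-llama/Llama-2-7b-chat-hf ChatModelLlama 8 1
```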

### Query the service

You can then make requests with the OpenAI-compatible APIs as shown below to check the service status:

```bash
curl http://172.17.0.1:8008/v1/chat/completions \
curl http://${your_ip}:8008/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": <model_name>, "messages": [{"role": "user", "content": "How are you?"}], "max_tokens": 32 }'
-d '{"model": "Llama-2-7b-chat-hf", "messages": [{"role": "user", "content": "What is Deep Learning?"}], "max_tokens": 32 }'
```

For more information about the OpenAI APIs, you can check the [OpenAI official documentation](https://platform.openai.com/docs/api-reference/).
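
If `jq` is installed, you can extract just the generated text from the response; this assumes the endpoint returns standard OpenAI-style chat completion JSON:

```bash
# Pretty-print only the assistant message from the chat completion response.
curl -s http://${your_ip}:8008/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Llama-2-7b-chat-hf", "messages": [{"role": "user", "content": "What is Deep Learning?"}], "max_tokens": 32 }' \
  | jq -r '.choices[0].message.content'
```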

#### Customize Ray Gaudi Service
## Set up OPEA microservice

The ./serving/ray/launch_ray_service.sh script accepts five parameters:
Then we wrap the Ray Serve service into an OPEA microservice.

- port_number: The port number assigned to the Ray Gaudi endpoint, with the default being 8080.
- model_name: The model name utilized for LLM, with the default set to "meta-llama/Llama-2-7b-chat-hf".
- chat_processor: The chat processor for handling the prompts, with the default set to "ChatModelNoFormat"; the optional selections are "ChatModelLlama", "ChatModelGptJ", and "ChatModelGemma".
- num_cpus_per_worker: The number of CPUs per worker process.
- num_hpus_per_worker: The number of HPUs per worker process.
### Build docker

```bash
bash build_docker_microservice.sh
```
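
Once the build completes, the image should show up as `opea/llm-ray:latest` (the tag used in `build_docker_microservice.sh`); a quick check:

```bash
# Confirm the microservice image was built and tagged as expected.
docker images opea/llm-ray
```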

### Launch the microservice

```bash
bash launch_microservice.sh
```
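
The container is named `llm-ray-server` in `launch_microservice.sh`, so you can follow its logs to confirm it started cleanly:

```bash
# Tail the microservice logs; stop with Ctrl+C.
docker logs -f llm-ray-server
```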

You have the flexibility to customize five parameters according to your specific needs. Additionally, you can set the Ray Gaudi endpoint by exporting the environment variable `RAY_Serve_ENDPOINT`:
### Query the microservice

```bash
export RAY_Serve_ENDPOINT="http://xxx.xxx.xxx.xxx:8008"
export LLM_MODEL=<model_name> # example: export LLM_MODEL="meta-llama/Llama-2-7b-chat-hf"
curl http://${your_ip}:9000/v1/chat/completions \
-X POST \
-d '{"query":"What is Deep Learning?","max_new_tokens":17,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":false}' \
-H 'Content-Type: application/json'
```
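
The request body also accepts `"streaming":true`; a hypothetical streaming variant of the same query is shown below (the exact streamed output format depends on the microservice implementation):

```bash
# Same query with streaming enabled; -N disables curl buffering so chunks
# are printed as they arrive.
curl -N http://${your_ip}:9000/v1/chat/completions \
  -X POST \
  -d '{"query":"What is Deep Learning?","max_new_tokens":17,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":true}' \
  -H 'Content-Type: application/json'
```
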
4 changes: 3 additions & 1 deletion comps/llms/text-generation/ray_serve/api_server_openai.py
@@ -121,7 +121,9 @@ def main(argv=None):
).bind(infer_conf, infer_conf["max_num_seqs"], infer_conf["max_batch_size"])
deployment = edict(deployment)
openai_serve_run(deployment, host, route_prefix, port, infer_conf["max_concurrent_queries"])
input("Service is deployed successfully.")
# input("Service is deployed successfully.")
while 1:
pass


if __name__ == "__main__":
@@ -0,0 +1,9 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

cd ../../../../
docker build \
-t opea/llm-ray:latest \
--build-arg https_proxy=$https_proxy \
--build-arg http_proxy=$http_proxy \
-f comps/llms/text-generation/ray_serve/docker/Dockerfile.microservice .
@@ -1,7 +1,8 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

FROM vault.habana.ai/gaudi-docker/1.15.1/ubuntu22.04/habanalabs/pytorch-installer-2.2.0:latest
# FROM vault.habana.ai/gaudi-docker/1.15.1/ubuntu22.04/habanalabs/pytorch-installer-2.2.0:latest
FROM vault.habana.ai/gaudi-docker/1.16.0/ubuntu22.04/habanalabs/pytorch-installer-2.2.2:latest

ENV LANG=en_US.UTF-8

13 changes: 13 additions & 0 deletions comps/llms/text-generation/ray_serve/launch_microservice.sh
@@ -0,0 +1,13 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

docker run -d --rm \
--name="llm-ray-server" \
-p 9000:9000 \
--ipc=host \
-e http_proxy=$http_proxy \
-e https_proxy=$https_proxy \
-e RAY_Serve_ENDPOINT=$RAY_Serve_ENDPOINT \
-e HUGGINGFACEHUB_API_TOKEN=$HUGGINGFACEHUB_API_TOKEN \
-e LLM_MODEL=$LLM_MODEL \
opea/llm-ray:latest
14 changes: 13 additions & 1 deletion comps/llms/text-generation/ray_serve/launch_ray_service.sh
@@ -31,4 +31,16 @@ if [ "$#" -lt 0 ] || [ "$#" -gt 5 ]; then
fi

# Build the Docker run command based on the number of cards
docker run -it --runtime=habana --name="ray-service" -v $PWD/data:/data -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host -p $port_number:80 -e HF_TOKEN=$HUGGINGFACEHUB_API_TOKEN -e TRUST_REMOTE_CODE=True ray_serve:habana /bin/bash -c "ray start --head && python api_server_openai.py --port_number 80 --model_id_or_path $model_name --chat_processor $chat_processor --num_cpus_per_worker $num_cpus_per_worker --num_hpus_per_worker $num_hpus_per_worker"
docker run -d --rm \
--runtime=habana \
--name="ray-service" \
-v $PWD/data:/data \
-e HABANA_VISIBLE_DEVICES=all \
-e OMPI_MCA_btl_vader_single_copy_mechanism=none \
--cap-add=sys_nice \
--ipc=host \
-p $port_number:80 \
-e HF_TOKEN=$HUGGINGFACEHUB_API_TOKEN \
-e TRUST_REMOTE_CODE=True \
ray_serve:habana \
/bin/bash -c "ray start --head && python api_server_openai.py --port_number 80 --model_id_or_path $model_name --chat_processor $chat_processor --num_cpus_per_worker $num_cpus_per_worker --num_hpus_per_worker $num_hpus_per_worker"
2 changes: 2 additions & 0 deletions comps/llms/text-generation/ray_serve/llm.py
@@ -44,6 +44,8 @@ def post_process_text(text: str):
def llm_generate(input: LLMParamsDoc):
llm_endpoint = os.getenv("RAY_Serve_ENDPOINT", "http://localhost:8080")
llm_model = os.getenv("LLM_MODEL", "Llama-2-7b-chat-hf")
if "/" in llm_model:
llm_model = llm_model.split("/")[-1]
llm = ChatOpenAI(
openai_api_base=llm_endpoint + "/v1",
model_name=llm_model,
82 changes: 59 additions & 23 deletions comps/llms/text-generation/vllm/README.md
@@ -2,56 +2,55 @@

[vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference and serving. It delivers state-of-the-art serving throughput with a set of advanced features such as PagedAttention and continuous batching. Besides GPUs, vLLM already supports [Intel CPUs](https://www.intel.com/content/www/us/en/products/overview.html) and [Gaudi accelerators](https://habana.ai/products). This guide provides an example of how to launch a vLLM serving endpoint on CPU and Gaudi accelerators.

## vLLM on CPU
## Set up environment variables

```bash
export HUGGINGFACEHUB_API_TOKEN=<token>
export vLLM_ENDPOINT="http://${your_ip}:8008"
export LLM_MODEL="meta-llama/Meta-Llama-3-8B-Instruct"
```

For gated models such as `LLAMA-2`, you will have to pass the environment variable `HUGGINGFACEHUB_API_TOKEN`. Please follow this link [huggingface token](https://huggingface.co/docs/hub/security-tokens) to get the access token and export the `HUGGINGFACEHUB_API_TOKEN` environment variable with the token.

## Set up vLLM Service

### vLLM on CPU

First, let's enable vLLM on CPU.

### Build docker
#### Build docker

```bash
bash ./build_docker_vllm.sh
```

The `build_docker_vllm.sh` script accepts one parameter, `hw_mode`, to specify the hardware mode of the service, with the default being `cpu`; the optional selection is `hpu`.
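
After a CPU build you can check for the image locally; the `vllm:cpu` tag is an assumption based on the image name used in `launch_vllm_service.sh` (an HPU build would similarly produce `vllm:hpu`):

```bash
# Confirm the vLLM serving image exists after building.
docker images vllm
```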

### Launch vLLM service
#### Launch vLLM service

```bash
bash ./launch_vllm_service.sh
```

The `launch_vllm_service.sh` script accepts four parameters:

- port_number: The port number assigned to the vLLM CPU endpoint, with the default being 8008.
- model_name: The model name utilized for LLM, with the default set to 'meta-llama/Meta-Llama-3-8B-Instruct'.
- hw_mode: The hardware mode utilized for LLM, with the default set to "cpu", and the optional selection can be "hpu".
- parallel_number: the number of parallel nodes for 'hpu' mode

If you want to customize the port or model name, you can run:

```bash
bash ./launch_vllm_service.sh ${port_number} ${model_name}
```
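
For example, a hypothetical CPU launch that pins the port and model explicitly might be:

```bash
# Illustrative only: port 8008 with the default Meta-Llama-3-8B-Instruct model on CPU.
bash ./launch_vllm_service.sh 8008 meta-llama/Meta-Llama-3-8B-Instruct
```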

For gated models such as `LLAMA-2`, you will have to pass the environment variable `HUGGINGFACEHUB_API_TOKEN`. Please follow this link [huggingface token](https://huggingface.co/docs/hub/security-tokens) to get the access token and export the `HUGGINGFACEHUB_API_TOKEN` environment variable with the token.

```bash
export HUGGINGFACEHUB_API_TOKEN=<token>
```

## vLLM on Gaudi
### vLLM on Gaudi

Then we show how to enable vLLM on Gaudi.

### Build docker
#### Build docker

```bash
bash ./build_docker_vllm.sh hpu
```

Set `hw_mode` to `hpu`.

### Launch vLLM service on single node
#### Launch vLLM service on single node

For small models, we can just use a single node.

@@ -61,7 +60,19 @@ bash ./launch_vllm_service.sh ${port_number} ${model_name} hpu 1

Set `hw_mode` to `hpu` and `parallel_number` to 1.

### Launch vLLM service on multiple nodes
The `launch_vllm_service.sh` script accepts seven parameters:

- port_number: The port number assigned to the vLLM CPU endpoint, with the default being 8008.
- model_name: The model name utilized for LLM, with the default set to 'meta-llama/Meta-Llama-3-8B-Instruct'.
- hw_mode: The hardware mode utilized for LLM, with the default set to "cpu", and the optional selection can be "hpu".
- parallel_number: the number of parallel nodes for 'hpu' mode
- block_size: default set to 128 for better performance on HPU
- max_num_seqs: default set to 256 for better performance on HPU
- max_seq_len_to_capture: default set to 2048 for better performance on HPU

For more performance tuning tips, refer to [Performance tuning](https://github.com/HabanaAI/vllm-fork/blob/habana_main/README_GAUDI.md#performance-tips).
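
Putting the parameters together, a hypothetical single-card HPU launch that spells out every argument with the documented defaults could look like this:

```bash
# Illustrative only: port, model, hw_mode, parallel_number, block_size,
# max_num_seqs, max_seq_len_to_capture, in that order.
bash ./launch_vllm_service.sh 8008 meta-llama/Meta-Llama-3-8B-Instruct hpu 1 128 256 2048
```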

#### Launch vLLM service on multiple nodes

For large models such as `meta-llama/Meta-Llama-3-70b`, we need to launch on multiple nodes.

@@ -75,17 +86,42 @@ For example, if we run `meta-llama/Meta-Llama-3-70b` with 8 cards, we can use fo
bash ./launch_vllm_service.sh 8008 meta-llama/Meta-Llama-3-70b hpu 8
```

## Query the service
### Query the service

You can then make requests as shown below to check the service status:

```bash
curl http://127.0.0.1:8008/v1/completions \
curl http://${your_ip}:8008/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": <model_name>,
"model": "meta-llama/Meta-Llama-3-8B-Instruct",
"prompt": "What is Deep Learning?",
"max_tokens": 32,
"temperature": 0
}'
```
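
The OpenAI-compatible server also exposes a model listing route, which doubles as a quick liveness check (assuming the standard OpenAI route layout):

```bash
# List the models served by the endpoint; a valid JSON response indicates
# the service is up.
curl http://${your_ip}:8008/v1/models
```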

## Set up OPEA microservice

Then we wrap the vLLM service into an OPEA microservice.

### Build docker

```bash
bash build_docker_microservice.sh
```

### Launch the microservice

```bash
bash launch_microservice.sh
```
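
Before querying, you can confirm the container is running; the `llm-vllm-server` name comes from `launch_microservice.sh`:

```bash
# Check that the microservice container is up and port 9000 is published.
docker ps --filter "name=llm-vllm-server"
```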

### Query the microservice

```bash
curl http://${your_ip}:9000/v1/chat/completions \
-X POST \
-d '{"query":"What is Deep Learning?","max_new_tokens":17,"top_p":0.95,"temperature":0.01,"streaming":false}' \
  -H 'Content-Type: application/json'
```
9 changes: 9 additions & 0 deletions comps/llms/text-generation/vllm/build_docker_microservice.sh
@@ -0,0 +1,9 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

cd ../../../../
docker build \
-t opea/llm-vllm:latest \
--build-arg https_proxy=$https_proxy \
--build-arg http_proxy=$http_proxy \
-f comps/llms/text-generation/vllm/docker/Dockerfile.microservice .
1 change: 1 addition & 0 deletions comps/llms/text-generation/vllm/docker/Dockerfile.hpu
@@ -1,3 +1,4 @@
# FROM vault.habana.ai/gaudi-docker/1.15.1/ubuntu22.04/habanalabs/pytorch-installer-2.2.0:latest
FROM vault.habana.ai/gaudi-docker/1.16.0/ubuntu22.04/habanalabs/pytorch-installer-2.2.2:latest

ENV LANG=en_US.UTF-8
13 changes: 13 additions & 0 deletions comps/llms/text-generation/vllm/launch_microservice.sh
@@ -0,0 +1,13 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

docker run -d --rm \
--name="llm-vllm-server" \
-p 9000:9000 \
--ipc=host \
-e http_proxy=$http_proxy \
-e https_proxy=$https_proxy \
-e vLLM_ENDPOINT=$vLLM_ENDPOINT \
-e HUGGINGFACEHUB_API_TOKEN=$HUGGINGFACEHUB_API_TOKEN \
-e LLM_MODEL=$LLM_MODEL \
opea/llm-vllm:latest
13 changes: 11 additions & 2 deletions comps/llms/text-generation/vllm/launch_vllm_service.sh
@@ -4,15 +4,21 @@

# Set default values
default_port=8008
default_model="meta-llama/Meta-Llama-3-8B-Instruct"
default_model=$LLM_MODEL
default_hw_mode="cpu"
default_parallel_number=1
default_block_size=128
default_max_num_seqs=256
default_max_seq_len_to_capture=2048

# Assign arguments to variables
port_number=${1:-$default_port}
model_name=${2:-$default_model}
hw_mode=${3:-$default_hw_mode}
parallel_number=${4:-$default_parallel_number}
block_size=${5:-$default_block_size}
max_num_seqs=${6:-$default_max_num_seqs}
max_seq_len_to_capture=${7:-$default_max_seq_len_to_capture}

# Check if all required arguments are provided
if [ "$#" -lt 0 ] || [ "$#" -gt 4 ]; then
@@ -21,6 +27,9 @@ if [ "$#" -lt 0 ] || [ "$#" -gt 4 ]; then
echo "model_name: The model name utilized for LLM, with the default set to 'meta-llama/Meta-Llama-3-8B-Instruct'."
echo "hw_mode: The hardware mode utilized for LLM, with the default set to 'cpu', and the optional selection can be 'hpu'"
echo "parallel_number: parallel nodes number for 'hpu' mode"
echo "block_size: default set to 128 for better performance on HPU"
echo "max_num_seqs: default set to 256 for better performance on HPU"
echo "max_seq_len_to_capture: default set to 2048 for better performance on HPU"
exit 1
fi

@@ -29,7 +38,7 @@ volume=$PWD/data

# Build the Docker run command based on hardware mode
if [ "$hw_mode" = "hpu" ]; then
docker run -d --rm--runtime=habana --rm --name="vllm-service" -p $port_number:80 -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host -e HTTPS_PROXY=$https_proxy -e HTTP_PROXY=$https_proxy -e HF_TOKEN=${HUGGINGFACEHUB_API_TOKEN} vllm:hpu /bin/bash -c "export VLLM_CPU_KVCACHE_SPACE=40 && python3 -m vllm.entrypoints.openai.api_server --enforce-eager --model $model_name --tensor-parallel-size $parallel_number --host 0.0.0.0 --port 80"
docker run -d --rm --runtime=habana --name="vllm-service" -p $port_number:80 -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host -e HTTPS_PROXY=$https_proxy -e HTTP_PROXY=$https_proxy -e HF_TOKEN=${HUGGINGFACEHUB_API_TOKEN} vllm:hpu /bin/bash -c "export VLLM_CPU_KVCACHE_SPACE=40 && python3 -m vllm.entrypoints.openai.api_server --enforce-eager --model $model_name --tensor-parallel-size $parallel_number --host 0.0.0.0 --port 80 --block-size $block_size --max-num-seqs $max_num_seqs --max-seq-len-to-capture $max_seq_len_to_capture "
else
docker run -d --rm --name="vllm-service" -p $port_number:80 --network=host -v $volume:/data -e HTTPS_PROXY=$https_proxy -e HTTP_PROXY=$https_proxy -e HF_TOKEN=${HUGGINGFACEHUB_API_TOKEN} vllm:cpu /bin/bash -c "cd / && export VLLM_CPU_KVCACHE_SPACE=40 && python3 -m vllm.entrypoints.openai.api_server --model $model_name --host 0.0.0.0 --port 80"
fi
5 changes: 2 additions & 3 deletions comps/llms/text-generation/vllm/llm.py
@@ -31,16 +31,15 @@ def post_process_text(text: str):
)
@traceable(run_type="llm")
def llm_generate(input: LLMParamsDoc):
llm_endpoint = os.getenv("vLLM_LLM_ENDPOINT", "http://localhost:8008")
model_name = os.getenv("LLM_MODEL_ID", "meta-llama/Meta-Llama-3-8B-Instruct")
llm_endpoint = os.getenv("vLLM_ENDPOINT", "http://localhost:8008")
model_name = os.getenv("LLM_MODEL", "meta-llama/Meta-Llama-3-8B-Instruct")
llm = VLLMOpenAI(
openai_api_key="EMPTY",
openai_api_base=llm_endpoint + "/v1",
max_tokens=input.max_new_tokens,
model_name=model_name,
top_p=input.top_p,
temperature=input.temperature,
presence_penalty=input.repetition_penalty,
streaming=input.streaming,
)

