Enable vLLM Gaudi support for LLM service based on the official Habana vLLM release #137

Merged
merged 8 commits into from
Jun 12, 2024
14 changes: 9 additions & 5 deletions README.md
@@ -134,8 +134,8 @@ The initially supported `Microservices` are described in the below table. More `
<td>Dataprep on Xeon CPU</td>
</tr>
<tr>
<td rowspan="5"><a href="./comps/llms/README.md">LLM</a></td>
<td rowspan="5"><a href="https://www.langchain.com">LangChain</a></td>
<td rowspan="6"><a href="./comps/llms/README.md">LLM</a></td>
<td rowspan="6"><a href="https://www.langchain.com">LangChain</a></td>
<td rowspan="2"><a href="https://huggingface.co/Intel/neural-chat-7b-v3-3">Intel/neural-chat-7b-v3-3</a></td>
<td><a href="https://github.com/huggingface/tgi-gaudi">TGI Gaudi</a></td>
<td>Gaudi2</td>
@@ -147,7 +147,7 @@ The initially supported `Microservices` are described in the below table. More `
<td>LLM on Xeon CPU</td>
</tr>
<tr>
<td rowspan="2"><a href="https://huggingface.co/meta-llama/Llama-2-7b-chat-hf">meta-llama/Llama-2-7b-chat-hf</a></td>
<td rowspan="2"><a href="https://huggingface.co/Intel/neural-chat-7b-v3-3">Intel/neural-chat-7b-v3-3</a></td>
<td rowspan="2"><a href="https://github.com/ray-project/ray">Ray Serve</a></td>
<td>Gaudi2</td>
<td>LLM on Gaudi2</td>
@@ -157,8 +157,12 @@ The initially supported `Microservices` are described in the below table. More `
<td>LLM on Xeon CPU</td>
</tr>
<tr>
<td><a href="https://huggingface.co/mistralai/Mistral-7B-v0.1">mistralai/Mistral-7B-v0.1</a></td>
<td><a href="https://github.com/vllm-project/vllm/">vLLM</a></td>
<td rowspan="2"><a href="https://huggingface.co/Intel/neural-chat-7b-v3-3">Intel/neural-chat-7b-v3-3</a></td>
<td rowspan="2"><a href="https://github.com/vllm-project/vllm/">vLLM</a></td>
<td>Gaudi2</td>
<td>LLM on Gaudi2</td>
</tr>
<tr>
<td>Xeon</td>
<td>LLM on Xeon CPU</td>
</tr>
17 changes: 10 additions & 7 deletions comps/llms/text-generation/vllm/README.md
@@ -1,17 +1,19 @@
# vLLM Endpoint Serve

[vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference and serving. It delivers state-of-the-art serving throughput with advanced features such as PagedAttention and continuous batching. Besides GPUs, vLLM also supports [Intel CPUs](https://www.intel.com/content/www/us/en/products/overview.html); Gaudi accelerator support will be added soon. This guide provides an example of how to launch a vLLM serving endpoint on CPU.
[vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference and serving. It delivers state-of-the-art serving throughput with advanced features such as PagedAttention and continuous batching. Besides GPUs, vLLM also supports [Intel CPUs](https://www.intel.com/content/www/us/en/products/overview.html) and [Gaudi accelerators](https://habana.ai/products). This guide provides an example of how to launch a vLLM serving endpoint on CPU and Gaudi accelerators.

## Getting Started

### Launch vLLM CPU Service
### Launch vLLM Service

#### Launch a local server instance:

```bash
bash ./serving/vllm/launch_vllm_service.sh
```

The `./serving/vllm/launch_vllm_service.sh` script accepts an optional `hw_mode` parameter, given as the third positional argument after the port number and model name, that selects the hardware mode of the service: `cpu` (the default) or `hpu`.
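
Because `hw_mode` is read as the third positional argument, the port and model have to be given explicitly to reach it. The following is a minimal sketch of that defaulting, mirroring the variable names and defaults in the launch script. The `resolve_args` helper is hypothetical and only echoes the resolved values rather than starting a container.

```bash
#!/bin/bash
# Hypothetical helper (not part of the repository): shows how the launch
# script's positional defaults resolve, without running docker.
resolve_args() {
    local port_number=${1:-8080}
    local model_name=${2:-Intel/neural-chat-7b-v3-3}
    local hw_mode=${3:-cpu}
    echo "port=$port_number model=$model_name hw_mode=$hw_mode"
}

resolve_args                                      # all defaults, CPU mode
resolve_args 8080 Intel/neural-chat-7b-v3-3 hpu   # Gaudi (HPU) mode
```

The corresponding real invocation for Gaudi would be `bash ./serving/vllm/launch_vllm_service.sh 8080 Intel/neural-chat-7b-v3-3 hpu`.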

For gated models such as `LLAMA-2`, you will have to pass `-e HF_TOKEN=<token>` to the docker run command above with a valid Hugging Face Hub read token.

Follow the [Hugging Face token](https://huggingface.co/docs/hub/security-tokens) guide to get an access token, then export it as the `HF_TOKEN` environment variable.
@@ -33,16 +35,17 @@ curl http://127.0.0.1:8080/v1/completions \
}'
```

#### Customize vLLM CPU Service
#### Customize vLLM Service

The `./serving/vllm/launch_vllm_service.sh` script accepts two parameters:
The `./serving/vllm/launch_vllm_service.sh` script accepts three parameters:

- port_number: The port number assigned to the vLLM endpoint, with the default being 8080.
- model_name: The model name utilized for LLM, with the default set to "mistralai/Mistral-7B-v0.1".
- model_name: The model name utilized for LLM, with the default set to "Intel/neural-chat-7b-v3-3".
- hw_mode: The hardware mode utilized for LLM, with the default set to "cpu"; the other option is "hpu".

You have the flexibility to customize two parameters according to your specific needs. Additionally, you can set the vLLM CPU endpoint by exporting the environment variable `vLLM_LLM_ENDPOINT`:
You can customize these three parameters to suit your needs. Additionally, you can set the vLLM endpoint by exporting the environment variable `vLLM_LLM_ENDPOINT`:

```bash
export vLLM_LLM_ENDPOINT="http://xxx.xxx.xxx.xxx:8080"
export LLM_MODEL=<model_name> # example: export LLM_MODEL="mistralai/Mistral-7B-v0.1"
export LLM_MODEL=<model_name> # example: export LLM_MODEL="Intel/neural-chat-7b-v3-3"
```
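
As a rough illustration of how the two exported variables might be used together, the sketch below builds a request body for the OpenAI-compatible `/v1/completions` route that vLLM serves. The `build_completion_payload` helper, the prompt, and the `max_tokens` and `temperature` values are assumptions chosen for the example, not part of this PR.

```bash
#!/bin/bash
# Assumed fallbacks for illustration; substitute your own endpoint and model.
vLLM_LLM_ENDPOINT=${vLLM_LLM_ENDPOINT:-"http://127.0.0.1:8080"}
LLM_MODEL=${LLM_MODEL:-"Intel/neural-chat-7b-v3-3"}

# Hypothetical helper: builds the JSON body for /v1/completions.
# The prompt must not contain quotes or backslashes, because no JSON
# escaping is performed here.
build_completion_payload() {
    local prompt=$1
    local max_tokens=${2:-32}
    printf '{"model": "%s", "prompt": "%s", "max_tokens": %d, "temperature": 0}' \
        "$LLM_MODEL" "$prompt" "$max_tokens"
}

payload=$(build_completion_payload "What is Deep Learning?")
echo "POST $vLLM_LLM_ENDPOINT/v1/completions"
echo "$payload"

# To actually send the request once the service is up:
# curl "$vLLM_LLM_ENDPOINT/v1/completions" \
#     -H "Content-Type: application/json" \
#     -d "$payload"
```

The model name in the request should match the one the server was launched with, which is why the sketch reads it from `LLM_MODEL` rather than hard-coding it.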
38 changes: 38 additions & 0 deletions comps/llms/text-generation/vllm/build_docker.sh
@@ -0,0 +1,38 @@
#!/bin/bash

# Copyright (c) 2024 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Set default values
default_hw_mode="cpu"

# Assign arguments to variable
hw_mode=${1:-$default_hw_mode}

# Check that at most one argument is provided
if [ "$#" -gt 1 ]; then
    echo "Usage: $0 [hw_mode]"
    echo "Please customize the arguments you want to use."
    echo "  - hw_mode: The hardware mode for the vLLM endpoint: 'cpu' (default) or 'hpu'."
    exit 1
fi

# Build the docker image for vLLM based on the hardware mode
if [ "$hw_mode" = "hpu" ]; then
    docker build -f docker/Dockerfile.hpu -t vllm:hpu --shm-size=128g . --build-arg https_proxy="$https_proxy" --build-arg http_proxy="$http_proxy"
else
    git clone https://github.com/vllm-project/vllm.git
    cd ./vllm/
    docker build -f Dockerfile.cpu -t vllm:cpu --shm-size=128g . --build-arg https_proxy="$https_proxy" --build-arg http_proxy="$http_proxy"
fi
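
Note that any `hw_mode` value other than `hpu` falls through to the CPU build. One possible stricter variant is sketched below. The `validate_hw_mode` function is an illustrative assumption and is not part of this PR.

```bash
#!/bin/bash
# Hypothetical guard: accept only the two hardware modes the script documents.
validate_hw_mode() {
    case "$1" in
        cpu|hpu) return 0 ;;
        *)
            echo "Unsupported hw_mode '$1': expected 'cpu' or 'hpu'." >&2
            return 1
            ;;
    esac
}

hw_mode=${1:-cpu}
if validate_hw_mode "$hw_mode"; then
    echo "Building vllm:$hw_mode"
else
    echo "Refusing to build for unknown hardware mode"
fi
```

Rejecting unknown values up front avoids cloning the vLLM repository and building a CPU image when the caller simply mistyped `hpu`.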
9 changes: 0 additions & 9 deletions comps/llms/text-generation/vllm/build_docker_cpu.sh

This file was deleted.

20 changes: 20 additions & 0 deletions comps/llms/text-generation/vllm/docker/Dockerfile.hpu
@@ -0,0 +1,20 @@
FROM vault.habana.ai/gaudi-docker/1.16.0/ubuntu22.04/habanalabs/pytorch-installer-2.2.2:latest

ENV LANG=en_US.UTF-8

WORKDIR /root

RUN pip install --upgrade-strategy eager optimum[habana]

RUN pip install -v git+https://github.com/HabanaAI/vllm-fork.git@ae3d6121

RUN sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config && \
service ssh restart

ENV no_proxy=localhost,127.0.0.1

ENV PT_HPU_LAZY_ACC_PAR_MODE=0

ENV PT_HPU_ENABLE_LAZY_COLLECTIVES=true

CMD ["/bin/bash"]
19 changes: 14 additions & 5 deletions comps/llms/text-generation/vllm/launch_vllm_service.sh
@@ -6,20 +6,29 @@

# Set default values
default_port=8080
default_model="mistralai/Mistral-7B-v0.1"
default_hw_mode="cpu"
default_model="Intel/neural-chat-7b-v3-3"

# Assign arguments to variables
port_number=${1:-$default_port}
model_name=${2:-$default_model}
hw_mode=${3:-$default_hw_mode}

# Check if all required arguments are provided
if [ "$#" -lt 0 ] || [ "$#" -gt 2 ]; then
echo "Usage: $0 [port_number] [model_name]"
if [ "$#" -lt 0 ] || [ "$#" -gt 3 ]; then
echo "Usage: $0 [port_number] [model_name] [hw_mode]"
echo "port_number: The port number assigned to the vLLM endpoint, with the default being 8080."
echo "model_name: The model name utilized for LLM, with the default set to 'Intel/neural-chat-7b-v3-3'."
echo "hw_mode: The hardware mode utilized for LLM, with the default set to 'cpu'; the other option is 'hpu'."
exit 1
fi

# Set the volume variable
volume=$PWD/data

# Build the Docker run command based on the number of cards
docker run -it --rm --name="ChatQnA_server" -p $port_number:$port_number --network=host -v $volume:/data -e HTTPS_PROXY=$https_proxy -e HTTP_PROXY=$https_proxy -e HF_TOKEN=${HUGGINGFACEHUB_API_TOKEN} vllm:cpu /bin/bash -c "cd / && export VLLM_CPU_KVCACHE_SPACE=40 && python3 -m vllm.entrypoints.openai.api_server --model $model_name --host 0.0.0.0 --port $port_number"
# Build the Docker run command based on hardware mode
# HTTP_PROXY is taken from $http_proxy (it was previously set from $https_proxy)
if [ "$hw_mode" = "hpu" ]; then
    docker run -it --runtime=habana --rm --name="ChatQnA_server" -p $port_number:$port_number -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host -e HTTPS_PROXY=$https_proxy -e HTTP_PROXY=$http_proxy -e HF_TOKEN=${HF_TOKEN} vllm:hpu /bin/bash -c "export VLLM_CPU_KVCACHE_SPACE=40 && python3 -m vllm.entrypoints.openai.api_server --enforce-eager --model $model_name --host 0.0.0.0 --port $port_number"
else
    docker run -it --rm --name="ChatQnA_server" -p $port_number:$port_number --network=host -v $volume:/data -e HTTPS_PROXY=$https_proxy -e HTTP_PROXY=$http_proxy -e HF_TOKEN=${HF_TOKEN} vllm:cpu /bin/bash -c "cd / && export VLLM_CPU_KVCACHE_SPACE=40 && python3 -m vllm.entrypoints.openai.api_server --enforce-eager --model $model_name --host 0.0.0.0 --port $port_number"
fi
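
Model loading can take a while after `docker run` starts, so a client may want to wait until the server answers before sending requests. The sketch below polls the OpenAI-compatible `/v1/models` route with `curl`. The `wait_for_endpoint` function and its retry and delay defaults are assumptions for illustration and are not part of this PR.

```bash
#!/bin/bash
# Hypothetical helper: poll the vLLM server until /v1/models responds
# successfully, or give up after a fixed number of attempts.
wait_for_endpoint() {
    local base_url=$1
    local retries=${2:-30}   # assumed default: 30 attempts
    local delay=${3:-2}      # assumed default: 2 seconds between attempts
    local attempt
    for ((attempt = 1; attempt <= retries; attempt++)); do
        if curl --silent --fail --max-time 5 --output /dev/null "$base_url/v1/models"; then
            return 0
        fi
        sleep "$delay"
    done
    return 1
}

# Example usage (run in a second terminal after launching the service):
# wait_for_endpoint "http://127.0.0.1:8080" && echo "vLLM endpoint is ready"
```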