diff --git a/README.md b/README.md
index a77afa704..51090259a 100644
--- a/README.md
+++ b/README.md
@@ -134,8 +134,8 @@ The initially supported `Microservices` are described in the below table. More `
Dataprep on Xeon CPU |
- LLM |
- LangChain |
+ LLM |
+ LangChain |
Intel/neural-chat-7b-v3-3 |
TGI Gaudi |
Gaudi2 |
@@ -147,7 +147,7 @@ The initially supported `Microservices` are described in the below table. More `
LLM on Xeon CPU |
- meta-llama/Llama-2-7b-chat-hf |
+ Intel/neural-chat-7b-v3-3 |
Ray Serve |
Gaudi2 |
LLM on Gaudi2 |
@@ -157,8 +157,12 @@ The initially supported `Microservices` are described in the below table. More `
LLM on Xeon CPU |
- mistralai/Mistral-7B-v0.1 |
- vLLM |
+ Intel/neural-chat-7b-v3-3 |
+ vLLM |
+ Gaudi2 |
+ LLM on Gaudi2 |
+
+
Xeon |
LLM on Xeon CPU |
diff --git a/comps/llms/text-generation/vllm/README.md b/comps/llms/text-generation/vllm/README.md
index af5343da3..338631552 100644
--- a/comps/llms/text-generation/vllm/README.md
+++ b/comps/llms/text-generation/vllm/README.md
@@ -1,10 +1,10 @@
# vLLM Endpoint Serve
-[vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference and serving, it delivers state-of-the-art serving throughput with a set of advanced features such as PagedAttention, Continuous batching and etc.. Besides GPUs, vLLM already supported [Intel CPUs](https://www.intel.com/content/www/us/en/products/overview.html), Gaudi accelerators support will be added soon. This guide provides an example on how to launch vLLM serving endpoint on CPU.
+[vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference and serving. It delivers state-of-the-art serving throughput with advanced features such as PagedAttention and continuous batching. Besides GPUs, vLLM already supports [Intel CPUs](https://www.intel.com/content/www/us/en/products/overview.html) and [Gaudi accelerators](https://habana.ai/products). This guide provides an example of how to launch a vLLM serving endpoint on CPU and on Gaudi accelerators.
## Getting Started
-### Launch vLLM CPU Service
+### Launch vLLM Service
#### Launch a local server instance:
@@ -12,6 +12,8 @@
bash ./serving/vllm/launch_vllm_service.sh
```
+The `./serving/vllm/launch_vllm_service.sh` script accepts an `hw_mode` parameter (its third positional argument) that specifies the hardware mode of the service. The default is `cpu`; pass `hpu` to target Gaudi accelerators.
+
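+For example, to launch the service on Gaudi instead of CPU (a minimal sketch; it assumes the `vllm:hpu` image has already been built, e.g. with `bash build_docker.sh hpu`), spell out all three positional arguments:
+
+```bash
+# port_number, model_name, hw_mode -- the first two values are the script defaults
+bash ./serving/vllm/launch_vllm_service.sh 8080 Intel/neural-chat-7b-v3-3 hpu
+```
+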
For gated models such as `LLAMA-2`, you will have to pass -e HF_TOKEN=\ to the docker run command above with a valid Hugging Face Hub read token.
Please follow this link [huggingface token](https://huggingface.co/docs/hub/security-tokens) to get the access token and export `HF_TOKEN` environment with the token.
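+
+A minimal sketch of that export (the token value below is a placeholder, not a real token):
+
+```bash
+export HF_TOKEN=<your-hf-read-token>  # replace with a valid Hugging Face Hub read token
+```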
@@ -33,16 +35,17 @@ curl http://127.0.0.1:8080/v1/completions \
}'
```
-#### Customize vLLM CPU Service
+#### Customize vLLM Service
-The `./serving/vllm/launch_vllm_service.sh` script accepts two parameters:
+The `./serving/vllm/launch_vllm_service.sh` script accepts three parameters:
- port_number: The port number assigned to the vLLM CPU endpoint, with the default being 8080.
-- model_name: The model name utilized for LLM, with the default set to "mistralai/Mistral-7B-v0.1".
+- model_name: The model name utilized for LLM, with the default set to "Intel/neural-chat-7b-v3-3".
+- hw_mode: The hardware mode utilized for LLM, with the default set to "cpu"; set it to "hpu" to run on Gaudi accelerators.
-You have the flexibility to customize two parameters according to your specific needs. Additionally, you can set the vLLM CPU endpoint by exporting the environment variable `vLLM_LLM_ENDPOINT`:
+You have the flexibility to customize these three parameters according to your specific needs. Additionally, you can set the vLLM endpoint by exporting the environment variable `vLLM_LLM_ENDPOINT`:
```bash
export vLLM_LLM_ENDPOINT="http://xxx.xxx.xxx.xxx:8080"
-export LLM_MODEL= # example: export LLM_MODEL="mistralai/Mistral-7B-v0.1"
+export LLM_MODEL= # example: export LLM_MODEL="Intel/neural-chat-7b-v3-3"
```
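+
+For instance, if you launch the endpoint on a non-default port (the port below is an arbitrary illustrative choice), point `vLLM_LLM_ENDPOINT` at the matching address:
+
+```bash
+# launch on a custom port (runs in the foreground)
+bash ./serving/vllm/launch_vllm_service.sh 9090 Intel/neural-chat-7b-v3-3 cpu
+# in another shell, point clients at the matching endpoint
+export vLLM_LLM_ENDPOINT="http://localhost:9090"
+```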
diff --git a/comps/llms/text-generation/vllm/build_docker.sh b/comps/llms/text-generation/vllm/build_docker.sh
new file mode 100644
index 000000000..3680f076c
--- /dev/null
+++ b/comps/llms/text-generation/vllm/build_docker.sh
@@ -0,0 +1,38 @@
+#!/bin/bash
+
+# Copyright (c) 2024 Intel Corporation
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Set default values
+default_hw_mode="cpu"
+
+# Assign arguments to variable
+hw_mode=${1:-$default_hw_mode}
+
+# Check that at most one argument is provided
+if [ "$#" -gt 1 ]; then
+    echo "Usage: $0 [hw_mode]"
+    echo "Please customize the arguments you want to use.
+    - hw_mode: The hardware mode for the vLLM endpoint, with the default being 'cpu'; the other valid value is 'hpu'."
+    exit 1
+fi
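+
+# Example usage (illustrative; run from the directory containing this script):
+#   bash build_docker.sh        # builds the CPU image, tagged vllm:cpu
+#   bash build_docker.sh hpu    # builds the Gaudi image, tagged vllm:hpu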
+
+# Build the docker image for vLLM based on the hardware mode
+if [ "$hw_mode" = "hpu" ]; then
+ docker build -f docker/Dockerfile.hpu -t vllm:hpu --shm-size=128g . --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy
+else
+ git clone https://github.com/vllm-project/vllm.git
+ cd ./vllm/
+ docker build -f Dockerfile.cpu -t vllm:cpu --shm-size=128g . --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy
+fi
diff --git a/comps/llms/text-generation/vllm/build_docker_cpu.sh b/comps/llms/text-generation/vllm/build_docker_cpu.sh
deleted file mode 100644
index 487c4221b..000000000
--- a/comps/llms/text-generation/vllm/build_docker_cpu.sh
+++ /dev/null
@@ -1,9 +0,0 @@
-#!/bin/bash
-
-
-# Copyright (C) 2024 Intel Corporation
-# SPDX-License-Identifier: Apache-2.0
-
-git clone https://github.com/vllm-project/vllm.git
-cd ./vllm/
-docker build -f Dockerfile.cpu -t vllm:cpu --shm-size=128g . --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy
diff --git a/comps/llms/text-generation/vllm/docker/Dockerfile.hpu b/comps/llms/text-generation/vllm/docker/Dockerfile.hpu
new file mode 100644
index 000000000..430cf4641
--- /dev/null
+++ b/comps/llms/text-generation/vllm/docker/Dockerfile.hpu
@@ -0,0 +1,20 @@
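+# Gaudi (HPU) image for the vLLM serving endpoint: installs optimum[habana] and the
+# HabanaAI vllm-fork on top of the Habana PyTorch base image. build_docker.sh builds
+# and tags this image as vllm:hpu when run in 'hpu' mode.
+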
+FROM vault.habana.ai/gaudi-docker/1.16.0/ubuntu22.04/habanalabs/pytorch-installer-2.2.2:latest
+
+ENV LANG=en_US.UTF-8
+
+WORKDIR /root
+
+RUN pip install --upgrade-strategy eager optimum[habana]
+
+RUN pip install -v git+https://github.com/HabanaAI/vllm-fork.git@ae3d6121
+
+RUN sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config && \
+ service ssh restart
+
+ENV no_proxy=localhost,127.0.0.1
+
+ENV PT_HPU_LAZY_ACC_PAR_MODE=0
+
+ENV PT_HPU_ENABLE_LAZY_COLLECTIVES=true
+
+CMD ["/bin/bash"]
\ No newline at end of file
diff --git a/comps/llms/text-generation/vllm/launch_vllm_service.sh b/comps/llms/text-generation/vllm/launch_vllm_service.sh
index c6fc04210..7e32c8775 100644
--- a/comps/llms/text-generation/vllm/launch_vllm_service.sh
+++ b/comps/llms/text-generation/vllm/launch_vllm_service.sh
@@ -6,20 +6,29 @@
# Set default values
default_port=8080
-default_model="mistralai/Mistral-7B-v0.1"
+default_hw_mode="cpu"
+default_model="Intel/neural-chat-7b-v3-3"
# Assign arguments to variables
port_number=${1:-$default_port}
model_name=${2:-$default_model}
+hw_mode=${3:-$default_hw_mode}
# Check if all required arguments are provided
-if [ "$#" -lt 0 ] || [ "$#" -gt 2 ]; then
- echo "Usage: $0 [port_number] [model_name]"
+if [ "$#" -lt 0 ] || [ "$#" -gt 3 ]; then
+ echo "Usage: $0 [port_number] [model_name] [hw_mode]"
+ echo "port_number: The port number assigned to the vLLM CPU endpoint, with the default being 8080."
+ echo "model_name: The model name utilized for LLM, with the default set to 'Intel/neural-chat-7b-v3-3'."
+ echo "hw_mode: The hardware mode utilized for LLM, with the default set to 'cpu', and the optional selection can be 'hpu'"
exit 1
fi
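+
+# Example (illustrative): launch on Gaudi, spelling out the default port and model
+#   bash launch_vllm_service.sh 8080 Intel/neural-chat-7b-v3-3 hpu
+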
# Set the volume variable
volume=$PWD/data
-# Build the Docker run command based on the number of cards
-docker run -it --rm --name="ChatQnA_server" -p $port_number:$port_number --network=host -v $volume:/data -e HTTPS_PROXY=$https_proxy -e HTTP_PROXY=$https_proxy -e HF_TOKEN=${HUGGINGFACEHUB_API_TOKEN} vllm:cpu /bin/bash -c "cd / && export VLLM_CPU_KVCACHE_SPACE=40 && python3 -m vllm.entrypoints.openai.api_server --model $model_name --host 0.0.0.0 --port $port_number"
+# Build the Docker run command based on hardware mode
+if [ "$hw_mode" = "hpu" ]; then
+    docker run -it --runtime=habana --rm --name="ChatQnA_server" -p $port_number:$port_number -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host -e HTTPS_PROXY=$https_proxy -e HTTP_PROXY=$http_proxy -e HF_TOKEN=${HF_TOKEN} vllm:hpu /bin/bash -c "export VLLM_CPU_KVCACHE_SPACE=40 && python3 -m vllm.entrypoints.openai.api_server --enforce-eager --model $model_name --host 0.0.0.0 --port $port_number"
+else
+    docker run -it --rm --name="ChatQnA_server" -p $port_number:$port_number --network=host -v $volume:/data -e HTTPS_PROXY=$https_proxy -e HTTP_PROXY=$http_proxy -e HF_TOKEN=${HF_TOKEN} vllm:cpu /bin/bash -c "cd / && export VLLM_CPU_KVCACHE_SPACE=40 && python3 -m vllm.entrypoints.openai.api_server --enforce-eager --model $model_name --host 0.0.0.0 --port $port_number"
+fi