diff --git a/README.md b/README.md
index a77afa704..51090259a 100644
--- a/README.md
+++ b/README.md
@@ -134,8 +134,8 @@ The initially supported `Microservices` are described in the below table. More `
 Dataprep on Xeon CPU
-LLM
-LangChain
+LLM
+LangChain
 Intel/neural-chat-7b-v3-3
 TGI Gaudi
 Gaudi2
@@ -147,7 +147,7 @@ The initially supported `Microservices` are described in the below table. More `
 LLM on Xeon CPU
-meta-llama/Llama-2-7b-chat-hf
+Intel/neural-chat-7b-v3-3
 Ray Serve
 Gaudi2
 LLM on Gaudi2
@@ -157,8 +157,12 @@ The initially supported `Microservices` are described in the below table. More `
 LLM on Xeon CPU
-mistralai/Mistral-7B-v0.1
-vLLM
+Intel/neural-chat-7b-v3-3
+vLLM
+Gaudi2
+LLM on Gaudi2
+
+Xeon
 LLM on Xeon CPU
diff --git a/comps/llms/text-generation/vllm/README.md b/comps/llms/text-generation/vllm/README.md
index af5343da3..338631552 100644
--- a/comps/llms/text-generation/vllm/README.md
+++ b/comps/llms/text-generation/vllm/README.md
@@ -1,10 +1,10 @@
 # vLLM Endpoint Serve
 
-[vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference and serving, it delivers state-of-the-art serving throughput with a set of advanced features such as PagedAttention, Continuous batching and etc.. Besides GPUs, vLLM already supported [Intel CPUs](https://www.intel.com/content/www/us/en/products/overview.html), Gaudi accelerators support will be added soon. This guide provides an example on how to launch vLLM serving endpoint on CPU.
+[vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference and serving. It delivers state-of-the-art serving throughput with a set of advanced features such as PagedAttention and continuous batching. Besides GPUs, vLLM already supports [Intel CPUs](https://www.intel.com/content/www/us/en/products/overview.html) and [Gaudi accelerators](https://habana.ai/products). This guide provides an example of how to launch a vLLM serving endpoint on CPU and Gaudi accelerators.
 
 ## Getting Started
 
-### Launch vLLM CPU Service
+### Launch vLLM Service
 
 #### Launch a local server instance:
 
@@ -12,6 +12,8 @@
 bash ./serving/vllm/launch_vllm_service.sh
 ```
 
+The `./serving/vllm/launch_vllm_service.sh` script accepts an optional `hw_mode` parameter that specifies the hardware mode of the service, with the default being `cpu`; `hpu` can be selected for Gaudi.
+
 For gated models such as `LLAMA-2`, you will have to pass -e HF_TOKEN=\ to the docker run command above with a valid Hugging Face Hub read token.
 
 Please follow this link [huggingface token](https://huggingface.co/docs/hub/security-tokens) to get the access token and export `HF_TOKEN` environment with the token.
@@ -33,16 +35,17 @@ curl http://127.0.0.1:8080/v1/completions \
   }'
 ```
 
-#### Customize vLLM CPU Service
+#### Customize vLLM Service
 
-The `./serving/vllm/launch_vllm_service.sh` script accepts two parameters:
+The `./serving/vllm/launch_vllm_service.sh` script accepts three parameters:
 
 - port_number: The port number assigned to the vLLM CPU endpoint, with the default being 8080.
-- model_name: The model name utilized for LLM, with the default set to "mistralai/Mistral-7B-v0.1".
+- model_name: The model name utilized for LLM, with the default set to "Intel/neural-chat-7b-v3-3".
+- hw_mode: The hardware mode utilized for LLM, with the default set to "cpu"; the optional selection is "hpu".
 
-You have the flexibility to customize two parameters according to your specific needs. Additionally, you can set the vLLM CPU endpoint by exporting the environment variable `vLLM_LLM_ENDPOINT`:
+You have the flexibility to customize these three parameters according to your specific needs. Additionally, you can set the vLLM endpoint by exporting the environment variable `vLLM_LLM_ENDPOINT`:
 
 ```bash
 export vLLM_LLM_ENDPOINT="http://xxx.xxx.xxx.xxx:8080"
-export LLM_MODEL= # example: export LLM_MODEL="mistralai/Mistral-7B-v0.1"
+export LLM_MODEL= # example: export LLM_MODEL="Intel/neural-chat-7b-v3-3"
 ```
diff --git a/comps/llms/text-generation/vllm/build_docker.sh b/comps/llms/text-generation/vllm/build_docker.sh
new file mode 100644
index 000000000..3680f076c
--- /dev/null
+++ b/comps/llms/text-generation/vllm/build_docker.sh
@@ -0,0 +1,38 @@
+#!/bin/bash
+
+# Copyright (c) 2024 Intel Corporation
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Set default values
+default_hw_mode="cpu"
+
+# Assign arguments to variables
+hw_mode=${1:-$default_hw_mode}
+
+# Check the number of arguments
+if [ "$#" -gt 1 ]; then
+  echo "Usage: $0 [hw_mode]"
+  echo "Please customize the arguments you want to use.
+  - hw_mode: The hardware mode for the vLLM endpoint, with the default being 'cpu'; the optional selections are 'cpu' and 'hpu'."
+  exit 1
+fi
+
+# Build the docker image for vLLM based on the hardware mode
+if [ "$hw_mode" = "hpu" ]; then
+  docker build -f docker/Dockerfile.hpu -t vllm:hpu --shm-size=128g . --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy
+else
+  git clone https://github.com/vllm-project/vllm.git
+  cd ./vllm/
+  docker build -f Dockerfile.cpu -t vllm:cpu --shm-size=128g . --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy
+fi
diff --git a/comps/llms/text-generation/vllm/build_docker_cpu.sh b/comps/llms/text-generation/vllm/build_docker_cpu.sh
deleted file mode 100644
index 487c4221b..000000000
--- a/comps/llms/text-generation/vllm/build_docker_cpu.sh
+++ /dev/null
@@ -1,9 +0,0 @@
-#!/bin/bash
-
-
-# Copyright (C) 2024 Intel Corporation
-# SPDX-License-Identifier: Apache-2.0
-
-git clone https://github.com/vllm-project/vllm.git
-cd ./vllm/
-docker build -f Dockerfile.cpu -t vllm:cpu --shm-size=128g . --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy
diff --git a/comps/llms/text-generation/vllm/docker/Dockerfile.hpu b/comps/llms/text-generation/vllm/docker/Dockerfile.hpu
new file mode 100644
index 000000000..430cf4641
--- /dev/null
+++ b/comps/llms/text-generation/vllm/docker/Dockerfile.hpu
@@ -0,0 +1,20 @@
+FROM vault.habana.ai/gaudi-docker/1.16.0/ubuntu22.04/habanalabs/pytorch-installer-2.2.2:latest
+
+ENV LANG=en_US.UTF-8
+
+WORKDIR /root
+
+RUN pip install --upgrade-strategy eager optimum[habana]
+
+RUN pip install -v git+https://github.com/HabanaAI/vllm-fork.git@ae3d6121
+
+RUN sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config && \
+    service ssh restart
+
+ENV no_proxy=localhost,127.0.0.1
+
+ENV PT_HPU_LAZY_ACC_PAR_MODE=0
+
+ENV PT_HPU_ENABLE_LAZY_COLLECTIVES=true
+
+CMD ["/bin/bash"]
\ No newline at end of file
diff --git a/comps/llms/text-generation/vllm/launch_vllm_service.sh b/comps/llms/text-generation/vllm/launch_vllm_service.sh
index c6fc04210..7e32c8775 100644
--- a/comps/llms/text-generation/vllm/launch_vllm_service.sh
+++ b/comps/llms/text-generation/vllm/launch_vllm_service.sh
@@ -6,20 +6,29 @@
 # Set default values
 default_port=8080
-default_model="mistralai/Mistral-7B-v0.1"
+default_hw_mode="cpu"
+default_model="Intel/neural-chat-7b-v3-3"
 
 # Assign arguments to variables
 port_number=${1:-$default_port}
 model_name=${2:-$default_model}
+hw_mode=${3:-$default_hw_mode}
 
 # Check if all required arguments are provided
-if [ "$#" -lt 0 ] || [ "$#" -gt 2 ]; then
-  echo "Usage: $0 [port_number] [model_name]"
+if [ "$#" -gt 3 ]; then
+  echo "Usage: $0 [port_number] [model_name] [hw_mode]"
+  echo "port_number: The port number assigned to the vLLM endpoint, with the default being 8080."
+  echo "model_name: The model name utilized for LLM, with the default set to 'Intel/neural-chat-7b-v3-3'."
+  echo "hw_mode: The hardware mode utilized for LLM, with the default set to 'cpu'; the optional selection is 'hpu'."
   exit 1
 fi
 
 # Set the volume variable
 volume=$PWD/data
 
-# Build the Docker run command based on the number of cards
-docker run -it --rm --name="ChatQnA_server" -p $port_number:$port_number --network=host -v $volume:/data -e HTTPS_PROXY=$https_proxy -e HTTP_PROXY=$https_proxy -e HF_TOKEN=${HUGGINGFACEHUB_API_TOKEN} vllm:cpu /bin/bash -c "cd / && export VLLM_CPU_KVCACHE_SPACE=40 && python3 -m vllm.entrypoints.openai.api_server --model $model_name --host 0.0.0.0 --port $port_number"
+# Build the Docker run command based on hardware mode
+if [ "$hw_mode" = "hpu" ]; then
+  docker run -it --runtime=habana --rm --name="ChatQnA_server" -p $port_number:$port_number -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host -e HTTPS_PROXY=$https_proxy -e HTTP_PROXY=$http_proxy -e HF_TOKEN=${HF_TOKEN} vllm:hpu /bin/bash -c "export VLLM_CPU_KVCACHE_SPACE=40 && python3 -m vllm.entrypoints.openai.api_server --enforce-eager --model $model_name --host 0.0.0.0 --port $port_number"
+else
+  docker run -it --rm --name="ChatQnA_server" -p $port_number:$port_number --network=host -v $volume:/data -e HTTPS_PROXY=$https_proxy -e HTTP_PROXY=$http_proxy -e HF_TOKEN=${HF_TOKEN} vllm:cpu /bin/bash -c "cd / && export VLLM_CPU_KVCACHE_SPACE=40 && python3 -m vllm.entrypoints.openai.api_server --enforce-eager --model $model_name --host 0.0.0.0 --port $port_number"
+fi
 
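For reference, here is a minimal sketch of the Gaudi (`hpu`) flow that the scripts in this change enable. It assumes the working directory is `comps/llms/text-generation/vllm/`, that the image is tagged `vllm:hpu` as in `build_docker.sh`, and that the port and model follow the defaults above; the Hugging Face token value and the prompt are illustrative placeholders, not values taken from this change.

```bash
# Minimal sketch of the Gaudi (hpu) flow, assuming the working directory is
# comps/llms/text-generation/vllm/ and HF_TOKEN holds a valid Hugging Face
# read token (placeholder value below, needed only for gated models).
export HF_TOKEN=<your_hf_token>

# Build the Gaudi image; build_docker.sh tags it as vllm:hpu.
bash ./build_docker.sh hpu

# Launch the endpoint with the default port, default model, and hpu mode.
# The container runs in the foreground, so issue the query below from a
# second terminal once the server is up.
bash ./launch_vllm_service.sh 8080 "Intel/neural-chat-7b-v3-3" hpu

# Query the OpenAI-compatible completions API.
curl http://127.0.0.1:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Intel/neural-chat-7b-v3-3",
        "prompt": "What is deep learning?",
        "max_tokens": 32,
        "temperature": 0
      }'
```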