diff --git a/.github/workflows/docker/compose/llms-compose-cd.yaml b/.github/workflows/docker/compose/llms-compose-cd.yaml
index 7dff6d5c6..2e8138f0e 100644
--- a/.github/workflows/docker/compose/llms-compose-cd.yaml
+++ b/.github/workflows/docker/compose/llms-compose-cd.yaml
@@ -15,6 +15,10 @@ services:
       context: vllm-openvino
       dockerfile: Dockerfile.openvino
     image: ${REGISTRY:-opea}/vllm-openvino:${TAG:-latest}
+  vllm-arc:
+    build:
+      dockerfile: comps/llms/text-generation/vllm/langchain/dependency/Dockerfile.intel_gpu
+    image: ${REGISTRY:-opea}/vllm-arc:${TAG:-latest}
   llm-eval:
     build:
       dockerfile: comps/llms/utils/lm-eval/Dockerfile
diff --git a/comps/llms/text-generation/vllm/langchain/README.md b/comps/llms/text-generation/vllm/langchain/README.md
index 6f41b9fe0..89159356f 100644
--- a/comps/llms/text-generation/vllm/langchain/README.md
+++ b/comps/llms/text-generation/vllm/langchain/README.md
@@ -98,16 +98,16 @@ For example, if we run `meta-llama/Meta-Llama-3-70b` with 8 cards, we can use following command:
 bash ./launch_vllm_service.sh 8008 meta-llama/Meta-Llama-3-70b hpu 8
 ```
 
-### 2.3 vLLM with OpenVINO
+### 2.3 vLLM with OpenVINO (on Intel GPU and CPU)
 
-vLLM powered by OpenVINO supports all LLM models from [vLLM supported models list](https://github.com/vllm-project/vllm/blob/main/docs/source/models/supported_models.rst) and can perform optimal model serving on all x86-64 CPUs with, at least, AVX2 support. OpenVINO vLLM backend supports the following advanced vLLM features:
+vLLM powered by OpenVINO supports all LLM models from the [vLLM supported models list](https://github.com/vllm-project/vllm/blob/main/docs/source/models/supported_models.rst) and can perform optimal model serving on all x86-64 CPUs with at least AVX2 support, as well as on both integrated and discrete Intel® GPUs (starting from the Intel® UHD Graphics generation). The OpenVINO vLLM backend supports the following advanced vLLM features:
 
 - Prefix caching (`--enable-prefix-caching`)
 - Chunked prefill (`--enable-chunked-prefill`)
 
 #### Build Docker Image
 
-To build the docker image, run the command
+To build the docker image for Intel CPU, run the command
 
 ```bash
 bash ./build_docker_vllm_openvino.sh
@@ -115,6 +115,14 @@ bash ./build_docker_vllm_openvino.sh
 ```
 
 Once it successfully builds, you will have the `vllm:openvino` image. It can be used to spawn a serving container with an OpenAI API endpoint, or you can work with it interactively via a bash shell.
 
+To build the docker image for Intel GPU, run the command
+
+```bash
+bash ./build_docker_vllm_openvino.sh gpu
+```
+
+Once it successfully builds, you will have the `opea/vllm-arc:latest` image. It can be used to spawn a serving container with an OpenAI API endpoint, or you can work with it interactively via a bash shell.
+
 #### Launch vLLM service
 
 For gated models, such as `LLAMA-2`, you will have to pass `-e HUGGING_FACE_HUB_TOKEN=<token>` to the `docker run` command with a valid Hugging Face Hub read token.
@@ -125,14 +133,30 @@ Please follow this link [huggingface token](https://huggingface.co/docs/hub/security-tokens)
 
 ```bash
 export HUGGINGFACEHUB_API_TOKEN=<token>
 ```
 
-To start the model server:
+To start the model server for Intel CPU:
 
 ```bash
 bash launch_vllm_service_openvino.sh
 ```
 
+To start the model server for Intel GPU:
+
+```bash
+bash launch_vllm_service_openvino.sh -d gpu
+```
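+
+Once the container is up, you can sanity-check the endpoint before sending real requests. A minimal sketch (assuming the default port `8008` and that the model has finished loading):
+
+```bash
+# List the models served by the container; the returned id should match the model you launched.
+curl http://localhost:8008/v1/models
+```
+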
 #### Performance tips
 
+The vLLM OpenVINO backend supports the following environment variables:
+
+- `VLLM_OPENVINO_DEVICE` to specify which device to utilize for the inference. If there are multiple GPUs in the system, additional indexes can be used to choose the proper one (e.g., `VLLM_OPENVINO_DEVICE=GPU.1`). If the value is not specified, the CPU device is used by default.
+
+- `VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON` enables U8 weight compression during the model loading stage. By default, compression is turned off. You can also export a model with different compression techniques using `optimum-cli` and pass the exported folder as `<model_id>`.
+
+##### CPU performance tips
+
 vLLM OpenVINO backend uses the following environment variables to control behavior:
 
 - `VLLM_OPENVINO_KVCACHE_SPACE` to specify the KV Cache size (e.g., `VLLM_OPENVINO_KVCACHE_SPACE=40` means 40 GB of space for the KV cache); a larger setting allows vLLM to run more requests in parallel. This parameter should be set based on the hardware configuration and the memory management pattern of users.
@@ -148,6 +172,17 @@ OpenVINO best known configuration is:
 
     $ VLLM_OPENVINO_KVCACHE_SPACE=100 VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8 VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON \
         python3 vllm/benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-chat-hf --dataset vllm/benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json --enable-chunked-prefill --max-num-batched-tokens 256
+
+##### GPU performance tips
+
+The GPU device implements logic for automatic detection of the available GPU memory and, by default, tries to reserve as much memory as possible for the KV cache (taking the `gpu_memory_utilization` option into account). However, this behavior can be overridden by explicitly specifying the desired amount of memory for the KV cache via the `VLLM_OPENVINO_KVCACHE_SPACE` environment variable (e.g., `VLLM_OPENVINO_KVCACHE_SPACE=8` means 8 GB of space for the KV cache).
+
+Currently, the best performance on GPU can be achieved with the default vLLM execution parameters for models with quantized weights (8-bit and 4-bit integer data types are supported) and `preemption-mode=swap`.
+
+The OpenVINO best known configuration for GPU is:
+
+    $ VLLM_OPENVINO_DEVICE=GPU VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON \
+        python3 vllm/benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-chat-hf --dataset vllm/benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json
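+
+These variables must reach the vLLM process inside the serving container. As an illustrative sketch (not the launch script's exact invocation; the container name is arbitrary), they can be forwarded through `docker run` when starting the GPU image built above:
+
+    $ docker run -d --rm --name="vllm-openvino-gpu" \
+        -p 8008:80 --device /dev/dri \
+        -e VLLM_OPENVINO_DEVICE=GPU -e VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON \
+        -e HF_TOKEN=${HUGGINGFACEHUB_API_TOKEN} \
+        -v $HOME/.cache/huggingface:/root/.cache/huggingface \
+        opea/vllm-arc:latest \
+        python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.2-3B-Instruct --max_model_len=1024 --host 0.0.0.0 --port 80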
 
 ### 2.4 Query the service
 
 And then you can make requests like below to check the service status:

diff --git a/comps/llms/text-generation/vllm/langchain/dependency/Dockerfile.intel_gpu b/comps/llms/text-generation/vllm/langchain/dependency/Dockerfile.intel_gpu
new file mode 100644
index 000000000..dfb94d2df
--- /dev/null
+++ b/comps/llms/text-generation/vllm/langchain/dependency/Dockerfile.intel_gpu
@@ -0,0 +1,34 @@
+# This vLLM Dockerfile is used to construct a vLLM image that can be used directly
+# to run the OpenAI-compatible server.
+# Based on https://github.com/vllm-project/vllm/blob/main/Dockerfile.openvino,
+# with the Intel Arc GPU support packages added.
+
+FROM ubuntu:22.04 AS dev
+
+RUN apt-get update -y && \
+    apt-get install -y \
+    git python3-pip \
+    ffmpeg libsm6 libxext6 libgl1 \
+    gpg-agent wget
+
+RUN wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | gpg --yes --dearmor --output /usr/share/keyrings/intel-graphics.gpg && \
+    echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu jammy/lts/2350 unified" | \
+    tee /etc/apt/sources.list.d/intel-gpu-jammy.list && \
+    apt update -y && \
+    apt install -y \
+    intel-opencl-icd intel-level-zero-gpu level-zero \
+    intel-media-va-driver-non-free libmfx1 libmfxgen1 libvpl2 \
+    libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \
+    libglapi-mesa libgles2-mesa-dev libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \
+    mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo hwinfo clinfo
+
+WORKDIR /workspace
+
+# Pin the vLLM release so the OpenVINO backend build is reproducible
+RUN git clone -b v0.6.3.post1 https://github.com/vllm-project/vllm.git
+
+# Install the build requirements, then build and install vLLM with the OpenVINO backend
+RUN PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" python3 -m pip install -r /workspace/vllm/requirements-build.txt
+RUN PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" VLLM_TARGET_DEVICE="openvino" python3 -m pip install /workspace/vllm/
+
+CMD ["/bin/bash"]
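The image above installs `clinfo` alongside the GPU runtime packages, which gives a quick way to confirm that a container can actually see an Intel GPU before serving a model. A minimal sketch (assuming the `opea/vllm-arc:latest` tag produced by the build script below, and that `/dev/dri` exists on the host):

```bash
# List the OpenCL devices visible inside the container; an Intel GPU should appear here.
docker run --rm --device /dev/dri opea/vllm-arc:latest clinfo | grep -i "device name"
```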
diff --git a/comps/llms/text-generation/vllm/langchain/dependency/build_docker_vllm_openvino.sh b/comps/llms/text-generation/vllm/langchain/dependency/build_docker_vllm_openvino.sh
index 7384ac8f2..2640cf460 100644
--- a/comps/llms/text-generation/vllm/langchain/dependency/build_docker_vllm_openvino.sh
+++ b/comps/llms/text-generation/vllm/langchain/dependency/build_docker_vllm_openvino.sh
@@ -3,8 +3,27 @@
 # Copyright (C) 2024 Intel Corporation
 # SPDX-License-Identifier: Apache-2.0
 
-BASEDIR="$( cd "$( dirname "$0" )" && pwd )"
-git clone https://github.com/vllm-project/vllm.git vllm
-cd ./vllm/ && git checkout v0.6.1
-docker build -t vllm:openvino -f Dockerfile.openvino . --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy
-cd $BASEDIR && rm -rf vllm
+# Set the default hardware mode
+default_hw_mode="cpu"
+
+# Assign the first argument to a variable, falling back to the default
+hw_mode=${1:-$default_hw_mode}
+
+# Reject invocations with more than one argument
+if [ "$#" -gt 1 ]; then
+  echo "Usage: $0 [hw_mode]"
+  echo "Please customize the arguments you want to use.
+  - hw_mode: the hardware mode for the vLLM endpoint; the default is 'cpu', and the valid options are 'cpu' and 'gpu'."
+  exit 1
+fi
+
+# Build the docker image for vLLM based on the hardware mode
+if [ "$hw_mode" = "gpu" ]; then
+  docker build -f Dockerfile.intel_gpu -t opea/vllm-arc:latest . --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy
+else
+  BASEDIR="$( cd "$( dirname "$0" )" && pwd )"
+  git clone https://github.com/vllm-project/vllm.git vllm
+  cd ./vllm/ && git checkout v0.6.1
+  docker build -t vllm:openvino -f Dockerfile.openvino . --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy
+  cd $BASEDIR && rm -rf vllm
+fi
diff --git a/comps/llms/text-generation/vllm/langchain/dependency/launch_vllm_service_openvino.sh b/comps/llms/text-generation/vllm/langchain/dependency/launch_vllm_service_openvino.sh
index d54970877..140df6a0f 100644
--- a/comps/llms/text-generation/vllm/langchain/dependency/launch_vllm_service_openvino.sh
+++ b/comps/llms/text-generation/vllm/langchain/dependency/launch_vllm_service_openvino.sh
@@ -9,16 +9,20 @@
 
 default_port=8008
 default_model="meta-llama/Llama-2-7b-hf"
+default_device="cpu"
 swap_space=50
+image="vllm:openvino"
 
-while getopts ":hm:p:" opt; do
+while getopts ":hm:p:d:" opt; do
   case $opt in
     h)
-      echo "Usage: $0 [-h] [-m model] [-p port]"
+      echo "Usage: $0 [-h] [-m model] [-p port] [-d device]"
       echo "Options:"
       echo "  -h         Display this help message"
-      echo "  -m model   Model (default: meta-llama/Llama-2-7b-hf)"
-      echo "  -p port    Port (default: 8000)"
+      echo "  -m model   Model (default: meta-llama/Llama-2-7b-hf for cpu,"
+      echo "             meta-llama/Llama-3.2-3B-Instruct for gpu)"
+      echo "  -p port    Port (default: 8008)"
+      echo "  -d device  Target device (default: cpu; valid options are 'cpu' and 'gpu')"
       exit 0
       ;;
     m)
@@ -27,6 +31,9 @@ while getopts ":hm:p:" opt; do
     p)
       port=$OPTARG
       ;;
+    d)
+      device=$OPTARG
+      ;;
     \?)
       echo "Invalid option: -$OPTARG" >&2
       exit 1
@@ -37,25 +44,33 @@ done
 
 # Assign arguments to variables
 model_name=${model:-$default_model}
 port_number=${port:-$default_port}
+device=${device:-$default_device}
 
 # Set the Huggingface cache directory variable
 HF_CACHE_DIR=$HOME/.cache/huggingface
-
+
+if [ "$device" = "gpu" ]; then
+  docker_args="-e VLLM_OPENVINO_DEVICE=GPU --device /dev/dri -v /dev/dri/by-path:/dev/dri/by-path"
+  vllm_args="--max_model_len=1024"
+  # Use the GPU default model unless one was supplied with -m
+  model_name=${model:-"meta-llama/Llama-3.2-3B-Instruct"}
+  image="opea/vllm-arc:latest"
+fi
 
 # Start the model server using OpenVINO as the backend inference engine.
 # Provide a container name that is unique and meaningful, typically one that includes the model name.
 docker run -d --rm --name="vllm-openvino-server" \
   -p $port_number:80 \
   --ipc=host \
+  $docker_args \
   -e HTTPS_PROXY=$https_proxy \
-  -e HTTP_PROXY=$https_proxy \
+  -e HTTP_PROXY=$http_proxy \
   -e HF_TOKEN=${HUGGINGFACEHUB_API_TOKEN} \
-  -v $HOME/.cache/huggingface:/home/user/.cache/huggingface \
-  vllm:openvino /bin/bash -c "\
+  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
+  $image /bin/bash -c "\
     cd / && \
     export VLLM_CPU_KVCACHE_SPACE=50 && \
     python3 -m vllm.entrypoints.openai.api_server \
       --model \"$model_name\" \
+      $vllm_args \
      --host 0.0.0.0 \
      --port 80"
diff --git a/comps/llms/text-generation/vllm/langchain/query.sh b/comps/llms/text-generation/vllm/langchain/query.sh
index 13b63511b..31fa18750 100644
--- a/comps/llms/text-generation/vllm/langchain/query.sh
+++ b/comps/llms/text-generation/vllm/langchain/query.sh
@@ -2,11 +2,12 @@
 # SPDX-License-Identifier: Apache-2.0
 
 your_ip="0.0.0.0"
+# Query the id of the model currently being served so it does not have to be hardcoded
+model=$(curl -s http://${your_ip}:8008/v1/models | jq -r '.data[].id')
 
 curl http://${your_ip}:8008/v1/completions \
   -H "Content-Type: application/json" \
   -d '{
-  "model": "meta-llama/Meta-Llama-3-8B-Instruct",
+  "model": "'$model'",
   "prompt": "What is Deep Learning?",
   "max_tokens": 32,
   "temperature": 0
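The model-discovery line above requires `jq` on the host. Where `jq` is unavailable, a rough equivalent can lean on Python's standard library instead (a sketch, assuming a single served model):

```bash
# Extract the first model id from the OpenAI-compatible /v1/models response without jq.
model=$(curl -s http://0.0.0.0:8008/v1/models | python3 -c 'import json,sys; print(json.load(sys.stdin)["data"][0]["id"])')
```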