From 18092f3dec2e173c180c4d1b1742bf34dbc2ed73 Mon Sep 17 00:00:00 2001
From: Harsha Ramayanam
Date: Fri, 13 Sep 2024 22:52:56 -0700
Subject: [PATCH] Changes to comps/llms/text-generation/README (#678)

* Added changes per dbkinder's review

Signed-off-by: Harsha Ramayanam

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Harsha Ramayanam
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
---
 comps/llms/text-generation/README.md | 320 +++++++++++++++++++--------
 1 file changed, 230 insertions(+), 90 deletions(-)

diff --git a/comps/llms/text-generation/README.md b/comps/llms/text-generation/README.md
index 18897572a..9c4af98c1 100644
--- a/comps/llms/text-generation/README.md
+++ b/comps/llms/text-generation/README.md
@@ -6,108 +6,149 @@ A prerequisite for using this microservice is that users must have a LLM text ge
Overall, this microservice offers a streamlined way to integrate large language model inference into applications, requiring minimal setup from the user beyond initiating a TGI/vLLM/Ray service and configuring the necessary environment variables. This allows for the seamless processing of queries and documents to generate intelligent, context-aware responses.

-## 🚀1. Start Microservice with Python (Option 1)
+## Validated LLM Models

-To start the LLM microservice, you need to install python packages first.
+| Model | TGI-Gaudi | vLLM-CPU | vLLM-Gaudi | Ray |
+| --------------------------- | --------- | -------- | ---------- | --- |
+| [Intel/neural-chat-7b-v3-3] | ✓ | ✓ | ✓ | ✓ |
+| [Llama-2-7b-chat-hf] | ✓ | ✓ | ✓ | ✓ |
+| [Llama-2-70b-chat-hf] | ✓ | - | ✓ | x |
+| [Meta-Llama-3-8B-Instruct] | ✓ | ✓ | ✓ | ✓ |
+| [Meta-Llama-3-70B-Instruct] | ✓ | - | ✓ | x |
+| [Phi-3] | x | Limit 4K | Limit 4K | ✓ |

-### 1.1 Install Requirements
+## Clone OPEA GenAIComps
+
+Clone this repository at your desired location and set an environment variable for easy setup and usage throughout the instructions.

```bash
-pip install -r requirements.txt
+git clone https://github.com/opea-project/GenAIComps.git
+
+export OPEA_GENAICOMPS_ROOT=$(pwd)/GenAIComps
```

-### 1.2 Start LLM Service
+## 🚀1. Start Microservice with Python (Option 1)

-#### 1.2.1 Start TGI Service
+To start the LLM microservice, you need to install the required Python packages first.
+
+### 1.1 Install Requirements

```bash
-export HF_TOKEN=${your_hf_api_token}
-docker run -p 8008:80 -v ./data:/data --name tgi_service --shm-size 1g ghcr.io/huggingface/text-generation-inference:1.4 --model-id ${your_hf_llm_model}
-```
+pip install opea-comps
+pip install -r ${OPEA_GENAICOMPS_ROOT}/comps/llms/requirements.txt

-#### 1.2.2 Start vLLM Service
+# Install the requirements for your chosen microservice in the text-generation folder (tgi, vllm, vllm-ray, etc.)
+export MICROSERVICE_DIR=your_chosen_microservice

-```bash
-export HUGGINGFACEHUB_API_TOKEN=${your_hf_api_token}
-docker run -it --name vllm_service -p 8008:80 -e HF_TOKEN=${HUGGINGFACEHUB_API_TOKEN} -v ./data:/data opea/vllm:cpu /bin/bash -c "cd / && export VLLM_CPU_KVCACHE_SPACE=40 && python3 -m vllm.entrypoints.openai.api_server --model ${your_hf_llm_model} --port 80"
+pip install -r ${OPEA_GENAICOMPS_ROOT}/comps/llms/text-generation/${MICROSERVICE_DIR}/requirements.txt
```

-### 1.2.3 Start Ray Service
+Set an environment variable `your_ip` to the IP address of the machine where you would like to consume the microservice.
```bash
-export HUGGINGFACEHUB_API_TOKEN=${your_hf_api_token}
-export TRUST_REMOTE_CODE=True
-docker run -it --runtime=habana --name ray_serve_service -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host -p 8008:80 -e HUGGINGFACEHUB_API_TOKEN=$HUGGINGFACEHUB_API_TOKEN -e TRUST_REMOTE_CODE=$TRUST_REMOTE_CODE opea/llm-ray:latest /bin/bash -c "ray start --head && python api_server_openai.py --port_number 80 --model_id_or_path ${your_hf_llm_model} --chat_processor ${your_hf_chatprocessor}"
+# For example, this command would set the IP address of your currently logged-in machine.
+export your_ip=$(hostname -I | awk '{print $1}')
```

-### 1.3 Verify the LLM Service
+### 1.2 Start LLM Service with Python Script

-#### 1.3.1 Verify the TGI Service
+#### 1.2.1 Start the TGI Service

```bash
-curl http://${your_ip}:8008/generate \
-  -X POST \
-  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17, "do_sample": true}}' \
-  -H 'Content-Type: application/json'
+export TGI_LLM_ENDPOINT="http://${your_ip}:8008"
+python ${OPEA_GENAICOMPS_ROOT}/comps/llms/text-generation/tgi/llm.py
```

-#### 1.3.2 Verify the vLLM Service
+#### 1.2.2 Start the vLLM Service

```bash
-curl http://${your_ip}:8008/v1/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-  "model": ${your_hf_llm_model},
-  "prompt": "What is Deep Learning?",
-  "max_tokens": 32,
-  "temperature": 0
-  }'
+export vLLM_LLM_ENDPOINT="http://${your_ip}:8008"
+python ${OPEA_GENAICOMPS_ROOT}/comps/llms/text-generation/vllm/llm.py
```

-#### 1.3.3 Verify the Ray Service
+#### 1.2.3 Start the Ray Service

```bash
-curl http://${your_ip}:8008/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-  "model": ${your_hf_llm_model},
-  "messages": [
-  {"role": "assistant", "content": "You are a helpful assistant."},
-  {"role": "user", "content": "What is Deep Learning?"},
-  ],
-  "max_tokens": 32,
-  "stream": True
-  }'
+export RAY_Serve_ENDPOINT="http://${your_ip}:8008"
+python ${OPEA_GENAICOMPS_ROOT}/comps/llms/text-generation/ray_serve/llm.py
```

-### 1.4 Start LLM Service with Python Script
+## 🚀2. Start Microservice with Docker (Option 2)
+
+You can use either a published Docker image or build your own Docker image with the respective microservice Dockerfile of your choice. You must create a user account with [HuggingFace] and obtain permission to use the restricted LLM models by adhering to the guidelines provided on the respective model's webpage.
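+
+Before pulling a gated model, you may want to confirm that your HuggingFace token actually has access to it. A minimal, optional check (this assumes the `huggingface_hub` CLI, which these instructions do not otherwise require, is installed in your environment):
+
+```bash
+# Install the HuggingFace Hub CLI and register your token locally (optional).
+pip install -U "huggingface_hub[cli]"
+huggingface-cli login --token ${your_hf_api_token}
+
+# Confirm the token is valid; this prints the account the token belongs to.
+huggingface-cli whoami
+```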
-
-#### 1.4.1 Start the TGI Service
+### 2.1 Start LLM Service with published image
+
+#### 2.1.1 Start TGI Service

```bash
-export TGI_LLM_ENDPOINT="http://${your_ip}:8008"
-python text-generation/tgi/llm.py
+export HF_LLM_MODEL=${your_hf_llm_model}
+export HF_TOKEN=${your_hf_api_token}
+
+docker run \
+  -p 8008:80 \
+  -e HF_TOKEN=${HF_TOKEN} \
+  -v ./data:/data \
+  --name tgi_service \
+  --shm-size 1g \
+  ghcr.io/huggingface/text-generation-inference:1.4 \
+  --model-id ${HF_LLM_MODEL}
```

-#### 1.4.2 Start the vLLM Service
+#### 2.1.2 Start vLLM Service

```bash
-export vLLM_LLM_ENDPOINT="http://${your_ip}:8008"
-python text-generation/vllm/llm.py
+# Use the script to build the docker image as opea/vllm:cpu
+bash ${OPEA_GENAICOMPS_ROOT}/comps/llms/text-generation/vllm/build_docker_vllm.sh cpu
+
+export HF_LLM_MODEL=${your_hf_llm_model}
+export HF_TOKEN=${your_hf_api_token}
+
+docker run -it \
+  --name vllm_service \
+  -p 8008:80 \
+  -e HF_TOKEN=${HF_TOKEN} \
+  -e VLLM_CPU_KVCACHE_SPACE=40 \
+  -v ./data:/data \
+  opea/vllm:cpu \
+  --model ${HF_LLM_MODEL} \
+  --port 80
```

-#### 1.4.3 Start the Ray Service
+#### 2.1.3 Start Ray Service

```bash
-export RAY_Serve_ENDPOINT="http://${your_ip}:8008"
-python text-generation/ray_serve/llm.py
+export HF_LLM_MODEL=${your_hf_llm_model}
+export HF_CHAT_PROCESSOR=${your_hf_chatprocessor}
+export HF_TOKEN=${your_hf_api_token}
+export TRUST_REMOTE_CODE=True
+
+docker run -it \
+  --runtime=habana \
+  --name ray_serve_service \
+  -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
+  --cap-add=sys_nice \
+  --ipc=host \
+  -p 8008:80 \
+  -e HF_TOKEN=$HF_TOKEN \
+  -e TRUST_REMOTE_CODE=$TRUST_REMOTE_CODE \
+  opea/llm-ray:latest \
+  /bin/bash -c " \
+    ray start --head && \
+    python api_server_openai.py \
+      --port_number 80 \
+      --model_id_or_path ${HF_LLM_MODEL} \
+      --chat_processor ${HF_CHAT_PROCESSOR}"
```

-## 🚀2. Start Microservice with Docker (Option 2)
+### 2.2 Start LLM Service with image built from source

If you start an LLM microservice with docker, the `docker_compose_llm.yaml` file will automatically start a TGI/vLLM service with docker.

-### 2.1 Setup Environment Variables
+#### 2.2.1 Setup Environment Variables

In order to start TGI and LLM services, you need to setup the following environment variables first.
@@ -120,7 +161,7 @@ export LLM_MODEL_ID=${your_hf_llm_model}
In order to start vLLM and LLM services, you need to setup the following environment variables first.

```bash
-export HUGGINGFACEHUB_API_TOKEN=${your_hf_api_token}
+export HF_TOKEN=${your_hf_api_token}
export vLLM_LLM_ENDPOINT="http://${your_ip}:8008"
export LLM_MODEL_ID=${your_hf_llm_model}
```
@@ -128,7 +169,7 @@ export LLM_MODEL_ID=${your_hf_llm_model}
In order to start Ray serve and LLM services, you need to setup the following environment variables first.

```bash
-export HUGGINGFACEHUB_API_TOKEN=${your_hf_api_token}
+export HF_TOKEN=${your_hf_api_token}
export RAY_Serve_ENDPOINT="http://${your_ip}:8008"
export LLM_MODEL=${your_hf_llm_model}
export CHAT_PROCESSOR="ChatModelLlama"
@@ -139,8 +180,13 @@ export CHAT_PROCESSOR="ChatModelLlama"

#### 2.2.1 TGI

```bash
-cd ../../../
-docker build -t opea/llm-tgi:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/llms/text-generation/tgi/Dockerfile .
+cd ${OPEA_GENAICOMPS_ROOT}
+
+docker build \
+  -t opea/llm-tgi:latest \
+  --build-arg https_proxy=$https_proxy \
+  --build-arg http_proxy=$http_proxy \
+  -f comps/llms/text-generation/tgi/Dockerfile .
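+
+# Optional sanity check (not part of the original build steps): confirm the
+# image is now available locally before starting a container from it.
+docker images opea/llm-tgi:latest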
```

#### 2.2.2 vLLM
@@ -148,15 +194,19 @@ docker build -t opea/llm-tgi:latest --build-arg https_proxy=$https_proxy --build
Build vllm docker.

```bash
-cd text-generation/vllm/langchain/dependency
-bash build_docker_vllm.sh
+bash ${OPEA_GENAICOMPS_ROOT}/comps/llms/text-generation/vllm/langchain/dependency/build_docker_vllm.sh
```

Build microservice docker.

```bash
-cd ../../../
-docker build -t opea/llm-vllm:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/llms/text-generation/vllm/langchain/Dockerfile .
+cd ${OPEA_GENAICOMPS_ROOT}
+
+docker build \
+  -t opea/llm-vllm:latest \
+  --build-arg https_proxy=$https_proxy \
+  --build-arg http_proxy=$http_proxy \
+  -f comps/llms/text-generation/vllm/langchain/Dockerfile .
```

#### 2.2.3 Ray Serve
@@ -164,15 +214,19 @@ docker build -t opea/llm-vllm:latest --build-arg https_proxy=$https_proxy --buil
Build Ray Serve docker.

```bash
-cd text-generation/vllm/ray/dependency
-bash build_docker_vllmray.sh
+bash ${OPEA_GENAICOMPS_ROOT}/comps/llms/text-generation/vllm/ray/dependency/build_docker_vllmray.sh
```

Build microservice docker.

```bash
-cd ../../../
-docker build -t opea/llm-ray:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/llms/text-generation/vllm/ray/Dockerfile .
+cd ${OPEA_GENAICOMPS_ROOT}
+
+docker build \
+  -t opea/llm-ray:latest \
+  --build-arg https_proxy=$https_proxy \
+  --build-arg http_proxy=$http_proxy \
+  -f comps/llms/text-generation/vllm/ray/Dockerfile .
```

To start a docker container, you have two options:
@@ -187,7 +241,15 @@ You can choose one as needed.
#### 2.3.1 TGI

```bash
-docker run -d --name="llm-tgi-server" -p 9000:9000 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e TGI_LLM_ENDPOINT=$TGI_LLM_ENDPOINT -e HF_TOKEN=$HF_TOKEN opea/llm-tgi:latest
+docker run -d \
+  --name="llm-tgi-server" \
+  -p 9000:9000 \
+  --ipc=host \
+  -e http_proxy=$http_proxy \
+  -e https_proxy=$https_proxy \
+  -e TGI_LLM_ENDPOINT=$TGI_LLM_ENDPOINT \
+  -e HF_TOKEN=$HF_TOKEN \
+  opea/llm-tgi:latest
```

#### 2.3.2 vLLM
@@ -195,13 +257,23 @@ docker run --name="llm-vllm-server" -p 9000:9000 --ipc=host -e http_proxy=$http_
Start vllm endpoint.

```bash
-bash launch_vllm_service.sh
+bash ${OPEA_GENAICOMPS_ROOT}/comps/llms/text-generation/vllm/langchain/dependency/launch_vllm_service.sh
```

Start vllm microservice.

```bash
-docker run --name="llm-vllm-server" -p 9000:9000 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e no_proxy=${no_proxy} -e vLLM_LLM_ENDPOINT=$vLLM_LLM_ENDPOINT -e HUGGINGFACEHUB_API_TOKEN=$HUGGINGFACEHUB_API_TOKEN -e LLM_MODEL_ID=$LLM_MODEL_ID opea/llm-vllm:latest
+docker run \
+  --name="llm-vllm-server" \
+  -p 9000:9000 \
+  --ipc=host \
+  -e http_proxy=$http_proxy \
+  -e https_proxy=$https_proxy \
+  -e no_proxy=${no_proxy} \
+  -e vLLM_LLM_ENDPOINT=$vLLM_LLM_ENDPOINT \
+  -e HF_TOKEN=$HF_TOKEN \
+  -e LLM_MODEL_ID=$LLM_MODEL_ID \
+  opea/llm-vllm:latest
```

#### 2.3.3 Ray Serve
@@ -209,13 +281,22 @@ docker run -d --name="llm-ray-server" -p 9000:9000 --ipc=host -e http_proxy=$htt
Start Ray Serve endpoint.

```bash
-bash launch_ray_service.sh
+bash ${OPEA_GENAICOMPS_ROOT}/comps/llms/text-generation/vllm/ray/dependency/launch_vllmray.sh
```

Start Ray Serve microservice.
```bash
-docker run -d --name="llm-ray-server" -p 9000:9000 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e RAY_Serve_ENDPOINT=$RAY_Serve_ENDPOINT -e HUGGINGFACEHUB_API_TOKEN=$HUGGINGFACEHUB_API_TOKEN -e LLM_MODEL=$LLM_MODEL opea/llm-ray:latest
+docker run -d \
+  --name="llm-ray-server" \
+  -p 9000:9000 \
+  --ipc=host \
+  -e http_proxy=$http_proxy \
+  -e https_proxy=$https_proxy \
+  -e RAY_Serve_ENDPOINT=$RAY_Serve_ENDPOINT \
+  -e HF_TOKEN=$HF_TOKEN \
+  -e LLM_MODEL=$LLM_MODEL \
+  opea/llm-ray:latest
```

### 2.4 Run Docker with Docker Compose (Option B)
@@ -223,21 +304,21 @@ docker run -d --name="llm-ray-server" -p 9000:9000 --ipc=host -e http_proxy=$htt
#### 2.4.1 TGI

```bash
-cd text-generation/tgi
+cd ${OPEA_GENAICOMPS_ROOT}/comps/llms/text-generation/tgi
docker compose -f docker_compose_llm.yaml up -d
```

#### 2.4.2 vLLM

```bash
-cd text-generation/vllm/langchain
+cd ${OPEA_GENAICOMPS_ROOT}/comps/llms/text-generation/vllm/langchain
docker compose -f docker_compose_llm.yaml up -d
```

#### 2.4.3 Ray Serve

```bash
-cd text-genetation/vllm/ray
+cd ${OPEA_GENAICOMPS_ROOT}/comps/llms/text-generation/vllm/ray
docker compose -f docker_compose_llm.yaml up -d
```
@@ -251,7 +332,47 @@ curl http://${your_ip}:9000/v1/health_check\
  -H 'Content-Type: application/json'
```

-### 3.2 Consume LLM Service
+### 3.2 Verify the LLM Service
+
+#### 3.2.1 Verify the TGI Service
+
+```bash
+curl http://${your_ip}:8008/generate \
+  -X POST \
+  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17, "do_sample": true}}' \
+  -H 'Content-Type: application/json'
+```
+
+#### 3.2.2 Verify the vLLM Service
+
+```bash
+curl http://${your_ip}:8008/v1/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "${your_hf_llm_model}",
+    "prompt": "What is Deep Learning?",
+    "max_tokens": 32,
+    "temperature": 0
+  }'
+```
+
+#### 3.2.3 Verify the Ray Service
+
+```bash
+curl http://${your_ip}:8008/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "${your_hf_llm_model}",
+    "messages": [
+      {"role": "assistant", "content": "You are a helpful assistant."},
+      {"role": "user", "content": "What is Deep Learning?"}
+    ],
+    "max_tokens": 32,
+    "stream": true
+  }'
+```
+
+### 3.3 Consume LLM Service

You can set the following model parameters according to your actual needs, such as `max_new_tokens`, `streaming`.
@@ -261,23 +382,42 @@ The `streaming` parameter determines the format of the data returned by the API.
```bash
# non-streaming mode
curl http://${your_ip}:9000/v1/chat/completions \
  -X POST \
-  -d '{"query":"What is Deep Learning?","max_new_tokens":17,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":false}' \
-  -H 'Content-Type: application/json'
+  -H 'Content-Type: application/json' \
+  -d '{
+    "query":"What is Deep Learning?",
+    "max_new_tokens":17,
+    "top_k":10,
+    "top_p":0.95,
+    "typical_p":0.95,
+    "temperature":0.01,
+    "repetition_penalty":1.03,
+    "streaming":false
+  }'
+

# streaming mode
curl http://${your_ip}:9000/v1/chat/completions \
  -X POST \
-  -d '{"query":"What is Deep Learning?","max_new_tokens":17,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":true}' \
-  -H 'Content-Type: application/json'
+  -H 'Content-Type: application/json' \
+  -d '{
+    "query":"What is Deep Learning?",
+    "max_new_tokens":17,
+    "top_k":10,
+    "top_p":0.95,
+    "typical_p":0.95,
+    "temperature":0.01,
+    "repetition_penalty":1.03,
+    "streaming":true
+  }'
+

```
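+
+If you script against the microservice, it can help to wait until it reports healthy before sending queries. A minimal sketch (it simply reuses the `/v1/health_check` endpoint shown above; the ~60 second timeout is an arbitrary choice):
+
+```bash
+# Poll the health check endpoint until the microservice answers, or give up after ~60s.
+for i in $(seq 1 30); do
+  if curl -sf http://${your_ip}:9000/v1/health_check \
+       -X GET \
+       -H 'Content-Type: application/json' > /dev/null; then
+    echo "LLM microservice is up"
+    break
+  fi
+  sleep 2
+done
+```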
-### 4. Validated Model
+
-| Model | TGI-Gaudi | vLLM-CPU | vLLM-Gaudi | Ray |
-| ------------------------- | --------- | -------- | ---------- | --- |
-| Intel/neural-chat-7b-v3-3 | ✓ | ✓ | ✓ | ✓ |
-| Llama-2-7b-chat-hf | ✓ | ✓ | ✓ | ✓ |
-| Llama-2-70b-chat-hf | ✓ | - | ✓ | x |
-| Meta-Llama-3-8B-Instruct | ✓ | ✓ | ✓ | ✓ |
-| Meta-Llama-3-70B-Instruct | ✓ | - | ✓ | x |
-| Phi-3 | x | Limit 4K | Limit 4K | ✓ |
+[Intel/neural-chat-7b-v3-3]: https://huggingface.co/Intel/neural-chat-7b-v3-3
+[Llama-2-7b-chat-hf]: https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
+[Llama-2-70b-chat-hf]: https://huggingface.co/meta-llama/Llama-2-70b-chat-hf
+[Meta-Llama-3-8B-Instruct]: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
+[Meta-Llama-3-70B-Instruct]: https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct
+[Phi-3]: https://huggingface.co/collections/microsoft/phi-3-6626e15e9585a200d2d761e3
+[HuggingFace]: https://huggingface.co/