From 64fb97ad2aaca99512892c4fa0c2c6f0ce37ad2e Mon Sep 17 00:00:00 2001
From: Katherine Yang
Date: Thu, 12 Oct 2023 20:21:46 -0700
Subject: [PATCH 01/13] initial commit

---
 Popular_Models_Guide/Llama2/trtllm_guide.md | 156 ++++++++++++++++++++
 README.md                                   |  12 ++
 2 files changed, 168 insertions(+)
 create mode 100644 Popular_Models_Guide/Llama2/trtllm_guide.md

diff --git a/Popular_Models_Guide/Llama2/trtllm_guide.md b/Popular_Models_Guide/Llama2/trtllm_guide.md
new file mode 100644
index 00000000..64be6689
--- /dev/null
+++ b/Popular_Models_Guide/Llama2/trtllm_guide.md
@@ -0,0 +1,156 @@

Note: This tutorial is for TensorRT-LLM Backend which is currently under development.

## Pre-build instructions

For this tutorial, we are using the Llama2-7B HuggingFace model with pre-trained weights.
Clone the repo of the model with weights and tokens [here](https://huggingface.co/meta-llama/Llama-2-7b-hf/tree/main). You will need to get permissions for the Llama2 repository as well as get access to the huggingface cli. To get access to the huggingface cli, go here: [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens).

## Installation

Launch Triton docker container with TensorRT-LLM backend
```docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /path/to/tensorrtllm_backend:/tensorrtllm_backend nvcr.io/nvidia/tritonserver:23.10-trtllm-py3 bash```

Alternatively, you can follow instructions [here](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/README.md) to build Tritonserver with Tensorrt-LLM Backend if you want to build a specialized container.

Don't forget to allow gpu usage when you launch the container.

## Create Engines for each model [skip this step if you already have a engine]
TensorRT-LLM requires each model to be compiled for the configuration you need before running.
To do so, before you run your model for the first time on Tritonserver you will need to create a TensorRT-LLM engine for the model for the configuration you want.
To do so, you will need to complete the following steps:

1. Install Tensorrt-LLM python package
   ```bash
# TensorRT-LLM is required for generating engines.
pip install git+https://github.com/NVIDIA/TensorRT-LLM.git
mkdir /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/
cp /opt/tritonserver/backends/tensorrtllm/* /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/
```

2. Log in to huggingface-cli

   ```bash
   huggingface-cli login --token hf_*****
   ```

3. 
Compile model (3 min) + + + + ```bash + python3 examples/llama/build.py \ + --model_dir meta-llama/Llama-2-7b-chat-hf \ + --dtype float16 \ + --use_gpt_attention_plugin float16 \ + --use_gemm_plugin float16 \ + --output_dir ../tensorrt_llm_backend/all_models/inflight_batcher_llm/tensorrt_llm/1 \ + --world_size 1 + ``` + + +python build.py --model_dir /mnt/nvdl/usr/katheriney/Llama-2-7b-chat-hf/ \ + --dtype bfloat16 \ + --use_gpt_attention_plugin bfloat16 \ + --use_inflight_batching \ + --paged_kv_cache \ + --remove_input_padding \ + --use_gemm_plugin bfloat16 \ + --output_dir /mnt/nvdl/usr/katheriney/engines/bf16/1-gpu/ + + ```bash + python3 examples/llama/build.py + --model_dir /data/meta-llama/Llama-2-7b-chat-hf/ \ + --dtype float16 --use_gpt_attention_plugin bfloat16 \ + --use_gemm_plugin bfloat16 \ + --output_dir /data/meta-llama/gen/7B/trt_engines/bf16/1-gpu/ + ``` + + > Optional: You can check test the output of the model with the following command: + > + > ```bash + > python3 examples/llama/run.py --engine_dir=../tensorrt_llm_backend/all_models/gpt/tensorrt_llm/1/ --max_output_len 100 --tokenizer_dir meta-llama/Llama-2-7b-chat-hf --input_text "How do I count to ten in French?" + > ``` + +## Serving with Triton + +> Note: WIP, this part doesnt work yet because it uses the wrong tokenizer + +13. Launch Triton Docker container + + ```bash + cd .. + docker run -it --rm --gpus all --network host --shm-size=1g -v $(pwd)/all_models:/app/all_models triton_trt_llm + ``` + + + +14. Install TensorRT-LLM into container (6 sec) + + ```bash + pip install tensorrt_llm/build/tensorrt_llm-0.1.3-py3-none-any.whl + ``` + +15. Modify model config file + + + + ```bash + sed -i 's#${decoupled_mode}#true#' all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt + sed -i 's#${engine_dir}#/app/all_models/inflight_batcher_llm/tensorrt_llm/1#' all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt + ``` + +16. 
Launch Tritonserver (20 mins) + + ```bash + tritonserver --model-repository /app/all_models/gpt --log-verbose 1 + ``` + + ```bash + tritonserver --model-repository /app/all_models/inflight_batcher_llm --log-verbose 1 + ``` + +## Client + +WIP + + diff --git a/README.md b/README.md index 43b95b75..8d158e6f 100644 --- a/README.md +++ b/README.md @@ -14,6 +14,18 @@ The focus of these examples is to demonstrate deployment for models trained with | [PyTorch Model](./Quick_Deploy/PyTorch/README.md) | [TensorFlow Model](./Quick_Deploy/TensorFlow/README.md) | [ONNX Model](./Quick_Deploy/ONNX/README.md) | [TensorRT Accelerated Model](https://github.com/NVIDIA/TensorRT/tree/main/quickstart/deploy_to_triton) | [vLLM Model](./Quick_Deploy/vLLM/README.md) | --------------- | ------------ | --------------- | --------------- | --------------- | +#### Supported Model Table +The table below contains a +| Model Name | Supported with HuggingFace format | Supported with TensorRT-LLM Backend | Supported with vLLM Backend | +| :-------------: | :------------------------------: | :----------------------------------: | :-------------------------: | +| [Llama2-7B](https://ai.meta.com/llama/) | [Llama-2](https://huggingface.co/meta-llama/Llama-2-7b-hf/tree/main) |[tutorial](Popular_Models_Guide/Llama2/trtllm_guide.md) | :grey_question:| +| [Persimmon-8B](https://www.adept.ai/blog/persimmon-8b) |:white_check_mark: |:grey_question: | :white_check_mark: | +| [Falcon-180B](https://falconllm.tii.ae/index.html) |:white_check_mark: |:grey_question: | :white_check_mark: | +| [Mistral-7B](https://mistral.ai/news/announcing-mistral-7b/)|:white_check_mark: |:grey_question: | :white_check_mark: | + + + + ## What does this repository contain? This repository contains the following resources: * [Conceptual Guide](./Conceptual_Guide/): This guide focuses on building a conceptual understanding of the general challenges faced whilst building inference infrastructure and how to best tackle these challenges with Triton Inference Server. From f21a8e9624d5a9406ff819c3ab5d15a8e9f21494 Mon Sep 17 00:00:00 2001 From: Katherine Yang Date: Thu, 12 Oct 2023 20:43:45 -0700 Subject: [PATCH 02/13] add details --- Popular_Models_Guide/Llama2/trtllm_guide.md | 119 ++++++++------------ 1 file changed, 46 insertions(+), 73 deletions(-) diff --git a/Popular_Models_Guide/Llama2/trtllm_guide.md b/Popular_Models_Guide/Llama2/trtllm_guide.md index 64be6689..7d91dec7 100644 --- a/Popular_Models_Guide/Llama2/trtllm_guide.md +++ b/Popular_Models_Guide/Llama2/trtllm_guide.md @@ -42,18 +42,18 @@ Alternatively, you can follow instructions [here](https://github.com/triton-infe Don't forget to allow gpu usage when you launch the container. -## Create Engines for each model [skip this step if you already have a engine] +## Create Engines for each model [skip this step if you already have an engine] TensorRT-LLM requires each model to be compiled for the configuration you need before running. To do so, before you run your model for the first time on Tritonserver you will need to create a TensorRT-LLM engine for the model for the configuration you want. To do so, you will need to complete the following steps: 1. Install Tensorrt-LLM python package ```bash -# TensorRT-LLM is required for generating engines. 
-pip install git+https://github.com/NVIDIA/TensorRT-LLM.git -mkdir /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/ -cp /opt/tritonserver/backends/tensorrtllm/* /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/ -``` + # TensorRT-LLM is required for generating engines. + pip install git+https://github.com/NVIDIA/TensorRT-LLM.git + mkdir /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/ + cp /opt/tritonserver/backends/tensorrtllm/* /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/ + ``` 2. Log in to huggingface-cli @@ -61,96 +61,69 @@ cp /opt/tritonserver/backends/tensorrtllm/* /usr/local/lib/python3.10/dist-packa huggingface-cli login --token hf_***** ``` -3. Compile model (3 min) - - +3. Compile model engines + The script to build Llama models is located in [TensorRT-LLM repository](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples). We use the one located in the docker container as + `/tensorrtllm_backend/tensorrt_llm/examples/llama/build.py`. + This command compiles the model with in flight batching and 1 GPU. More details for the scripting please see the documentation for the Llama example [here](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama/README.md). ```bash - python3 examples/llama/build.py \ - --model_dir meta-llama/Llama-2-7b-chat-hf \ - --dtype float16 \ - --use_gpt_attention_plugin float16 \ - --use_gemm_plugin float16 \ - --output_dir ../tensorrt_llm_backend/all_models/inflight_batcher_llm/tensorrt_llm/1 \ - --world_size 1 + python build.py --model_dir //Llama-2-7b-chat-hf/ \ + --dtype bfloat16 \ + --use_gpt_attention_plugin bfloat16 \ + --use_inflight_batching \ + --paged_kv_cache \ + --remove_input_padding \ + --use_gemm_plugin bfloat16 \ + --output_dir //1-gpu/ + --world-size 1 ``` - -python build.py --model_dir /mnt/nvdl/usr/katheriney/Llama-2-7b-chat-hf/ \ - --dtype bfloat16 \ - --use_gpt_attention_plugin bfloat16 \ - --use_inflight_batching \ - --paged_kv_cache \ - --remove_input_padding \ - --use_gemm_plugin bfloat16 \ - --output_dir /mnt/nvdl/usr/katheriney/engines/bf16/1-gpu/ - - ```bash - python3 examples/llama/build.py - --model_dir /data/meta-llama/Llama-2-7b-chat-hf/ \ - --dtype float16 --use_gpt_attention_plugin bfloat16 \ - --use_gemm_plugin bfloat16 \ - --output_dir /data/meta-llama/gen/7B/trt_engines/bf16/1-gpu/ - ``` - - > Optional: You can check test the output of the model with the following command: + > Optional: You can check test the output of the model with `run.py` + > located in the same llama examples folder. > > ```bash - > python3 examples/llama/run.py --engine_dir=../tensorrt_llm_backend/all_models/gpt/tensorrt_llm/1/ --max_output_len 100 --tokenizer_dir meta-llama/Llama-2-7b-chat-hf --input_text "How do I count to ten in French?" + > python3 /run.py --engine_dir=/1-gpu/ --max_output_len 100 --tokenizer_dir /Llama-2-7b-chat-hf --input_text "How do I count to ten in French?" > ``` ## Serving with Triton -> Note: WIP, this part doesnt work yet because it uses the wrong tokenizer - -13. Launch Triton Docker container +We're almost there! The last step is to create a Triton readable model. You can +find a template of a model that uses in flight batching in [tensorrtllm_backend/all_models/inflight_batcher_llm](https://github.com/triton-inference-server/tensorrtllm_backend/tree/main/all_models/inflight_batcher_llm). +To run our Llama2-7B model, you will need to: - ```bash - cd .. 
- docker run -it --rm --gpus all --network host --shm-size=1g -v $(pwd)/all_models:/app/all_models triton_trt_llm - ``` - +1. Copy over the inflight batcher models repository + ```bash + cp -R /tensorrtllm_backend/all_models/inflight_batcher_llm /opt/tritonserver/. + ``` -14. Install TensorRT-LLM into container (6 sec) +2. Modify config.pbtxt for the preprocessing, postprocessing and processing steps ```bash - pip install tensorrt_llm/build/tensorrt_llm-0.1.3-py3-none-any.whl + # preprocessing + sed -i 's#${tokenizer_dir}#//1-gpu/#' /opt/tritonserver/inflight_batcher_llm/preprocessing/config.pbtxt + sed -i 's#${tokenizer_type}#auto#' /opt/tritonserver/inflight_batcher_llm/preprocessing/config.pbtxt + sed -i 's#${tokenizer_dir}#//1-gpu/#' /opt/tritonserver/inflight_batcher_llm/postprocessing/config.pbtxt + sed -i 's#${tokenizer_type}#auto#' /opt/tritonserver/inflight_batcher_llm/postprocessing/config.pbtxt + + sed -i 's#${decoupled_mode}#false#' /opt/tritonserver/inflight_batcher_llm/tensorrt_llm/config.pbtxt + sed -i 's#${engine_dir}#//1-gpu/#' /opt/tritonserver/inflight_batcher_llm/tensorrt_llm/config.pbtxt ``` + Also, ensure that the `gpt_model_type` parameter is set to `inflight_fused_batching` -15. Modify model config file - - +3. Launch Tritonserver ```bash - sed -i 's#${decoupled_mode}#true#' all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt - sed -i 's#${engine_dir}#/app/all_models/inflight_batcher_llm/tensorrt_llm/1#' all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt + tritonserver --model-repository=/opt/tritonserver/inflight_batcher_llm ``` -16. Launch Tritonserver (20 mins) - - ```bash - tritonserver --model-repository /app/all_models/gpt --log-verbose 1 - ``` +## Client - ```bash - tritonserver --model-repository /app/all_models/inflight_batcher_llm --log-verbose 1 - ``` +You can test the results of the run with the [inflight_batcher_llm_client.py script](https://github.com/triton-inference-server/tensorrtllm_backend/tree/main/inflight_batcher_llm) -## Client +```bash +python3 /tensorrtllm_backend/inflight_batcher_llm/client/inflight_batcher_llm_client.py --request-output-len 200 +``` -WIP From 017bb8a41df7bf29330c9e32c5607eadb3c14e66 Mon Sep 17 00:00:00 2001 From: Katherine Yang Date: Fri, 13 Oct 2023 14:43:44 -0700 Subject: [PATCH 03/13] fix pre-commit --- Popular_Models_Guide/Llama2/trtllm_guide.md | 20 ++++++++++---------- README.md | 4 ++-- 2 files changed, 12 insertions(+), 12 deletions(-) diff --git a/Popular_Models_Guide/Llama2/trtllm_guide.md b/Popular_Models_Guide/Llama2/trtllm_guide.md index 7d91dec7..3936d43b 100644 --- a/Popular_Models_Guide/Llama2/trtllm_guide.md +++ b/Popular_Models_Guide/Llama2/trtllm_guide.md @@ -35,21 +35,21 @@ Clone the repo of the model with weights and tokens [here](https://huggingface.c ## Installation -Launch Triton docker container with TensorRT-LLM backend +Launch Triton docker container with TensorRT-LLM backend ```docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /path/to/tensorrtllm_backend:/tensorrtllm_backend nvcr.io/nvidia/tritonserver:23.10-trtllm-py3 bash``` -Alternatively, you can follow instructions [here](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/README.md) to build Tritonserver with Tensorrt-LLM Backend if you want to build a specialized container. 
+Alternatively, you can follow instructions [here](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/README.md) to build Tritonserver with Tensorrt-LLM Backend if you want to build a specialized container. Don't forget to allow gpu usage when you launch the container. ## Create Engines for each model [skip this step if you already have an engine] -TensorRT-LLM requires each model to be compiled for the configuration you need before running. -To do so, before you run your model for the first time on Tritonserver you will need to create a TensorRT-LLM engine for the model for the configuration you want. +TensorRT-LLM requires each model to be compiled for the configuration you need before running. +To do so, before you run your model for the first time on Tritonserver you will need to create a TensorRT-LLM engine for the model for the configuration you want. To do so, you will need to complete the following steps: 1. Install Tensorrt-LLM python package ```bash - # TensorRT-LLM is required for generating engines. + # TensorRT-LLM is required for generating engines. pip install git+https://github.com/NVIDIA/TensorRT-LLM.git mkdir /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/ cp /opt/tritonserver/backends/tensorrtllm/* /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/ @@ -78,7 +78,7 @@ To do so, you will need to complete the following steps: --world-size 1 ``` - > Optional: You can check test the output of the model with `run.py` + > Optional: You can check test the output of the model with `run.py` > located in the same llama examples folder. > > ```bash @@ -94,10 +94,10 @@ To run our Llama2-7B model, you will need to: 1. Copy over the inflight batcher models repository ```bash - cp -R /tensorrtllm_backend/all_models/inflight_batcher_llm /opt/tritonserver/. + cp -R /tensorrtllm_backend/all_models/inflight_batcher_llm /opt/tritonserver/. ``` -2. Modify config.pbtxt for the preprocessing, postprocessing and processing steps +2. Modify config.pbtxt for the preprocessing, postprocessing and processing steps ```bash # preprocessing @@ -105,13 +105,13 @@ To run our Llama2-7B model, you will need to: sed -i 's#${tokenizer_type}#auto#' /opt/tritonserver/inflight_batcher_llm/preprocessing/config.pbtxt sed -i 's#${tokenizer_dir}#//1-gpu/#' /opt/tritonserver/inflight_batcher_llm/postprocessing/config.pbtxt sed -i 's#${tokenizer_type}#auto#' /opt/tritonserver/inflight_batcher_llm/postprocessing/config.pbtxt - + sed -i 's#${decoupled_mode}#false#' /opt/tritonserver/inflight_batcher_llm/tensorrt_llm/config.pbtxt sed -i 's#${engine_dir}#//1-gpu/#' /opt/tritonserver/inflight_batcher_llm/tensorrt_llm/config.pbtxt ``` Also, ensure that the `gpt_model_type` parameter is set to `inflight_fused_batching` -3. Launch Tritonserver +3. 
Launch Tritonserver ```bash tritonserver --model-repository=/opt/tritonserver/inflight_batcher_llm diff --git a/README.md b/README.md index 8d158e6f..bf863f25 100644 --- a/README.md +++ b/README.md @@ -15,9 +15,9 @@ The focus of these examples is to demonstrate deployment for models trained with | --------------- | ------------ | --------------- | --------------- | --------------- | #### Supported Model Table -The table below contains a +The table below contains a | Model Name | Supported with HuggingFace format | Supported with TensorRT-LLM Backend | Supported with vLLM Backend | -| :-------------: | :------------------------------: | :----------------------------------: | :-------------------------: | +| :-------------: | :------------------------------: | :----------------------------------: | :-------------------------: | | [Llama2-7B](https://ai.meta.com/llama/) | [Llama-2](https://huggingface.co/meta-llama/Llama-2-7b-hf/tree/main) |[tutorial](Popular_Models_Guide/Llama2/trtllm_guide.md) | :grey_question:| | [Persimmon-8B](https://www.adept.ai/blog/persimmon-8b) |:white_check_mark: |:grey_question: | :white_check_mark: | | [Falcon-180B](https://falconllm.tii.ae/index.html) |:white_check_mark: |:grey_question: | :white_check_mark: | From f0ad6ca36993229af7a7031f1e22100682cf50c2 Mon Sep 17 00:00:00 2001 From: Katherine Yang Date: Thu, 26 Oct 2023 16:28:22 -0700 Subject: [PATCH 04/13] addressed comments --- Popular_Models_Guide/Llama2/trtllm_guide.md | 33 +++++++++++++++------ 1 file changed, 24 insertions(+), 9 deletions(-) diff --git a/Popular_Models_Guide/Llama2/trtllm_guide.md b/Popular_Models_Guide/Llama2/trtllm_guide.md index 3936d43b..31c3b5c4 100644 --- a/Popular_Models_Guide/Llama2/trtllm_guide.md +++ b/Popular_Models_Guide/Llama2/trtllm_guide.md @@ -26,7 +26,7 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. --> -Note: This tutorial is for TensorRT-LLM Backend which is currently under development. +Note: This tutorial is for TensorRT-LLM Backend which is currently under development so is subject to change. ## Pre-build instructions @@ -35,10 +35,20 @@ Clone the repo of the model with weights and tokens [here](https://huggingface.c ## Installation -Launch Triton docker container with TensorRT-LLM backend +1. The installation starts with cloning the TensorRT-LLM Backend and update the TensorRT-LLM submodule: +```bash +git clone https://github.com/triton-inference-server/tensorrtllm_backend.git +# Update the submodules +cd tensorrtllm_backend +git submodule update --init --recursive +git lfs install +git lfs pull +``` + +2. Then launch Triton docker container with TensorRT-LLM backend ```docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /path/to/tensorrtllm_backend:/tensorrtllm_backend nvcr.io/nvidia/tritonserver:23.10-trtllm-py3 bash``` -Alternatively, you can follow instructions [here](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/README.md) to build Tritonserver with Tensorrt-LLM Backend if you want to build a specialized container. +Alternatively, you can follow instructions [here](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/README.md) to build Triton Server with Tensorrt-LLM Backend if you want to build a specialized container. Don't forget to allow gpu usage when you launch the container. @@ -62,12 +72,13 @@ To do so, you will need to complete the following steps: ``` 3. 
Compile model engines + The script to build Llama models is located in [TensorRT-LLM repository](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples). We use the one located in the docker container as `/tensorrtllm_backend/tensorrt_llm/examples/llama/build.py`. - This command compiles the model with in flight batching and 1 GPU. More details for the scripting please see the documentation for the Llama example [here](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama/README.md). + This command compiles the model with inflight batching and 1 GPU. More details for the scripting please see the documentation for the Llama example [here](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama/README.md). ```bash - python build.py --model_dir //Llama-2-7b-chat-hf/ \ + python build.py --model_dir //Llama-2-7b-hf/ \ --dtype bfloat16 \ --use_gpt_attention_plugin bfloat16 \ --use_inflight_batching \ @@ -82,22 +93,23 @@ To do so, you will need to complete the following steps: > located in the same llama examples folder. > > ```bash - > python3 /run.py --engine_dir=/1-gpu/ --max_output_len 100 --tokenizer_dir /Llama-2-7b-chat-hf --input_text "How do I count to ten in French?" + > python3 /run.py --engine_dir=/1-gpu/ --max_output_len 100 --tokenizer_dir /Llama-2-7b-hf --input_text "How do I count to ten in French?" > ``` ## Serving with Triton -We're almost there! The last step is to create a Triton readable model. You can +The last step is to create a Triton readable model. You can find a template of a model that uses in flight batching in [tensorrtllm_backend/all_models/inflight_batcher_llm](https://github.com/triton-inference-server/tensorrtllm_backend/tree/main/all_models/inflight_batcher_llm). To run our Llama2-7B model, you will need to: 1. Copy over the inflight batcher models repository + ```bash cp -R /tensorrtllm_backend/all_models/inflight_batcher_llm /opt/tritonserver/. ``` -2. Modify config.pbtxt for the preprocessing, postprocessing and processing steps +2. Modify config.pbtxt for the preprocessing, postprocessing and processing steps. See details in [documentation](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/README.md#create-the-model-repository): ```bash # preprocessing @@ -119,11 +131,14 @@ To run our Llama2-7B model, you will need to: ## Client -You can test the results of the run with the [inflight_batcher_llm_client.py script](https://github.com/triton-inference-server/tensorrtllm_backend/tree/main/inflight_batcher_llm) +You can test the results of the run with: +1. The [inflight_batcher_llm_client.py script](https://github.com/triton-inference-server/tensorrtllm_backend/tree/main/inflight_batcher_llm) ```bash python3 /tensorrtllm_backend/inflight_batcher_llm/client/inflight_batcher_llm_client.py --request-output-len 200 ``` +2. The [generate endpoint](https://github.com/triton-inference-server/tensorrtllm_backend/tree/release/0.5.0#query-the-server-with-the-triton-generate-endpoint) if you are using the Triton TensorRT-LLM Backend container with versions greater than `r23.10`. 
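
As a quick reference, the generate endpoint described in the Client section above can be exercised with a single HTTP request. The sketch below assumes Triton's default HTTP port (8000) and the `ensemble` model exposed by the `inflight_batcher_llm` model repository; adjust both to match your own deployment.

```bash
# Minimal smoke test of the generate endpoint (assumes the default HTTP port 8000
# and the "ensemble" model from the inflight_batcher_llm model repository).
curl -X POST localhost:8000/v2/models/ensemble/generate -d \
  '{"text_input": "How do I count to ten in French?", "max_tokens": 64, "bad_words": "", "stop_words": ""}'
```
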
+ From a8044ce9abd03145bf74a73a277dd0c5cea47cb1 Mon Sep 17 00:00:00 2001 From: Katherine Yang Date: Thu, 26 Oct 2023 16:41:25 -0700 Subject: [PATCH 05/13] update table to include notes --- README.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/README.md b/README.md index bf863f25..6aacb51e 100644 --- a/README.md +++ b/README.md @@ -14,17 +14,17 @@ The focus of these examples is to demonstrate deployment for models trained with | [PyTorch Model](./Quick_Deploy/PyTorch/README.md) | [TensorFlow Model](./Quick_Deploy/TensorFlow/README.md) | [ONNX Model](./Quick_Deploy/ONNX/README.md) | [TensorRT Accelerated Model](https://github.com/NVIDIA/TensorRT/tree/main/quickstart/deploy_to_triton) | [vLLM Model](./Quick_Deploy/vLLM/README.md) | --------------- | ------------ | --------------- | --------------- | --------------- | -#### Supported Model Table -The table below contains a +#### Example models +The table below contains some popular models that are supported in our tutorials | Model Name | Supported with HuggingFace format | Supported with TensorRT-LLM Backend | Supported with vLLM Backend | | :-------------: | :------------------------------: | :----------------------------------: | :-------------------------: | | [Llama2-7B](https://ai.meta.com/llama/) | [Llama-2](https://huggingface.co/meta-llama/Llama-2-7b-hf/tree/main) |[tutorial](Popular_Models_Guide/Llama2/trtllm_guide.md) | :grey_question:| -| [Persimmon-8B](https://www.adept.ai/blog/persimmon-8b) |:white_check_mark: |:grey_question: | :white_check_mark: | -| [Falcon-180B](https://falconllm.tii.ae/index.html) |:white_check_mark: |:grey_question: | :white_check_mark: | -| [Mistral-7B](https://mistral.ai/news/announcing-mistral-7b/)|:white_check_mark: |:grey_question: | :white_check_mark: | - - +| [Persimmon-8B](https://www.adept.ai/blog/persimmon-8b) | [HuggingFace tutorial](https://github.com/triton-inference-server/tutorials/tree/main/Quick_Deploy/HuggingFaceTransformers) |:grey_question: | :white_check_mark: | +| [Falcon-180B](https://falconllm.tii.ae/index.html) |[HuggingFace tutorial](https://github.com/triton-inference-server/tutorials/tree/main/Quick_Deploy/HuggingFaceTransformers) |:grey_question: | :white_check_mark: | +**Note:** +1. :white_check_mark: indicates that the model has been verified to work with said backend, :grey_question: indicates the model has not been verified to work. +2. This is not an exhausitive list of what Triton supports, just what is included in the tutorials. ## What does this repository contain? 
This repository contains the following resources: From 924e36d9fabb3e75903b87f7e895718d60008de7 Mon Sep 17 00:00:00 2001 From: Katherine Yang Date: Thu, 26 Oct 2023 16:42:48 -0700 Subject: [PATCH 06/13] fixed typos --- Popular_Models_Guide/Llama2/trtllm_guide.md | 4 ++-- README.md | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/Popular_Models_Guide/Llama2/trtllm_guide.md b/Popular_Models_Guide/Llama2/trtllm_guide.md index 31c3b5c4..d69eaeec 100644 --- a/Popular_Models_Guide/Llama2/trtllm_guide.md +++ b/Popular_Models_Guide/Llama2/trtllm_guide.md @@ -85,8 +85,8 @@ To do so, you will need to complete the following steps: --paged_kv_cache \ --remove_input_padding \ --use_gemm_plugin bfloat16 \ - --output_dir //1-gpu/ - --world-size 1 + --output_dir //1-gpu/ \ + --world_size 1 ``` > Optional: You can check test the output of the model with `run.py` diff --git a/README.md b/README.md index 6aacb51e..879ac825 100644 --- a/README.md +++ b/README.md @@ -23,7 +23,7 @@ The table below contains some popular models that are supported in our tutorials | [Falcon-180B](https://falconllm.tii.ae/index.html) |[HuggingFace tutorial](https://github.com/triton-inference-server/tutorials/tree/main/Quick_Deploy/HuggingFaceTransformers) |:grey_question: | :white_check_mark: | **Note:** -1. :white_check_mark: indicates that the model has been verified to work with said backend, :grey_question: indicates the model has not been verified to work. +1. :white_check_mark: indicates that the model has been verified to work with said backend, :grey_question: indicates the model has not been verified to work. 2. This is not an exhausitive list of what Triton supports, just what is included in the tutorials. ## What does this repository contain? From dce85407bbc66bdd075142e85c2773362ef3548b Mon Sep 17 00:00:00 2001 From: Katherine Yang Date: Thu, 26 Oct 2023 17:03:28 -0700 Subject: [PATCH 07/13] addressed more comments --- Popular_Models_Guide/Llama2/trtllm_guide.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/Popular_Models_Guide/Llama2/trtllm_guide.md b/Popular_Models_Guide/Llama2/trtllm_guide.md index d69eaeec..2d7f2419 100644 --- a/Popular_Models_Guide/Llama2/trtllm_guide.md +++ b/Popular_Models_Guide/Llama2/trtllm_guide.md @@ -26,8 +26,6 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. --> -Note: This tutorial is for TensorRT-LLM Backend which is currently under development so is subject to change. - ## Pre-build instructions For this tutorial, we are using the Llama2-7B HuggingFace model with pre-trained weights. 
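
As a concrete illustration of the pre-build step referenced above, the gated Llama-2-7B weights can be pulled down once access to the repository has been granted. One possible way to do this is shown below; any local path layout works, and git will prompt for your Hugging Face username and an access token in place of a password.

```bash
# Illustrative: clone the gated Llama-2-7B weights after access has been granted.
# Git prompts for your Hugging Face username and an access token as the password.
git lfs install
git clone https://huggingface.co/meta-llama/Llama-2-7b-hf
```
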
From 045e719febd218808399f6d942ef3f85b51e9a3e Mon Sep 17 00:00:00 2001 From: Katherine Yang Date: Thu, 26 Oct 2023 18:53:05 -0700 Subject: [PATCH 08/13] fixed table --- README.md | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/README.md b/README.md index 879ac825..a0fd7ec2 100644 --- a/README.md +++ b/README.md @@ -16,15 +16,14 @@ The focus of these examples is to demonstrate deployment for models trained with #### Example models The table below contains some popular models that are supported in our tutorials -| Model Name | Supported with HuggingFace format | Supported with TensorRT-LLM Backend | Supported with vLLM Backend | -| :-------------: | :------------------------------: | :----------------------------------: | :-------------------------: | -| [Llama2-7B](https://ai.meta.com/llama/) | [Llama-2](https://huggingface.co/meta-llama/Llama-2-7b-hf/tree/main) |[tutorial](Popular_Models_Guide/Llama2/trtllm_guide.md) | :grey_question:| -| [Persimmon-8B](https://www.adept.ai/blog/persimmon-8b) | [HuggingFace tutorial](https://github.com/triton-inference-server/tutorials/tree/main/Quick_Deploy/HuggingFaceTransformers) |:grey_question: | :white_check_mark: | -| [Falcon-180B](https://falconllm.tii.ae/index.html) |[HuggingFace tutorial](https://github.com/triton-inference-server/tutorials/tree/main/Quick_Deploy/HuggingFaceTransformers) |:grey_question: | :white_check_mark: | +| Model Name | Tutorial Link | +| :-------------: | :------------------------------: | +| [Llama-2-7B](https://huggingface.co/meta-llama/Llama-2-7b-hf/tree/main) |[TensorRT-LLM Tutorial](Popular_Models_Guide/Llama2/trtllm_guide.md) | +| [Persimmon-8B](https://www.adept.ai/blog/persimmon-8b) | [HuggingFace Transformers Tutorial](https://github.com/triton-inference-server/tutorials/tree/main/Quick_Deploy/HuggingFaceTransformers) | + [Falcon-180B](https://falconllm.tii.ae/index.html) |[HuggingFace Transformers Tutorial](https://github.com/triton-inference-server/tutorials/tree/main/Quick_Deploy/HuggingFaceTransformers) | **Note:** -1. :white_check_mark: indicates that the model has been verified to work with said backend, :grey_question: indicates the model has not been verified to work. -2. This is not an exhausitive list of what Triton supports, just what is included in the tutorials. +This is not an exhausitive list of what Triton supports, just what is included in the tutorials. ## What does this repository contain? This repository contains the following resources: From fb303846c27c50836d883be153843ee51522ec6e Mon Sep 17 00:00:00 2001 From: Katherine Yang Date: Fri, 27 Oct 2023 12:21:11 -0700 Subject: [PATCH 09/13] addressed comments --- Popular_Models_Guide/Llama2/trtllm_guide.md | 17 ++++++++++------- README.md | 4 ++-- 2 files changed, 12 insertions(+), 9 deletions(-) diff --git a/Popular_Models_Guide/Llama2/trtllm_guide.md b/Popular_Models_Guide/Llama2/trtllm_guide.md index 2d7f2419..d51fee9c 100644 --- a/Popular_Models_Guide/Llama2/trtllm_guide.md +++ b/Popular_Models_Guide/Llama2/trtllm_guide.md @@ -29,7 +29,8 @@ ## Pre-build instructions For this tutorial, we are using the Llama2-7B HuggingFace model with pre-trained weights. -Clone the repo of the model with weights and tokens [here](https://huggingface.co/meta-llama/Llama-2-7b-hf/tree/main). You will need to get permissions for the Llama2 repository as well as get access to the huggingface cli. 
To get access to the huggingface cli, go here: [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens). +Clone the repo of the model with weights and tokens [here](https://huggingface.co/meta-llama/Llama-2-7b-hf/tree/main). +You will need to get permissions for the Llama2 repository as well as get access to the huggingface cli. To get access to the huggingface cli, go here: [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens). ## Installation @@ -51,9 +52,7 @@ Alternatively, you can follow instructions [here](https://github.com/triton-infe Don't forget to allow gpu usage when you launch the container. ## Create Engines for each model [skip this step if you already have an engine] -TensorRT-LLM requires each model to be compiled for the configuration you need before running. -To do so, before you run your model for the first time on Tritonserver you will need to create a TensorRT-LLM engine for the model for the configuration you want. -To do so, you will need to complete the following steps: +TensorRT-LLM requires each model to be compiled for the configuration you need before running. To do so, before you run your model for the first time on Tritonserver you will need to create a TensorRT-LLM engine for the model for the configuration you want with the following steps: 1. Install Tensorrt-LLM python package ```bash @@ -71,9 +70,9 @@ To do so, you will need to complete the following steps: 3. Compile model engines - The script to build Llama models is located in [TensorRT-LLM repository](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples). We use the one located in the docker container as - `/tensorrtllm_backend/tensorrt_llm/examples/llama/build.py`. - This command compiles the model with inflight batching and 1 GPU. More details for the scripting please see the documentation for the Llama example [here](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama/README.md). + The script to build Llama models is located in [TensorRT-LLM repository](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples). We use the one located in the docker container as `/tensorrtllm_backend/tensorrt_llm/examples/llama/build.py`. + This command compiles the model with inflight batching and 1 GPU. To run with more GPUs, you will need to change the build command to use `--world_size X`. + More details for the scripting please see the documentation for the Llama example [here](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama/README.md). ```bash python build.py --model_dir //Llama-2-7b-hf/ \ @@ -126,6 +125,10 @@ To run our Llama2-7B model, you will need to: ```bash tritonserver --model-repository=/opt/tritonserver/inflight_batcher_llm ``` + Note if you built the engine with `--world-size X` where `X` is greater than 1, you will need to use the [launch_triton_server.py](https://github.com/triton-inference-server/tensorrtllm_backend/blob/release/0.5.0/scripts/launch_triton_server.py) script. 
+ ```bash + python3 /tensorrtllm_backend/scripts/launch_triton_server.py --world_size=4 --model_repo=/opt/tritonserver/inflight_batcher_llm + ``` ## Client diff --git a/README.md b/README.md index a0fd7ec2..c02473d5 100644 --- a/README.md +++ b/README.md @@ -16,11 +16,11 @@ The focus of these examples is to demonstrate deployment for models trained with #### Example models The table below contains some popular models that are supported in our tutorials -| Model Name | Tutorial Link | +| Example Models | ####Tutorial Link | | :-------------: | :------------------------------: | | [Llama-2-7B](https://huggingface.co/meta-llama/Llama-2-7b-hf/tree/main) |[TensorRT-LLM Tutorial](Popular_Models_Guide/Llama2/trtllm_guide.md) | | [Persimmon-8B](https://www.adept.ai/blog/persimmon-8b) | [HuggingFace Transformers Tutorial](https://github.com/triton-inference-server/tutorials/tree/main/Quick_Deploy/HuggingFaceTransformers) | - [Falcon-180B](https://falconllm.tii.ae/index.html) |[HuggingFace Transformers Tutorial](https://github.com/triton-inference-server/tutorials/tree/main/Quick_Deploy/HuggingFaceTransformers) | +[Falcon-7B](https://huggingface.co/tiiuae/falcon-7b) |[HuggingFace Transformers Tutorial](https://github.com/triton-inference-server/tutorials/tree/main/Quick_Deploy/HuggingFaceTransformers) | **Note:** This is not an exhausitive list of what Triton supports, just what is included in the tutorials. From 086b182054fbb3161e0abe00b6bfb5c595356c93 Mon Sep 17 00:00:00 2001 From: Katherine Yang Date: Fri, 27 Oct 2023 12:22:35 -0700 Subject: [PATCH 10/13] address unseen comment --- README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index c02473d5..702f739a 100644 --- a/README.md +++ b/README.md @@ -10,13 +10,13 @@ For users experiencing the "Tensor in" & "Tensor out" approach to Deep Learning The focus of these examples is to demonstrate deployment for models trained with various frameworks. These are quick demonstrations made with an understanding that the user is somewhat familiar with Triton. -#### Deploy a ... +### Deploy a ... 
| [PyTorch Model](./Quick_Deploy/PyTorch/README.md) | [TensorFlow Model](./Quick_Deploy/TensorFlow/README.md) | [ONNX Model](./Quick_Deploy/ONNX/README.md) | [TensorRT Accelerated Model](https://github.com/NVIDIA/TensorRT/tree/main/quickstart/deploy_to_triton) | [vLLM Model](./Quick_Deploy/vLLM/README.md) | --------------- | ------------ | --------------- | --------------- | --------------- | -#### Example models +### LLM Tutorials The table below contains some popular models that are supported in our tutorials -| Example Models | ####Tutorial Link | +| Example Models | ####Tutorial Link | | :-------------: | :------------------------------: | | [Llama-2-7B](https://huggingface.co/meta-llama/Llama-2-7b-hf/tree/main) |[TensorRT-LLM Tutorial](Popular_Models_Guide/Llama2/trtllm_guide.md) | | [Persimmon-8B](https://www.adept.ai/blog/persimmon-8b) | [HuggingFace Transformers Tutorial](https://github.com/triton-inference-server/tutorials/tree/main/Quick_Deploy/HuggingFaceTransformers) | From 0d27224ad41f25373c857f934edf4b129ab0d96f Mon Sep 17 00:00:00 2001 From: Katherine Yang Date: Fri, 27 Oct 2023 12:54:28 -0700 Subject: [PATCH 11/13] update title leveling --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 702f739a..f328ff07 100644 --- a/README.md +++ b/README.md @@ -14,9 +14,9 @@ The focus of these examples is to demonstrate deployment for models trained with | [PyTorch Model](./Quick_Deploy/PyTorch/README.md) | [TensorFlow Model](./Quick_Deploy/TensorFlow/README.md) | [ONNX Model](./Quick_Deploy/ONNX/README.md) | [TensorRT Accelerated Model](https://github.com/NVIDIA/TensorRT/tree/main/quickstart/deploy_to_triton) | [vLLM Model](./Quick_Deploy/vLLM/README.md) | --------------- | ------------ | --------------- | --------------- | --------------- | -### LLM Tutorials +## LLM Tutorials The table below contains some popular models that are supported in our tutorials -| Example Models | ####Tutorial Link | +| Example Models | Tutorial Link | | :-------------: | :------------------------------: | | [Llama-2-7B](https://huggingface.co/meta-llama/Llama-2-7b-hf/tree/main) |[TensorRT-LLM Tutorial](Popular_Models_Guide/Llama2/trtllm_guide.md) | | [Persimmon-8B](https://www.adept.ai/blog/persimmon-8b) | [HuggingFace Transformers Tutorial](https://github.com/triton-inference-server/tutorials/tree/main/Quick_Deploy/HuggingFaceTransformers) | From 94309a746a82926c966ba4407c37049afb9bf122 Mon Sep 17 00:00:00 2001 From: Katherine Yang Date: Fri, 27 Oct 2023 14:15:05 -0700 Subject: [PATCH 12/13] address nits --- Popular_Models_Guide/Llama2/trtllm_guide.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/Popular_Models_Guide/Llama2/trtllm_guide.md b/Popular_Models_Guide/Llama2/trtllm_guide.md index d51fee9c..73595f5b 100644 --- a/Popular_Models_Guide/Llama2/trtllm_guide.md +++ b/Popular_Models_Guide/Llama2/trtllm_guide.md @@ -90,7 +90,7 @@ TensorRT-LLM requires each model to be compiled for the configuration you need b > located in the same llama examples folder. > > ```bash - > python3 /run.py --engine_dir=/1-gpu/ --max_output_len 100 --tokenizer_dir /Llama-2-7b-hf --input_text "How do I count to ten in French?" + > python3 run.py --engine_dir=/1-gpu/ --max_output_len 100 --tokenizer_dir /Llama-2-7b-hf --input_text "How do I count to ten in French?" 
> ``` ## Serving with Triton @@ -125,9 +125,9 @@ To run our Llama2-7B model, you will need to: ```bash tritonserver --model-repository=/opt/tritonserver/inflight_batcher_llm ``` - Note if you built the engine with `--world-size X` where `X` is greater than 1, you will need to use the [launch_triton_server.py](https://github.com/triton-inference-server/tensorrtllm_backend/blob/release/0.5.0/scripts/launch_triton_server.py) script. + Note if you built the engine with `--world_size X` where `X` is greater than 1, you will need to use the [launch_triton_server.py](https://github.com/triton-inference-server/tensorrtllm_backend/blob/release/0.5.0/scripts/launch_triton_server.py) script. ```bash - python3 /tensorrtllm_backend/scripts/launch_triton_server.py --world_size=4 --model_repo=/opt/tritonserver/inflight_batcher_llm + python3 /tensorrtllm_backend/scripts/launch_triton_server.py --world_size=X --model_repo=/opt/tritonserver/inflight_batcher_llm ``` ## Client From d7be3b2bbb924beee228941d30ffb220246454f5 Mon Sep 17 00:00:00 2001 From: Katherine Yang Date: Fri, 27 Oct 2023 14:31:01 -0700 Subject: [PATCH 13/13] other unresolved nits --- Popular_Models_Guide/Llama2/trtllm_guide.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Popular_Models_Guide/Llama2/trtllm_guide.md b/Popular_Models_Guide/Llama2/trtllm_guide.md index 73595f5b..0d424cbc 100644 --- a/Popular_Models_Guide/Llama2/trtllm_guide.md +++ b/Popular_Models_Guide/Llama2/trtllm_guide.md @@ -52,7 +52,7 @@ Alternatively, you can follow instructions [here](https://github.com/triton-infe Don't forget to allow gpu usage when you launch the container. ## Create Engines for each model [skip this step if you already have an engine] -TensorRT-LLM requires each model to be compiled for the configuration you need before running. To do so, before you run your model for the first time on Tritonserver you will need to create a TensorRT-LLM engine for the model for the configuration you want with the following steps: +TensorRT-LLM requires each model to be compiled for the configuration you need before running. To do so, before you run your model for the first time on Triton Server you will need to create a TensorRT-LLM engine for the model for the configuration you want with the following steps: 1. Install Tensorrt-LLM python package ```bash @@ -96,7 +96,7 @@ TensorRT-LLM requires each model to be compiled for the configuration you need b ## Serving with Triton The last step is to create a Triton readable model. You can -find a template of a model that uses in flight batching in [tensorrtllm_backend/all_models/inflight_batcher_llm](https://github.com/triton-inference-server/tensorrtllm_backend/tree/main/all_models/inflight_batcher_llm). +find a template of a model that uses inflight batching in [tensorrtllm_backend/all_models/inflight_batcher_llm](https://github.com/triton-inference-server/tensorrtllm_backend/tree/main/all_models/inflight_batcher_llm). To run our Llama2-7B model, you will need to: