TensorRT-LLM is NVIDIA's recommended solution for running Large Language Models (LLMs) on NVIDIA GPUs. Read more about TensorRT-LLM here and Triton's TensorRT-LLM Backend here.
NOTE: If some parts of this tutorial don't work, it is possible that there
is a version mismatch between the tutorials
and the tensorrtllm_backend
repository. Refer to llama.md
for more detailed modifications if necessary. If you are familiar with
Python, you can also try using the
High-level API
for your LLM workflow.
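As a rough sketch of that alternative workflow, assuming a recent tensorrt_llm release that ships the high-level LLM class (check the API reference for the exact signatures in your version):

```python
# Minimal sketch of TensorRT-LLM's high-level (LLM) API. Assumes a recent
# tensorrt_llm release; class and argument names may differ between versions,
# so treat this as illustrative rather than authoritative.
from tensorrt_llm import LLM, SamplingParams

# Builds (or loads) a TensorRT-LLM engine for the model under the hood.
llm = LLM(model="meta-llama/Llama-2-7b-hf")
sampling_params = SamplingParams(max_tokens=50)

for output in llm.generate(["What is ML?"], sampling_params):
    print(output.outputs[0].text)
```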
For this tutorial, we are using the Llama2-7B HuggingFace model with pre-trained weights. Clone the repo of the model with weights and tokens here. You will need permission for the Llama2 repository as well as access to the huggingface-cli. To get access to the huggingface-cli, create a token here: huggingface.co/settings/tokens.
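If you prefer to script the download rather than use the CLI, a small huggingface_hub sketch like the following can fetch the checkpoint, assuming your account has been granted access to the gated repository:

```python
# Sketch: download the gated Llama2-7B checkpoint with huggingface_hub.
# Assumes your HF account has access to meta-llama/Llama-2-7b-hf and that
# you created a token at huggingface.co/settings/tokens.
from huggingface_hub import login, snapshot_download

login(token="hf_...")  # placeholder token; or run `huggingface-cli login` first
snapshot_download(
    repo_id="meta-llama/Llama-2-7b-hf",
    local_dir="./Llama-2-7b-hf",  # mount this path into the container later
)
```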
Triton CLI is an open-source command-line interface that enables users to create, deploy, and profile models served by the Triton Inference Server.
Launch the Triton docker container with the TensorRT-LLM backend.
Note that we're mounting the downloaded Llama2-7b model to /root/.cache/huggingface
in the docker container so that Triton CLI can use it and skip the download
step.
Make an engines
folder outside docker to reuse engines for future runs.
Make sure to replace <xx.yy> with the version of Triton that you want
to use.
docker run --rm -it --net host --shm-size=2g \
--ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
-v </path/to/Llama2/repo>:/root/.cache/huggingface \
-v </path/to/engines>:/engines \
nvcr.io/nvidia/tritonserver:<xx.yy>-trtllm-python-py3
Install the latest release of Triton CLI:
GIT_REF=<LATEST_RELEASE>
pip install git+https://github.com/triton-inference-server/triton_cli.git@${GIT_REF}
Triton CLI has a single command, triton import,
that automatically converts an HF
checkpoint into the TensorRT-LLM checkpoint format, builds TensorRT-LLM engines,
and prepares a Triton model repository:
ENGINE_DEST_PATH=/engines triton import -m llama-2-7b --backend tensorrtllm
Note that specifying ENGINE_DEST_PATH
is optional, but recommended
if you want to reuse compiled engines in the future.
After a successful run of triton import,
you should see the structure of
the model repository printed in the console:
...
triton - INFO - Current repo at /root/models:
models/
├── llama-2-7b/
│   ├── 1/
│   │   ├── lib/
│   │   │   ├── decode.py
│   │   │   └── triton_decoder.py
│   │   └── model.py
│   └── config.pbtxt
├── postprocessing/
│   ├── 1/
│   │   └── model.py
│   └── config.pbtxt
├── preprocessing/
│   ├── 1/
│   │   └── model.py
│   └── config.pbtxt
└── tensorrt_llm/
    ├── 1/
    └── config.pbtxt
Start the server pointing at the default model repository:
triton start
Use the generate endpoint to send an inference request to the deployed model.
curl -X POST localhost:8000/v2/models/llama-2-7b/generate -d '{"text_input": "What is ML?", "max_tokens": 50, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}'
You should expect the following response:
{"context_logits":0.0,...,"text_output":"What is ML?\nML is a branch of AI that allows computers to learn from data, identify patterns, and make predictions. It is a powerful tool that can be used in a variety of industries, including healthcare, finance, and transportation."}
If you would like to have better control over the deployment process, the next steps will guide you through building the TensorRT-LLM engine and setting up the Triton model repository.
This tutorial requires the TensorRT-LLM Backend repository. Please note
that for the best user experience we recommend using the latest
release tag
of tensorrtllm_backend
and
the latest Triton Server container.
To clone the TensorRT-LLM Backend repository, run the following set of commands.
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git --branch <release branch>
# Update the submodules
cd tensorrtllm_backend
# Install git-lfs if needed
apt-get update && apt-get install git-lfs -y --no-install-recommends
git lfs install
git submodule update --init --recursive
Launch the Triton docker container with the TensorRT-LLM backend.
Note that we're mounting tensorrtllm_backend
to /tensorrtllm_backend
and the Llama2 model to /Llama-2-7b-hf
in the docker container for simplicity.
Make an engines
folder outside docker to reuse engines for future runs.
Make sure to replace <xx.yy> with the version of Triton that you want
to use.
docker run --rm -it --net host --shm-size=2g \
--ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
-v </path/to/tensorrtllm_backend>:/tensorrtllm_backend \
-v </path/to/Llama2/repo>:/Llama-2-7b-hf \
-v </path/to/engines>:/engines \
nvcr.io/nvidia/tritonserver:<xx.yy>-trtllm-python-py3
Alternatively, you can follow the instructions here to build Triton Server with the TensorRT-LLM Backend if you want to build a specialized container.
Don't forget to allow GPU usage when you launch the container.
Optional: For simplicity, we've condensed all of the following steps into deploy_trtllm_llama.sh. Make sure to clone the tutorials repo to your machine and start the docker container with the tutorials repo mounted to /tutorials by adding -v /path/to/tutorials/:/tutorials to the docker run command listed above. Then, once the container has started, simply run the script via /tutorials/Popular_Models_Guide/Llama2/deploy_trtllm_llama.sh <WORLD_SIZE>. For how to run an inference request, refer to the Client section of this tutorial.
TensorRT-LLM requires each model to be compiled for the configuration you need before running, so before you run your model for the first time on Triton Server you will need to create a TensorRT-LLM engine.
Starting with the 24.04 release, the Triton Server TensorRT-LLM container comes with a pre-installed TensorRT-LLM package, which allows users to build engines inside the Triton container. Simply follow the next steps:
HF_LLAMA_MODEL=/Llama-2-7b-hf
UNIFIED_CKPT_PATH=/tmp/ckpt/llama/7b/
ENGINE_DIR=/engines
CONVERT_CHKPT_SCRIPT=/tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py
python3 ${CONVERT_CHKPT_SCRIPT} --model_dir ${HF_LLAMA_MODEL} --output_dir ${UNIFIED_CKPT_PATH} --dtype float16
trtllm-build --checkpoint_dir ${UNIFIED_CKPT_PATH} \
--remove_input_padding enable \
--gpt_attention_plugin float16 \
--context_fmha enable \
--gemm_plugin float16 \
--output_dir ${ENGINE_DIR} \
--paged_kv_cache enable \
--max_batch_size 4
Optional: You can test the output of the model with
run.py,
located in the examples folder:
python3 /tensorrtllm_backend/tensorrt_llm/examples/run.py --engine_dir=/engines/ --max_output_len 50 --tokenizer_dir /Llama-2-7b-hf --input_text "What is ML?"
You should expect the following response:
[TensorRT-LLM] TensorRT-LLM version: 0.9.0
...
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 1
Input [Text 0]: "<s> What is ML?"
Output [Text 0 Beam 0]: " ML is a branch of AI that allows computers to learn from data, identify patterns, and make predictions. It is a powerful tool that can be used in a variety of industries, including healthcare, finance, and transportation."
The last step is to create a Triton-readable model repository. You can find a template of a model that uses inflight batching in tensorrtllm_backend/all_models/inflight_batcher_llm. To run our Llama2-7B model, you will need to:
- Copy over the inflight batcher models repository
cp -R /tensorrtllm_backend/all_models/inflight_batcher_llm /opt/tritonserver/.
- Modify config.pbtxt for the preprocessing, postprocessing, and processing steps. The following script does a minimal configuration to run tritonserver, but if you want optimal performance or custom parameters, read the details in the documentation and perf_best_practices. A sketch of what this template filling does is shown after these commands:
# preprocessing
TOKENIZER_DIR=/Llama-2-7b-hf/
TOKENIZER_TYPE=auto
DECOUPLED_MODE=false
MODEL_FOLDER=/opt/tritonserver/inflight_batcher_llm
MAX_BATCH_SIZE=4
INSTANCE_COUNT=1
MAX_QUEUE_DELAY_MICROSECONDS=10000
FILL_TEMPLATE_SCRIPT=/tensorrtllm_backend/tools/fill_template.py
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MICROSECONDS},batching_strategy:inflight_fused_batching
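For reference, fill_template.py substitutes ${name} placeholders in the config.pbtxt templates with the values you pass in. A simplified sketch of that mechanism (not the actual script from tensorrtllm_backend) looks like this:

```python
# Simplified sketch of the template-fill step: replace ${key} placeholders
# in a config.pbtxt with the supplied values. This is NOT the actual
# fill_template.py from tensorrtllm_backend, just an illustration.
def fill_template(path: str, assignments: str) -> None:
    with open(path) as f:
        text = f.read()
    # Assignments arrive as "key1:value1,key2:value2,...".
    for pair in assignments.split(","):
        key, value = pair.split(":", 1)
        text = text.replace("${" + key + "}", value)
    with open(path, "w") as f:
        f.write(text)

# e.g. fill_template("ensemble/config.pbtxt", "triton_max_batch_size:4")
```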
- Launch Tritonserver
Use the launch_triton_server.py script. This launches multiple instances of tritonserver
with MPI.
python3 /tensorrtllm_backend/scripts/launch_triton_server.py --world_size=<world size of the engine> --model_repo=/opt/tritonserver/inflight_batcher_llm
You should expect the following response:
...
I0503 22:01:25.210518 1175 grpc_server.cc:2463] Started GRPCInferenceService at 0.0.0.0:8001
I0503 22:01:25.211612 1175 http_server.cc:4692] Started HTTPService at 0.0.0.0:8000
I0503 22:01:25.254914 1175 http_server.cc:362] Started Metrics Service at 0.0.0.0:8002
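Once those services are up, you can verify readiness via Triton's standard HTTP health endpoint, for example with a quick Python check:

```python
# Quick readiness probe against Triton's standard HTTP health endpoint.
import requests

r = requests.get("http://localhost:8000/v2/health/ready")
print("server ready" if r.status_code == 200 else f"not ready: {r.status_code}")
```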
To stop Triton Server inside the container, run:
pkill tritonserver
You can test the results of the run with:
- The inflight_batcher_llm_client.py script.
# Using the SDK container as an example
docker run --rm -it --net host --shm-size=2g \
--ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
-v /path/to/tensorrtllm_backend/inflight_batcher_llm/client:/tensorrtllm_client \
-v /path/to/Llama2/repo:/Llama-2-7b-hf \
nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk
# Install extra dependencies for the script
pip3 install transformers sentencepiece
python3 /tensorrtllm_client/inflight_batcher_llm_client.py --request-output-len 50 --tokenizer-dir /Llama-2-7b-hf/ --text "What is ML?"
You should expect the following response:
...
Input: What is ML?
Output beam 0: ML is a branch of AI that allows computers to learn from data, identify patterns, and make predictions. It is a powerful tool that can be used in a variety of industries, including healthcare, finance, and transportation.
...
- The generate endpoint.
curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is ML?", "max_tokens": 50, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}'
You should expect the following response:
{"context_logits":0.0,...,"text_output":"What is ML?\nML is a branch of AI that allows computers to learn from data, identify patterns, and make predictions. It is a powerful tool that can be used in a variety of industries, including healthcare, finance, and transportation."}
For more examples, feel free to refer to the End to end workflow to run llama.