- Overview
- Install From the Wheel Package
- Fetch the Sources
- Build TensorRT-LLM in One Step
- Build Step-by-step
This document contains instructions to install TensorRT-LLM for Linux. We recommend the use of Docker to build and run TensorRT-LLM. Instructions to install an environment to run Docker containers for the NVIDIA platform can be found here.
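Before proceeding, it can be useful to verify that Docker containers can actually access the GPUs through the NVIDIA Container Toolkit. A minimal sanity check, assuming the public CUDA base image nvidia/cuda:12.2.0-base-ubuntu22.04 (any CUDA base image works):
# Check that the NVIDIA Container Toolkit exposes the GPUs to containers.
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
If nvidia-smi lists the expected GPUs, the container runtime is set up correctly.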
After installing CUDA 12.2 according to the instructions, please execute the following commands to install TensorRT-LLM.
# Install dependencies, TensorRT-LLM requires Python 3.10
apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev
# Install the latest version of TensorRT-LLM
pip3 install tensorrt_llm -U --extra-index-url https://pypi.nvidia.com
# Check installation
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
Note that users who have debugging needs or use the GNU C++11 ABI need to compile TensorRT-LLM from source.
The first step to build TensorRT-LLM is to fetch the sources:
# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
apt-get update && apt-get -y install git git-lfs
git lfs install
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
git lfs pull
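Optionally, confirm that the files tracked by git-lfs were downloaded:
# List a few LFS-tracked files to verify the pull succeeded.
# Entries marked with * are present locally; entries marked with - are still pointers.
git lfs ls-files | head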
Note: There are two options to create a TensorRT-LLM Docker image. The approximate disk space required to build the image is 63 GB.
TensorRT-LLM contains a simple command to create a Docker image:
make -C docker release_build
It is possible to add the optional argument CUDA_ARCHS="<list of architectures in CMake format>"
to specify which architectures should be supported by
TensorRT-LLM. It restricts the supported GPU architectures but helps reduce
compilation time:
# Restrict the compilation to Ada and Hopper architectures.
make -C docker release_build CUDA_ARCHS="89-real;90-real"
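If you are unsure which architectures to target, recent NVIDIA drivers can report the compute capability of the installed GPUs (for example, 8.9 corresponds to "89-real"); this assumes a driver whose nvidia-smi supports the compute_cap query field:
# Query the compute capability of the local GPUs to pick CUDA_ARCHS values.
nvidia-smi --query-gpu=name,compute_cap --format=csv
The same values can later be passed to the --cuda_architectures option of build_wheel.py described below.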
Once the image is built, the Docker container can be executed using:
make -C docker release_run
The make command supports the LOCAL_USER=1 argument to switch to the local user account instead of root inside the container. The examples of TensorRT-LLM are installed in the directory /app/tensorrt_llm/examples.
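The two options can be combined; for instance, to start the release container as the local user and inspect the bundled examples (the path follows the note above):
# Start the release container as the local (non-root) user.
make -C docker release_run LOCAL_USER=1
# Inside the container, list the installed examples.
ls /app/tensorrt_llm/examples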
For users looking for more flexibility, TensorRT-LLM has commands to create and run a development container in which TensorRT-LLM can be built.
The following command creates a Docker image for development:
make -C docker build
The image will be tagged locally with tensorrt_llm/devel:latest. To run the container, use the following command:
make -C docker run
For users who prefer to work with their own user account in that container instead of root, the option LOCAL_USER=1 must be added to the command above:
make -C docker run LOCAL_USER=1
On systems without GNU make
or shell support, the Docker image for
development can be built using:
docker build --pull \
--target devel \
--file docker/Dockerfile.multi \
--tag tensorrt_llm/devel:latest \
.
The container can then be run using:
docker run --rm -it \
--ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all \
--volume ${PWD}:/code/tensorrt_llm \
--workdir /code/tensorrt_llm \
tensorrt_llm/devel:latest
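The --gpus=all flag exposes every GPU to the container. If only a subset should be visible, Docker accepts a device selector instead; a variant of the command above exposing only the first GPU (adjust the device index as needed):
# Run the development container with access restricted to GPU 0.
docker run --rm -it \
  --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus '"device=0"' \
  --volume ${PWD}:/code/tensorrt_llm \
  --workdir /code/tensorrt_llm \
  tensorrt_llm/devel:latest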
Once in the container, TensorRT-LLM can be built from source using:
# To build the TensorRT-LLM code.
python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt
# Deploy TensorRT-LLM in your environment.
pip install ./build/tensorrt_llm*.whl
By default, build_wheel.py
enables incremental builds. To clean the build
directory, add the --clean
option:
python3 ./scripts/build_wheel.py --clean --trt_root /usr/local/tensorrt
It is possible to restrict the compilation of TensorRT-LLM to specific CUDA architectures. For that purpose, the build_wheel.py script accepts a semicolon-separated list of CUDA architectures, as shown in the following example:
# Build TensorRT-LLM for Ampere.
python3 ./scripts/build_wheel.py --cuda_architectures "80-real;86-real" --trt_root /usr/local/tensorrt
The list of supported architectures can be found in the
CMakeLists.txt
file.
The C++ Runtime, in particular GptSession, can be exposed to Python via bindings. This is currently an opt-in feature which needs to be activated explicitly at compile time. The corresponding option --python_bindings can be passed to build_wheel.py in the standard way:
python3 ./scripts/build_wheel.py --python_bindings --trt_root /usr/local/tensorrt
After installing the resulting wheel as described above, the C++ Runtime bindings will be available in the package tensorrt_llm.bindings. Running help on this package in a Python interpreter will provide an overview of the relevant classes. The associated unit tests should also be consulted to understand the API.
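Once the wheel with bindings is installed, a quick way to check that they are importable and to see what they expose (the exact set of symbols depends on the version):
# List the public symbols exported by the C++ runtime bindings.
python3 -c "import tensorrt_llm.bindings as b; print([n for n in dir(b) if not n.startswith('_')])"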
The build_wheel.py
script will also compile the library containing the C++
runtime of TensorRT-LLM. If Python support and torch
modules are not
required, the script provides the option --cpp_only
which restricts the build
to the C++ runtime only:
python3 ./scripts/build_wheel.py --cuda_architectures "80-real;86-real" --cpp_only --clean
This is particularly useful to avoid linking problems which may be introduced
by particular versions of torch
related to the dual ABI support of
GCC. The
option --clean will remove the build directory before building. The default build directory is cpp/build, which may be overridden using the option --build_dir. Run build_wheel.py --help for an overview of all supported options.
Clients may choose to link against the shared or the static version of the library. These libraries can be found in the following locations:
cpp/build/tensorrt_llm/libtensorrt_llm.so
cpp/build/tensorrt_llm/libtensorrt_llm_static.a
In addition, one needs to link against the library containing the LLM plugins for TensorRT available here:
cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
When using TensorRT-LLM, you need to add the cpp
and cpp/include
directories to the project's include paths. Only header files contained in
cpp/include
are part of the supported API and may be directly included. Other
headers contained under cpp
should not be included directly since they might
change in future versions.
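As an illustration, here is a minimal sketch of compiling and linking a client against the shared library, assuming a hypothetical main.cpp, a checkout rooted at the current directory, and additional dependencies (CUDA, TensorRT, MPI) resolved separately; the exact flags required by a real project will differ:
# Hypothetical client build: include paths and libraries per the notes above.
g++ -std=c++17 main.cpp \
  -Icpp -Icpp/include \
  -Lcpp/build/tensorrt_llm -ltensorrt_llm \
  -Lcpp/build/tensorrt_llm/plugins -lnvinfer_plugin_tensorrt_llm \
  -o tensorrt_llm_client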
For examples of how to use the C++ runtime, see the unit tests in gptSessionTest.cpp and the related CMakeLists.txt file.