
Commit

Update TensorRT-LLM (#422)
* Update TensorRT-LLM

---------

Co-authored-by: Tltin <[email protected]>
Co-authored-by: zhaohb <[email protected]>
Co-authored-by: Bradley Heilbrun <[email protected]>
Co-authored-by: nqbao11 <[email protected]>
Co-authored-by: Nikhil Varghese <[email protected]>
6 people authored Nov 17, 2023
1 parent ab7b461 commit 6755a3f
Showing 225 changed files with 14,014 additions and 6,906 deletions.
2 changes: 1 addition & 1 deletion 3rdparty/cutlass
Submodule cutlass updated 2041 files
2 changes: 1 addition & 1 deletion 3rdparty/json
Submodule json updated 165 files
61 changes: 50 additions & 11 deletions README.md
@@ -43,17 +43,22 @@ H200 FP8 achieves 11,819 tok/s on Llama2-13B on a single GPU, and is up to 1.9x
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Support Matrix](#support-matrix)
- [Devices](#devices)
- [Precision](#precision)
- [Key Features](#key-features)
- [Models](#models)
- [Performance](#performance)
- [Advanced Topics](#advanced-topics)
- [Quantization](#quantization)
- [In-flight Batching](#in-flight-batching)
- [Attention](#attention)
- [Graph Rewriting](#graph-rewriting)
- [Benchmarking](#benchmarking)
- [Benchmark](#benchmark)
- [Troubleshooting](#troubleshooting)
- [Release Notes](#release-notes)
- [Changelog](#changelog)
- [Known issues](#known-issues)
- [Release notes](#release-notes)
- [Change Log](#change-log)
- [Known Issues](#known-issues)
- [Report Issues](#report-issues)

## TensorRT-LLM Overview

@@ -154,14 +159,14 @@ See the BLOOM [example](examples/bloom) for more details and options regarding t

***3. Run***

The `summarize.py` script can be used to perform the summarization of articles
The `../summarize.py` script can be used to perform the summarization of articles
from the CNN Daily dataset:

```python
python summarize.py --test_trt_llm \
--hf_model_location ./bloom/560M/ \
--data_type fp16 \
--engine_dir ./bloom/560M/trt_engines/fp16/1-gpu/
python ../summarize.py --test_trt_llm \
--hf_model_dir ./bloom/560M/ \
--data_type fp16 \
--engine_dir ./bloom/560M/trt_engines/fp16/1-gpu/
```

More details about the script and how to run the BLOOM model can be found in
@@ -237,19 +242,26 @@ The list of supported models is:
* [Bert](examples/bert)
* [Blip2](examples/blip2)
* [BLOOM](examples/bloom)
* [ChatGLM](examples/chatglm), including ChatGLM-6B, ChatGLM2-6B, ChatGLM2-6B-32k, ChatGLM3-6B, ChatGLM3-6B-32k
* [ChatGLM](examples/chatglm)
* [Falcon](examples/falcon)
* [Flan-T5](examples/enc_dec)
* [GPT](examples/gpt)
* [GPT-J](examples/gptj)
* [GPT-Nemo](examples/gpt)
* [GPT-NeoX](examples/gptneox)
* [InternLM](examples/internlm)
* [LLaMA](examples/llama)
* [LLaMA-v2](examples/llama)
* [Mistral](examples/llama)
* [MPT](examples/mpt)
* [OPT](examples/opt)
* [Qwen](examples/qwen)
* [Replit Code](examples/mpt)
* [SantaCoder](examples/gpt)
* [StarCoder](examples/gpt)
* [InternLM](examples/internlm)
* [T5](examples/enc_dec)

Note: [Encoder-Decoder](examples/enc_dec/) provides general encoder-decoder support that covers many encoder-decoder models such as T5 and Flan-T5. We unroll the exact model names in the list above so that users can find specific models more easily.

## Performance

@@ -311,6 +323,33 @@ may happen. One possible solution is to reduce the amount of memory needed by
reducing the maximum batch size, input and output lengths. Another option is to
enable plugins, for example: `--use_gpt_attention_plugin`.

* MPI + Slurm

TensorRT-LLM is an [MPI](https://en.wikipedia.org/wiki/Message_Passing_Interface)-aware package that uses [`mpi4py`](https://mpi4py.readthedocs.io/en/stable/). If you are running scripts in a [Slurm](https://slurm.schedmd.com/) environment, you might encounter interference such as:
```
--------------------------------------------------------------------------
PMI2_Init failed to initialize. Return code: 14
--------------------------------------------------------------------------
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:
version 16.05 or later: you can use SLURM's PMIx support. This
requires that you configure and build SLURM --with-pmix.
Versions earlier than 16.05: you must use either SLURM's PMI-1 or
PMI-2 support. SLURM builds PMI-1 by default, or you can manually
install PMI-2. You must then build Open MPI using --with-pmi pointing
to the SLURM PMI library location.
Please configure as appropriate and try again.
--------------------------------------------------------------------------
```
As a rule of thumb, if you are running TensorRT-LLM interactively on a Slurm node, prefix your commands with `mpirun -n 1` to run TensorRT-LLM in a dedicated MPI environment, not the one provided by your Slurm allocation.
For example: `mpirun -n 1 python3 examples/gpt/build.py ...`
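
As a quick sanity check, you can verify which MPI environment a command actually sees before launching real workloads. The snippet below is only an illustrative sketch that exercises the `mpi4py` dependency mentioned above:

```
# Probe the MPI world created by mpirun itself, independent of the Slurm
# allocation; with "-n 1" a single-process world (size 1) is expected.
mpirun -n 1 python3 -c "from mpi4py import MPI; print(MPI.COMM_WORLD.Get_size())"
```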

## Release notes

* TensorRT-LLM requires TensorRT 9.1.0.4 and 23.08 containers.
16 changes: 7 additions & 9 deletions benchmarks/cpp/README.md
@@ -7,18 +7,14 @@ multiple GPUs or multiple nodes with multiple GPUs.

### 1. Build TensorRT-LLM and benchmarking source code

Please follow the [`installation document`](../../../README.md) to build TensorRT-LLM.
Please follow the [`installation document`](../../docs/source/installation.md) to build TensorRT-LLM.

Note that the benchmarking source code for the C++ runtime is not built by default; use the `--benchmarks` argument of [`build_wheel.py`](../../scripts/build_wheel.py) to build it.
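
For example, a build from the repository root might look like the following sketch (illustrative only; depending on your environment, `build_wheel.py` may need additional flags, such as the location of your TensorRT installation):

```
# Build the TensorRT-LLM wheel together with the C++ benchmarking binaries.
python3 ./scripts/build_wheel.py --benchmarks
```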

Windows users: Follow the
[`Windows installation document`](../../../windows/README.md)
[`Windows installation document`](../../windows/README.md)
instead, and be sure to set DLL paths as specified in
[Extra Steps for C++ Runtime Usage](../../../windows/README.md#extra-steps-for-c-runtime-usage).

After that, you can build benchmarking source code for C++ runtime
```
cd cpp/build
make -j benchmarks
```
[Extra Steps for C++ Runtime Usage](../../windows/README.md#extra-steps-for-c-runtime-usage).

### 2. Launch C++ benchmarking (Fixed BatchSize/InputLen/OutputLen)

@@ -59,6 +55,8 @@ mpirun -n 8 ./benchmarks/gptSessionBenchmark \
# [BENCHMARK] batch_size 1 input_length 60 output_length 20 latency(ms) 792.14
```

If you want to obtain context and generation logits, build the engine with `--gather_all_token_logits` and run `gptSessionBenchmark` with `--print_all_logits`. This prints a large number of logit values and can noticeably affect performance.
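
A minimal sketch of the two steps (paths, model name, and batch/length settings below are placeholders; the engine build otherwise follows the usual `examples/gpt/build.py` workflow):

```
# 1. Build an engine that also gathers context and generation logits.
python3 examples/gpt/build.py --gather_all_token_logits ...

# 2. Benchmark it and print every logit (expect verbose output and some
#    performance overhead).
./benchmarks/gptSessionBenchmark \
    --model gpt_350m \
    --engine_dir "path/to/engine_dir" \
    --batch_size "1" \
    --input_output_len "60,20" \
    --print_all_logits
```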

*Please note that the expected outputs in that document are only for reference; specific performance numbers depend on the GPU you're using.*

### 3. Launch Batch Manager benchmarking (Inflight/V1 batching)
45 changes: 42 additions & 3 deletions benchmarks/cpp/gptSessionBenchmark.cpp
@@ -18,6 +18,7 @@
#include "tensorrt_llm/plugins/api/tllmPlugin.h"
#include "tensorrt_llm/runtime/gptJsonConfig.h"
#include "tensorrt_llm/runtime/gptSession.h"
#include "tensorrt_llm/runtime/iTensor.h"
#include "tensorrt_llm/runtime/memoryCounters.h"
#include "tensorrt_llm/runtime/tllmLogger.h"

@@ -37,7 +38,7 @@ namespace
void benchmarkGptSession(std::string const& modelName, std::filesystem::path const& dataPath,
std::vector<int> const& batchSizes, int beamWidth, std::vector<std::vector<int>> const& inOutLen,
std::shared_ptr<nvinfer1::ILogger> const& logger, int warmUp, int numRuns, int duration,
GptSession::Config& sessionConfig, bool cudaGraphMode)
GptSession::Config& sessionConfig, bool cudaGraphMode, bool printAllLogits)
{

std::string modelNameHyphen = modelName;
@@ -60,7 +61,6 @@ void benchmarkGptSession(std::string const& modelName, std::filesystem::path con

SamplingConfig samplingConfig{beamWidth};
samplingConfig.temperature = std::vector{1.0f};
samplingConfig.minLength = std::vector{1};
samplingConfig.randomSeed = std::vector{42ull};
samplingConfig.topK = std::vector{1};
samplingConfig.topP = std::vector{0.0f};
@@ -77,6 +77,7 @@ void benchmarkGptSession(std::string const& modelName, std::filesystem::path con
auto const maxNewTokens = inOut[1];

sessionConfig.maxSequenceLength = maxInputLength + maxNewTokens;
samplingConfig.minLength = std::vector{maxNewTokens};

GptSession session{sessionConfig, modelConfig, worldConfig, enginePath.string(), logger};

@@ -102,6 +103,7 @@ void benchmarkGptSession(std::string const& modelName, std::filesystem::path con
// copy inputs and wrap into shared_ptr
GenerationInput::TensorPtr inputIds;
std::vector<int32_t> inputsHost(batchSize * maxInputLength, padId);

if (inputPacked)
{
inputIds = bufferManager.copyFrom(
@@ -123,6 +125,17 @@ void benchmarkGptSession(std::string const& modelName, std::filesystem::path con
bufferManager.emptyTensor(MemoryType::kGPU, nvinfer1::DataType::kINT32),
bufferManager.emptyTensor(MemoryType::kGPU, nvinfer1::DataType::kINT32)};

if (session.getModelConfig().computeContextLogits())
{
generationOutput.contextLogits
= bufferManager.emptyTensor(MemoryType::kGPU, nvinfer1::DataType::kFLOAT);
}
if (session.getModelConfig().computeGenerationLogits())
{
generationOutput.generationLogits
= bufferManager.emptyTensor(MemoryType::kGPU, nvinfer1::DataType::kFLOAT);
bufferManager.setZero(*generationOutput.generationLogits);
}
TLLM_LOG_INFO(memoryCounter.toString());

for (auto r = 0; r < warmUp; ++r)
@@ -168,6 +181,30 @@ void benchmarkGptSession(std::string const& modelName, std::filesystem::path con
"%.2f\n",
batchSize, maxInputLength, maxNewTokens, averageLatency, tokensPerSec);
}

// logits are stored on the last rank
if (worldConfig.getRank() == worldConfig.getSize() - 1)
{
if (session.getModelConfig().computeContextLogits() && printAllLogits)
{
std::cout << "generationOutput.contextLogits.shape: "
<< generationOutput.contextLogits->getShape()
<< std::endl; // (batchsize, prompt_len, vocabsize)
std::cout << "generationOutput.contextLogits" << *generationOutput.contextLogits << std::endl;
}

if (session.getModelConfig().computeGenerationLogits() && printAllLogits)
{
std::cout << "generationOutput.generationLogits.shape: "
<< generationOutput.generationLogits->getShape()
<< std::endl; // (batchsize, beamwidth, maxNewTokens-1, vocabsize)
generationOutput.generationLogits->reshape(ITensor::makeShape({batchSize * beamWidth,
maxNewTokens - 1, modelConfig.getVocabSizePadded(worldConfig.getSize())}));

std::cout << "generationOutput.generationLogits: " << *generationOutput.generationLogits
<< std::endl;
}
}
}
catch (std::runtime_error& e)
{
@@ -231,6 +268,7 @@ int main(int argc, char* argv[])
"kv_cache_free_gpu_mem_fraction", "K-V Cache Free Gpu Mem Fraction.", cxxopts::value<float>());

options.add_options()("enable_cuda_graph", "Execute GPT session with CUDA graph.");
options.add_options()("print_all_logits", "Print all context and generation logits.");

auto result = options.parse(argc, argv);

@@ -328,14 +366,15 @@

// Argument: Enable CUDA graph
auto enableCudaGraph = result.count("enable_cuda_graph") > 0;
auto printAllLogits = result.count("print_all_logits") > 0;

initTrtLlmPlugins(logger.get());

try
{
benchmarkGptSession(result["model"].as<std::string>(), result["engine_dir"].as<std::string>(), batchSizes,
beamWidth, inOutLen, logger, result["warm_up"].as<int>(), result["num_runs"].as<int>(),
result["duration"].as<int>(), sessionConfig, enableCudaGraph);
result["duration"].as<int>(), sessionConfig, enableCudaGraph, printAllLogits);
}
catch (const std::exception& e)
{
6 changes: 3 additions & 3 deletions benchmarks/python/allowed_configs.py
@@ -353,7 +353,7 @@ class ModelConfig(BaseModel):
builder_opt=None,
)),
"chatglm_6b":
ModelConfig(name="chatglm-6b",
ModelConfig(name="chatglm_6b",
family="chatglm",
benchmark_type="gpt",
build_config=BuildConfig(
@@ -370,7 +370,7 @@
remove_input_padding=False,
)),
"chatglm2_6b":
ModelConfig(name="chatglm2-6b",
ModelConfig(name="chatglm2_6b",
family="chatglm2",
benchmark_type="gpt",
build_config=BuildConfig(
@@ -387,7 +387,7 @@
remove_input_padding=False,
)),
"chatglm3_6b":
ModelConfig(name="chatglm3-6b",
ModelConfig(name="chatglm3_6b",
family="chatglm3",
benchmark_type="gpt",
build_config=BuildConfig(
20 changes: 6 additions & 14 deletions benchmarks/python/gpt_benchmark.py
@@ -143,7 +143,7 @@ def __init__(self,
quant_mode=self.quant_mode,
use_custom_all_reduce=self.enable_custom_all_reduce,
)
if model_name == 'chatglm-6b':
if model_name == 'chatglm_6b':
self.sampling_config = tensorrt_llm.runtime.SamplingConfig(
end_id=130005,
pad_id=3,
@@ -152,16 +152,7 @@ def __init__(self,
top_p=top_p)
self.decoder = tensorrt_llm.runtime.ChatGLMGenerationSession(
model_config, engine_buffer, self.runtime_mapping)
elif model_name == 'chatglm2-6b':
self.sampling_config = tensorrt_llm.runtime.SamplingConfig(
end_id=2,
pad_id=0,
num_beams=num_beams,
top_k=top_k,
top_p=top_p)
self.decoder = tensorrt_llm.runtime.GenerationSession(
model_config, engine_buffer, self.runtime_mapping)
elif model_name == 'chatglm3-6b':
elif model_name in ['chatglm2_6b', 'chatglm3_6b']:
self.sampling_config = tensorrt_llm.runtime.SamplingConfig(
end_id=2,
pad_id=0,
@@ -402,7 +393,7 @@ def build(self):
apply_query_key_layer_scaling=builder_config.
apply_query_key_layer_scaling,
quant_mode=self.quant_mode,
model_version="1")
model_name="chatglm_6b")
elif family == "chatglm2":
tensorrt_llm_model = tensorrt_llm.models.ChatGLMHeadModel(
num_layers=self.num_layers,
@@ -418,7 +409,7 @@ def build(self):
apply_query_key_layer_scaling=builder_config.
apply_query_key_layer_scaling,
quant_mode=self.quant_mode,
model_version="2")
model_name="chatglm2_6b")
elif family == "chatglm3":
tensorrt_llm_model = tensorrt_llm.models.ChatGLMHeadModel(
num_layers=self.num_layers,
@@ -434,7 +425,7 @@ def build(self):
apply_query_key_layer_scaling=builder_config.
apply_query_key_layer_scaling,
quant_mode=self.quant_mode,
model_version="3")
model_name="chatglm3_6b")
elif family == "bloom":
tensorrt_llm_model = tensorrt_llm.models.BloomForCausalLM(
num_layers=self.num_layers,
@@ -458,6 +449,7 @@ def build(self):
max_position_embeddings=self.n_positions,
dtype=kv_dtype,
bias=self.bias,
quant_mode=self.quant_mode,
use_alibi=self.use_alibi,
new_decoder_architecture=self.new_decoder_architecture,
parallel_attention=self.parallel_attention,
2 changes: 1 addition & 1 deletion benchmarks/python/mem_monitor.py
@@ -22,7 +22,7 @@ def get_memory_info(handle):
version=pynvml.nvmlMemory_v2)
total = round(mem_info.total / 1024 / 1024 / 1024, 2)
used = round(mem_info.used / 1024 / 1024 / 1024, 2)
free = round(mem_info.used / 1024 / 1024 / 1024, 2)
free = round(mem_info.free / 1024 / 1024 / 1024, 2)
return total, used, free


18 changes: 18 additions & 0 deletions cpp/CMakeLists.txt
@@ -237,6 +237,24 @@ if(WIN32)
set(CMAKE_CXX_FLAGS "/DNOMINMAX ${CMAKE_CXX_FLAGS}")
endif()

if((MSVC))
if((MSVC_VERSION GREATER_EQUAL 1914))
# MSVC does not apply the correct __cplusplus version per the C++ standard
# by default. This is required for compiling CUTLASS 3.0 kernels on windows
# with C++-17 constexpr enabled. The 2017 15.7 MSVC adds /Zc:__cplusplus to
# set __cplusplus to 201703 with std=c++17. See
# https://learn.microsoft.com/en-us/cpp/build/reference/zc-cplusplus for
# more info.
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /Zc:__cplusplus")
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -Xcompiler /Zc:__cplusplus")
else()
message(
FATAL_ERROR
"Build is only supported with Visual Studio 2017 version 15.7 or higher"
)
endif()
endif()

set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --expt-extended-lambda")
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --expt-relaxed-constexpr")
if(FAST_MATH)