
Commit

Update TensorRT-LLM (#422)
* Update TensorRT-LLM

---------

Co-authored-by: Tltin <[email protected]>
Co-authored-by: zhaohb <[email protected]>
Co-authored-by: Bradley Heilbrun <[email protected]>
Co-authored-by: nqbao11 <[email protected]>
Co-authored-by: Nikhil Varghese <[email protected]>
6 people authored Nov 17, 2023
1 parent ab7b461 commit 6755a3f
Showing 225 changed files with 14,014 additions and 6,906 deletions.
2 changes: 1 addition & 1 deletion 3rdparty/cutlass
Submodule cutlass updated 2041 files
2 changes: 1 addition & 1 deletion 3rdparty/json
Submodule json updated 165 files
61 changes: 50 additions & 11 deletions README.md
@@ -43,17 +43,22 @@ H200 FP8 achieves 11,819 tok/s on Llama2-13B on a single GPU, and is up to 1.9x
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Support Matrix](#support-matrix)
- [Devices](#devices)
- [Precision](#precision)
- [Key Features](#key-features)
- [Models](#models)
- [Performance](#performance)
- [Advanced Topics](#advanced-topics)
- [Quantization](#quantization)
- [In-flight Batching](#in-flight-batching)
- [Attention](#attention)
- [Graph Rewriting](#graph-rewriting)
- [Benchmarking](#benchmarking)
- [Benchmark](#benchmark)
- [Troubleshooting](#troubleshooting)
- [Release Notes](#release-notes)
- [Changelog](#changelog)
- [Known issues](#known-issues)
- [Release notes](#release-notes)
- [Change Log](#change-log)
- [Known Issues](#known-issues)
- [Report Issues](#report-issues)

## TensorRT-LLM Overview

@@ -154,14 +159,14 @@ See the BLOOM [example](examples/bloom) for more details and options regarding t

***3. Run***

The `summarize.py` script can be used to perform the summarization of articles
The `../summarize.py` script can be used to perform the summarization of articles
from the CNN Daily dataset:

```python
python summarize.py --test_trt_llm \
--hf_model_location ./bloom/560M/ \
--data_type fp16 \
--engine_dir ./bloom/560M/trt_engines/fp16/1-gpu/
python ../summarize.py --test_trt_llm \
--hf_model_dir ./bloom/560M/ \
--data_type fp16 \
--engine_dir ./bloom/560M/trt_engines/fp16/1-gpu/
```

More details about the script and how to run the BLOOM model can be found in
@@ -237,19 +242,26 @@ The list of supported models is:
* [Bert](examples/bert)
* [Blip2](examples/blip2)
* [BLOOM](examples/bloom)
* [ChatGLM](examples/chatglm), including ChatGLM-6B, ChatGLM2-6B, ChatGLM2-6B-32k, ChatGLM3-6B, ChatGLM3-6B-32k
* [ChatGLM](examples/chatglm)
* [Falcon](examples/falcon)
* [Flan-T5](examples/enc_dec)
* [GPT](examples/gpt)
* [GPT-J](examples/gptj)
* [GPT-Nemo](examples/gpt)
* [GPT-NeoX](examples/gptneox)
* [InternLM](examples/internlm)
* [LLaMA](examples/llama)
* [LLaMA-v2](examples/llama)
* [Mistral](examples/llama)
* [MPT](examples/mpt)
* [OPT](examples/opt)
* [Qwen](examples/qwen)
* [Replit Code](examples/mpt)
* [SantaCoder](examples/gpt)
* [StarCoder](examples/gpt)
* [InternLM](examples/internlm)
* [T5](examples/enc_dec)

Note: [Encoder-Decoder](examples/enc_dec/) provides general encoder-decoder support that covers many encoder-decoder models such as T5 and Flan-T5. We unroll the exact model names in the list above so that users can find specific models more easily.

## Performance

@@ -311,6 +323,33 @@ may happen. One possible solution is to reduce the amount of memory needed by
reducing the maximum batch size, input and output lengths. Another option is to
enable plugins, for example: `--use_gpt_attention_plugin`.

* MPI + Slurm

TensorRT-LLM is an [MPI](https://en.wikipedia.org/wiki/Message_Passing_Interface)-aware package that uses [`mpi4py`](https://mpi4py.readthedocs.io/en/stable/). If you are running scripts in a [Slurm](https://slurm.schedmd.com/) environment, you might encounter interference such as:
```
--------------------------------------------------------------------------
PMI2_Init failed to initialize. Return code: 14
--------------------------------------------------------------------------
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:
version 16.05 or later: you can use SLURM's PMIx support. This
requires that you configure and build SLURM --with-pmix.
Versions earlier than 16.05: you must use either SLURM's PMI-1 or
PMI-2 support. SLURM builds PMI-1 by default, or you can manually
install PMI-2. You must then build Open MPI using --with-pmi pointing
to the SLURM PMI library location.
Please configure as appropriate and try again.
--------------------------------------------------------------------------
```
As a rule of thumb, if you are running TensorRT-LLM interactively on a Slurm node, prefix your commands with `mpirun -n 1` to run TensorRT-LLM in a dedicated MPI environment, not the one provided by your Slurm allocation.
For example: `mpirun -n 1 python3 examples/gpt/build.py ...`
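
As a quick sanity check, you can verify which MPI environment a command actually sees before launching real workloads. The snippet below is only an illustrative sketch that exercises the `mpi4py` dependency mentioned above:

```
# Probe the MPI world created by mpirun itself, independent of the Slurm
# allocation; with "-n 1" a single-process world (size 1) is expected.
mpirun -n 1 python3 -c "from mpi4py import MPI; print(MPI.COMM_WORLD.Get_size())"
```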

## Release notes

* TensorRT-LLM requires TensorRT 9.1.0.4 and 23.08 containers.
16 changes: 7 additions & 9 deletions benchmarks/cpp/README.md
@@ -7,18 +7,14 @@ multiple GPUs or multiple nodes with multiple GPUs.

### 1. Build TensorRT-LLM and benchmarking source code

Please follow the [`installation document`](../../../README.md) to build TensorRT-LLM.
Please follow the [`installation document`](../../docs/source/installation.md) to build TensorRT-LLM.

Note that the benchmarking source code for the C++ runtime is not built by default; use the `--benchmarks` argument of [`build_wheel.py`](../../scripts/build_wheel.py) to build it.
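
For example, a build from the repository root might look like the following sketch (illustrative only; depending on your environment, `build_wheel.py` may need additional flags, such as the location of your TensorRT installation):

```
# Build the TensorRT-LLM wheel together with the C++ benchmarking binaries.
python3 ./scripts/build_wheel.py --benchmarks
```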

Windows users: Follow the
[`Windows installation document`](../../../windows/README.md)
[`Windows installation document`](../../windows/README.md)
instead, and be sure to set DLL paths as specified in
[Extra Steps for C++ Runtime Usage](../../../windows/README.md#extra-steps-for-c-runtime-usage).

After that, you can build benchmarking source code for C++ runtime
```
cd cpp/build
make -j benchmarks
```
[Extra Steps for C++ Runtime Usage](../../windows/README.md#extra-steps-for-c-runtime-usage).

### 2. Launch C++ benchmarking (Fixed BatchSize/InputLen/OutputLen)

@@ -59,6 +55,8 @@ mpirun -n 8 ./benchmarks/gptSessionBenchmark \
# [BENCHMARK] batch_size 1 input_length 60 output_length 20 latency(ms) 792.14
```

If you want to obtain context and generation logits, build the engine with `--gather_all_token_logits` and run `gptSessionBenchmark` with `--print_all_logits`. This prints a large number of logit values and can noticeably affect performance.
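
A minimal sketch of the two steps (paths, model name, and batch/length settings below are placeholders; the engine build otherwise follows the usual `examples/gpt/build.py` workflow):

```
# 1. Build an engine that also gathers context and generation logits.
python3 examples/gpt/build.py --gather_all_token_logits ...

# 2. Benchmark it and print every logit (expect verbose output and some
#    performance overhead).
./benchmarks/gptSessionBenchmark \
    --model gpt_350m \
    --engine_dir "path/to/engine_dir" \
    --batch_size "1" \
    --input_output_len "60,20" \
    --print_all_logits
```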

*Please note that the expected outputs in that document are only for reference; specific performance numbers depend on the GPU you're using.*

### 3. Launch Batch Manager benchmarking (Inflight/V1 batching)
45 changes: 42 additions & 3 deletions benchmarks/cpp/gptSessionBenchmark.cpp
@@ -18,6 +18,7 @@
#include "tensorrt_llm/plugins/api/tllmPlugin.h"
#include "tensorrt_llm/runtime/gptJsonConfig.h"
#include "tensorrt_llm/runtime/gptSession.h"
#include "tensorrt_llm/runtime/iTensor.h"
#include "tensorrt_llm/runtime/memoryCounters.h"
#include "tensorrt_llm/runtime/tllmLogger.h"

@@ -37,7 +38,7 @@ namespace
void benchmarkGptSession(std::string const& modelName, std::filesystem::path const& dataPath,
std::vector<int> const& batchSizes, int beamWidth, std::vector<std::vector<int>> const& inOutLen,
std::shared_ptr<nvinfer1::ILogger> const& logger, int warmUp, int numRuns, int duration,
GptSession::Config& sessionConfig, bool cudaGraphMode)
GptSession::Config& sessionConfig, bool cudaGraphMode, bool printAllLogits)
{

std::string modelNameHyphen = modelName;
@@ -60,7 +61,6 @@ void benchmarkGptSession(std::string const& modelName, std::filesystem::path con

SamplingConfig samplingConfig{beamWidth};
samplingConfig.temperature = std::vector{1.0f};
samplingConfig.minLength = std::vector{1};
samplingConfig.randomSeed = std::vector{42ull};
samplingConfig.topK = std::vector{1};
samplingConfig.topP = std::vector{0.0f};
@@ -77,6 +77,7 @@ void benchmarkGptSession(std::string const& modelName, std::filesystem::path con
auto const maxNewTokens = inOut[1];

sessionConfig.maxSequenceLength = maxInputLength + maxNewTokens;
samplingConfig.minLength = std::vector{maxNewTokens};

GptSession session{sessionConfig, modelConfig, worldConfig, enginePath.string(), logger};

@@ -102,6 +103,7 @@ void benchmarkGptSession(std::string const& modelName, std::filesystem::path con
// copy inputs and wrap into shared_ptr
GenerationInput::TensorPtr inputIds;
std::vector<int32_t> inputsHost(batchSize * maxInputLength, padId);

if (inputPacked)
{
inputIds = bufferManager.copyFrom(
@@ -123,6 +125,17 @@ void benchmarkGptSession(std::string const& modelName, std::filesystem::path con
bufferManager.emptyTensor(MemoryType::kGPU, nvinfer1::DataType::kINT32),
bufferManager.emptyTensor(MemoryType::kGPU, nvinfer1::DataType::kINT32)};

if (session.getModelConfig().computeContextLogits())
{
generationOutput.contextLogits
= bufferManager.emptyTensor(MemoryType::kGPU, nvinfer1::DataType::kFLOAT);
}
if (session.getModelConfig().computeGenerationLogits())
{
generationOutput.generationLogits
= bufferManager.emptyTensor(MemoryType::kGPU, nvinfer1::DataType::kFLOAT);
bufferManager.setZero(*generationOutput.generationLogits);
}
TLLM_LOG_INFO(memoryCounter.toString());

for (auto r = 0; r < warmUp; ++r)
@@ -168,6 +181,30 @@ void benchmarkGptSession(std::string const& modelName, std::filesystem::path con
"%.2f\n",
batchSize, maxInputLength, maxNewTokens, averageLatency, tokensPerSec);
}

// logits are stored on the last rank
if (worldConfig.getRank() == worldConfig.getSize() - 1)
{
if (session.getModelConfig().computeContextLogits() && printAllLogits)
{
std::cout << "generationOutput.contextLogits.shape: "
<< generationOutput.contextLogits->getShape()
<< std::endl; // (batchsize, prompt_len, vocabsize)
std::cout << "generationOutput.contextLogits" << *generationOutput.contextLogits << std::endl;
}

if (session.getModelConfig().computeGenerationLogits() && printAllLogits)
{
std::cout << "generationOutput.generationLogits.shape: "
<< generationOutput.generationLogits->getShape()
<< std::endl; // (batchsize, beamwidth, maxNewTokens-1, vocabsize)
generationOutput.generationLogits->reshape(ITensor::makeShape({batchSize * beamWidth,
maxNewTokens - 1, modelConfig.getVocabSizePadded(worldConfig.getSize())}));

std::cout << "generationOutput.generationLogits: " << *generationOutput.generationLogits
<< std::endl;
}
}
}
catch (std::runtime_error& e)
{
@@ -231,6 +268,7 @@ int main(int argc, char* argv[])
"kv_cache_free_gpu_mem_fraction", "K-V Cache Free Gpu Mem Fraction.", cxxopts::value<float>());

options.add_options()("enable_cuda_graph", "Execute GPT session with CUDA graph.");
options.add_options()("print_all_logits", "Print all context and generation logits.");

auto result = options.parse(argc, argv);

@@ -328,14 +366,15 @@

// Argument: Enable CUDA graph
auto enableCudaGraph = result.count("enable_cuda_graph") > 0;
auto printAllLogits = result.count("print_all_logits") > 0;

initTrtLlmPlugins(logger.get());

try
{
benchmarkGptSession(result["model"].as<std::string>(), result["engine_dir"].as<std::string>(), batchSizes,
beamWidth, inOutLen, logger, result["warm_up"].as<int>(), result["num_runs"].as<int>(),
result["duration"].as<int>(), sessionConfig, enableCudaGraph);
result["duration"].as<int>(), sessionConfig, enableCudaGraph, printAllLogits);
}
catch (const std::exception& e)
{
6 changes: 3 additions & 3 deletions benchmarks/python/allowed_configs.py
@@ -353,7 +353,7 @@ class ModelConfig(BaseModel):
builder_opt=None,
)),
"chatglm_6b":
ModelConfig(name="chatglm-6b",
ModelConfig(name="chatglm_6b",
family="chatglm",
benchmark_type="gpt",
build_config=BuildConfig(
@@ -370,7 +370,7 @@
remove_input_padding=False,
)),
"chatglm2_6b":
ModelConfig(name="chatglm2-6b",
ModelConfig(name="chatglm2_6b",
family="chatglm2",
benchmark_type="gpt",
build_config=BuildConfig(
@@ -387,7 +387,7 @@
remove_input_padding=False,
)),
"chatglm3_6b":
ModelConfig(name="chatglm3-6b",
ModelConfig(name="chatglm3_6b",
family="chatglm3",
benchmark_type="gpt",
build_config=BuildConfig(
20 changes: 6 additions & 14 deletions benchmarks/python/gpt_benchmark.py
@@ -143,7 +143,7 @@ def __init__(self,
quant_mode=self.quant_mode,
use_custom_all_reduce=self.enable_custom_all_reduce,
)
if model_name == 'chatglm-6b':
if model_name == 'chatglm_6b':
self.sampling_config = tensorrt_llm.runtime.SamplingConfig(
end_id=130005,
pad_id=3,
@@ -152,16 +152,7 @@ def __init__(self,
top_p=top_p)
self.decoder = tensorrt_llm.runtime.ChatGLMGenerationSession(
model_config, engine_buffer, self.runtime_mapping)
elif model_name == 'chatglm2-6b':
self.sampling_config = tensorrt_llm.runtime.SamplingConfig(
end_id=2,
pad_id=0,
num_beams=num_beams,
top_k=top_k,
top_p=top_p)
self.decoder = tensorrt_llm.runtime.GenerationSession(
model_config, engine_buffer, self.runtime_mapping)
elif model_name == 'chatglm3-6b':
elif model_name in ['chatglm2_6b', 'chatglm3_6b']:
self.sampling_config = tensorrt_llm.runtime.SamplingConfig(
end_id=2,
pad_id=0,
@@ -402,7 +393,7 @@ def build(self):
apply_query_key_layer_scaling=builder_config.
apply_query_key_layer_scaling,
quant_mode=self.quant_mode,
model_version="1")
model_name="chatglm_6b")
elif family == "chatglm2":
tensorrt_llm_model = tensorrt_llm.models.ChatGLMHeadModel(
num_layers=self.num_layers,
@@ -418,7 +409,7 @@ def build(self):
apply_query_key_layer_scaling=builder_config.
apply_query_key_layer_scaling,
quant_mode=self.quant_mode,
model_version="2")
model_name="chatglm2_6b")
elif family == "chatglm3":
tensorrt_llm_model = tensorrt_llm.models.ChatGLMHeadModel(
num_layers=self.num_layers,
@@ -434,7 +425,7 @@ def build(self):
apply_query_key_layer_scaling=builder_config.
apply_query_key_layer_scaling,
quant_mode=self.quant_mode,
model_version="3")
model_name="chatglm3_6b")
elif family == "bloom":
tensorrt_llm_model = tensorrt_llm.models.BloomForCausalLM(
num_layers=self.num_layers,
@@ -458,6 +449,7 @@ def build(self):
max_position_embeddings=self.n_positions,
dtype=kv_dtype,
bias=self.bias,
quant_mode=self.quant_mode,
use_alibi=self.use_alibi,
new_decoder_architecture=self.new_decoder_architecture,
parallel_attention=self.parallel_attention,
2 changes: 1 addition & 1 deletion benchmarks/python/mem_monitor.py
@@ -22,7 +22,7 @@ def get_memory_info(handle):
version=pynvml.nvmlMemory_v2)
total = round(mem_info.total / 1024 / 1024 / 1024, 2)
used = round(mem_info.used / 1024 / 1024 / 1024, 2)
free = round(mem_info.used / 1024 / 1024 / 1024, 2)
free = round(mem_info.free / 1024 / 1024 / 1024, 2)
return total, used, free


18 changes: 18 additions & 0 deletions cpp/CMakeLists.txt
@@ -237,6 +237,24 @@ if(WIN32)
set(CMAKE_CXX_FLAGS "/DNOMINMAX ${CMAKE_CXX_FLAGS}")
endif()

if((MSVC))
if((MSVC_VERSION GREATER_EQUAL 1914))
# MSVC does not apply the correct __cplusplus version per the C++ standard
# by default. This is required for compiling CUTLASS 3.0 kernels on windows
# with C++-17 constexpr enabled. The 2017 15.7 MSVC adds /Zc:__cplusplus to
# set __cplusplus to 201703 with std=c++17. See
# https://learn.microsoft.com/en-us/cpp/build/reference/zc-cplusplus for
# more info.
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /Zc:__cplusplus")
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -Xcompiler /Zc:__cplusplus")
else()
message(
FATAL_ERROR
"Build is only supported with Visual Studio 2017 version 15.7 or higher"
)
endif()
endif()

set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --expt-extended-lambda")
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --expt-relaxed-constexpr")
if(FAST_MATH)