TensorRT-LLM Release 0.14.0

Key Features and Enhancements
- Enhanced the `LLM` class in the LLM API, including support for `finish_reason` and `stop_reason`.
- Added `__repr__` methods for the `Module` class, thanks to the contribution from @1ytic in #2191.
- Improved `customAllReduce` performance.
- The draft model can now copy logits directly to the target model's process in `orchestrator` mode. This fast logits copy reduces the delay between draft-token generation and the beginning of target-model inference.
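The `__repr__` support added in #2191 makes module trees print readable summaries. As a rough sketch of how such a recursive `__repr__` is typically written (generic Python; all class and attribute names here are illustrative, not TensorRT-LLM's actual `Module` implementation):

```python
class Module:
    """Minimal module base class; illustrative only."""

    def __init__(self):
        self._modules = {}  # name -> child Module

    def add(self, name, module):
        self._modules[name] = module
        return module

    def __repr__(self):
        # Render children recursively, indenting nested representations.
        lines = []
        for name, child in self._modules.items():
            child_repr = repr(child).replace("\n", "\n  ")
            lines.append(f"  ({name}): {child_repr}")
        header = self.__class__.__name__
        if not lines:
            return f"{header}()"
        return header + "(\n" + "\n".join(lines) + "\n)"


class Linear(Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features

    def __repr__(self):
        return f"Linear(in={self.in_features}, out={self.out_features})"


class MLP(Module):
    def __init__(self):
        super().__init__()
        self.fc = self.add("fc", Linear(16, 64))
        self.proj = self.add("proj", Linear(64, 16))
```

With this sketch, `print(MLP())` lists the `fc` and `proj` children, one indented line each.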
API Changes
- [BREAKING CHANGE] The default `max_batch_size` of the `trtllm-build` command is set to `2048`.
- [BREAKING CHANGE] Removed `builder_opt` from the `BuildConfig` class and the `trtllm-build` command.
- Added logits post-processor support to the `ModelRunnerCpp` class.
- Added an `isParticipant` method to the C++ `Executor` API to check whether the current process is a participant in the executor instance.
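A logits post-processor edits the logits produced for a decoding step before sampling, for example to ban tokens or force an end-of-sequence token. The snippet below is a generic, pure-Python illustration of the idea only; the function name and the single-argument signature are hypothetical and do not match the actual `ModelRunnerCpp` interface:

```python
import math

def ban_tokens_post_processor(banned_ids):
    """Build a post-processor that masks out the given token ids.

    Hypothetical shape: the runtime would call processor(logits) on the
    raw logits of one decoding step and sample from the result.
    """
    def processor(logits):
        out = list(logits)  # copy; leave the caller's buffer untouched
        for tid in banned_ids:
            out[tid] = -math.inf  # probability 0 after softmax
        return out
    return processor

processor = ban_tokens_post_processor(banned_ids=[1])
masked = processor([0.2, 3.5, 1.1])
best = max(range(len(masked)), key=masked.__getitem__)
```

After masking, greedy sampling picks the best non-banned token instead of token 1.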
Model Updates
- Added support for Nemotron-NAS. For details, refer to `examples/nemotron_nas/README.md`.
- Added support for DeepSeek-V1. For details, refer to `examples/deepseek_v1/README.md`.
- Added support for Phi models. For details, refer to `examples/phi/README.md`.
Fixed Issues
- Fixed an issue in `tensorrt_llm/models/model_weights_loader.py`, thanks to the contribution from @wangkuiyi in #2152.
- Fixed a duplicated import in `tensorrt_llm/runtime/generation.py`, thanks to the contribution from @lkm2835 in #2182.
- Fixed the `share_embedding` check for models that have no `lm_head` in the legacy checkpoint conversion path, thanks to the contribution from @lkm2835 in #2232.
- Fixed a `kv_cache_type` issue in the Python benchmark, thanks to the contribution from @qingquansong in #2219.
- Fixed `trtllm-build --fast-build` ignoring transformer layers when used with fake or random weights. Thanks to @ZJLi2013 for flagging it in #2135.
- Fixed missing `use_fused_mlp` when constructing `BuildConfig` from a dict, thanks to the fix from @ethnzhng in #2081.
- Fixed the batch layout of `numNewTokensCumSum`, which made lookahead decoding nondeterministic and wrong after the first call to `runner.generate` (#2263).
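For background on the `numNewTokensCumSum` fix: lookahead and other speculative decoders accept a variable number of new tokens per request each step, and runtimes commonly flatten all accepted tokens into one buffer indexed by an exclusive prefix sum of the per-request counts. A pure-Python sketch of that layout (illustrative only; the function names are hypothetical, and this is not TensorRT-LLM's internal code):

```python
from itertools import accumulate

def cum_sum_layout(num_new_tokens):
    """Exclusive prefix sum: offsets[i] is where request i's tokens
    start in the flattened buffer; offsets[-1] is the total count."""
    return [0] + list(accumulate(num_new_tokens))

def tokens_for_request(flat_tokens, offsets, i):
    """Slice request i's accepted tokens out of the flattened buffer."""
    return flat_tokens[offsets[i]:offsets[i + 1]]

# Three requests accepted 2, 1, and 3 draft tokens this step.
offsets = cum_sum_layout([2, 1, 3])  # [0, 2, 3, 6]
flat = [11, 12, 21, 31, 32, 33]
third = tokens_for_request(flat, offsets, 2)  # [31, 32, 33]
```

If this cumulative-sum array is laid out incorrectly for the batch, each request reads the wrong slice of the buffer, which is why the bug surfaced as nondeterministic and wrong lookahead output.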
Infrastructure Changes

Documentation
Known Issues