TensorRT-LLM v0.12 Update #2164

Merged
merged 2 commits into from
Aug 29, 2024

Shixiaowei02
Collaborator

TensorRT-LLM Release 0.12.0

Key Features and Enhancements

  • Supported LoRA for MoE models.
  • The ModelWeightsLoader is enabled for LLaMA family models (experimental), see docs/source/architecture/model-weights-loader.md.
  • Supported FP8 FMHA for NVIDIA Ada Lovelace Architecture.
  • Supported GPT-J, Phi, Phi-3, Qwen, GPT, GLM, Baichuan, Falcon and Gemma models for the LLM class.
  • Supported FP8 OOTB MoE.
  • Supported Starcoder2 SmoothQuant. (smoothquant on starcoder2 #1886)
  • Supported ReDrafter Speculative Decoding, see “ReDrafter” section in docs/source/speculative_decoding.md.
  • Supported padding removal for BERT, thanks to the contribution from @Altair-Alpha in #1834 (support remove_input_padding for BertForSequenceClassification models).
  • Added in-flight batching support for GLM 10B model.
  • Supported gelu_pytorch_tanh activation function, thanks to the contribution from @ttim in #1897 (Support gelu_pytorch_tanh activation function).
  • Added chunk_length parameter to Whisper, thanks to the contribution from @MahmoudAshraf97 in #1909 (add chunk_length parameter to Whisper).
  • Added concurrency argument for gptManagerBenchmark.
  • Executor API supports requests with different beam widths, see docs/source/executor.md#sending-requests-with-different-beam-widths.
  • Added the flag --fast_build to trtllm-build command (experimental).

API Changes

  • [BREAKING CHANGE] max_output_len is removed from the trtllm-build command. To limit sequence length at the engine build stage, specify max_seq_len.
  • [BREAKING CHANGE] The use_custom_all_reduce argument is removed from trtllm-build.
  • [BREAKING CHANGE] The multi_block_mode argument is moved from build stage (trtllm-build and builder API) to the runtime.
  • [BREAKING CHANGE] The build time argument context_fmha_fp32_acc is moved to runtime for decoder models.
  • [BREAKING CHANGE] The arguments tp_size, pp_size and cp_size are removed from the trtllm-build command.
  • The C++ batch manager API is deprecated in favor of the C++ executor API, and it will be removed in a future release of TensorRT-LLM.
  • Added a version API to the C++ library; a cpp/include/tensorrt_llm/executor/version.h file will be generated.
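
For anyone updating build scripts, the breaking changes above can be checked mechanically. The following is a minimal sketch, not part of TensorRT-LLM: the helper name `find_stale_build_args` and the table `REMOVED_BUILD_ARGS` are hypothetical, and it assumes the arguments are passed to trtllm-build as `--name value` or `--name=value`. The advice strings only restate the notes above.

```python
# Hypothetical migration helper; not part of TensorRT-LLM.
# Assumption: each argument named in the notes is spelled "--<name>" on the CLI.
REMOVED_BUILD_ARGS = {
    "--max_output_len": "removed; specify max_seq_len to limit sequence length at build time",
    "--use_custom_all_reduce": "removed from trtllm-build",
    "--multi_block_mode": "moved from the build stage to the runtime",
    "--context_fmha_fp32_acc": "moved to the runtime for decoder models",
    "--tp_size": "removed from trtllm-build",
    "--pp_size": "removed from trtllm-build",
    "--cp_size": "removed from trtllm-build",
}


def find_stale_build_args(argv):
    """Return (flag, advice) pairs for arguments affected by the v0.12 changes."""
    findings = []
    for token in argv:
        # Accept both "--flag value" and "--flag=value" spellings.
        flag = token.split("=", 1)[0]
        if flag in REMOVED_BUILD_ARGS:
            findings.append((flag, REMOVED_BUILD_ARGS[flag]))
    return findings


if __name__ == "__main__":
    # Illustrative pre-0.12 command line; the paths and values are made up.
    old_command = [
        "trtllm-build", "--checkpoint_dir", "./ckpt",
        "--max_output_len", "512", "--tp_size=2",
    ]
    for flag, advice in find_stale_build_args(old_command):
        print(f"{flag}: {advice}")
```

The helper only reports affected arguments and does not rewrite the command, since the notes do not state how each removed value should be carried over.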

Model Updates

  • Supported LLaMA 3.1 model.
  • Supported Mamba-2 model.
  • Supported EXAONE model, see examples/exaone/README.md.
  • Supported Qwen 2 model.
  • Supported GLM4 models, see examples/chatglm/README.md.
  • Added LLaVa-1.6 (LLaVa-NeXT) multimodal support, see “LLaVA, LLaVa-NeXT and VILA” section in examples/multimodal/README.md.

Fixed Issues

Infrastructure Changes

  • Base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:24.07-py3.
  • Base Docker image for TensorRT-LLM Backend is updated to nvcr.io/nvidia/tritonserver:24.07-py3.
  • The dependent TensorRT version is updated to 10.3.0.
  • The dependent CUDA version is updated to 12.5.1.
  • The dependent PyTorch version is updated to 2.4.0.
  • The dependent ModelOpt version is updated to v0.15.0.
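
The dependency pins above can be compared against an environment with a few lines of Python. This is a minimal sketch under stated assumptions: the names `EXPECTED_VERSIONS` and `version_mismatches` are hypothetical, and the caller supplies the detected versions as plain strings, because how each component reports its version is not covered in these notes.

```python
# Pins copied from the "Infrastructure Changes" list above.
EXPECTED_VERSIONS = {
    "TensorRT": "10.3.0",
    "CUDA": "12.5.1",
    "PyTorch": "2.4.0",
    "ModelOpt": "0.15.0",
}


def _normalize(version):
    """Drop a leading "v" so "v0.15.0" and "0.15.0" compare equal."""
    return version[1:] if version.startswith("v") else version


def version_mismatches(found, expected=EXPECTED_VERSIONS):
    """Return {name: (expected, found)} for each missing or mismatched component."""
    mismatches = {}
    for name, want in expected.items():
        have = found.get(name)
        # Prefix match so build suffixes (for example "2.4.0+local") still pass.
        # This is deliberately loose; tighten it if exact pins matter.
        if have is None or not _normalize(have).startswith(want):
            mismatches[name] = (want, have)
    return mismatches


if __name__ == "__main__":
    # Illustrative input; in practice these strings come from the environment.
    detected = {"TensorRT": "10.2.0", "CUDA": "12.5.1", "PyTorch": "2.4.0"}
    for name, (want, have) in version_mismatches(detected).items():
        print(f"{name}: expected {want}, found {have}")
```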

Known Issues

  • On Windows, installation of TensorRT-LLM may succeed, but you might hit OSError: exception: access violation reading 0x0000000000000000 when importing the library in Python. See Installing on Windows for workarounds.

@Shixiaowei02 Shixiaowei02 requested a review from byshiue August 29, 2024 09:13
Collaborator

@byshiue byshiue left a comment

LGTM

@Shixiaowei02 Shixiaowei02 merged commit bfc50a7 into rel Aug 29, 2024
@Shixiaowei02 Shixiaowei02 deleted the preview/rel branch August 29, 2024 09:25

3 participants