- Support Jais, see `examples/jais/README.md`.
- Support DiT, see `examples/dit/README.md`.
- Support Video NeVA, see the `Video NeVA` section in `examples/multimodal/README.md`.
- Support `distil-whisper/distil-large-v3`, thanks to the contribution from @IbrahimAmin1 in [feat]: Add Option to convert and run distil-whisper large-v3 #1337.
- Migrated Whisper to the unified workflow (`trtllm-build` command), see documents: `examples/whisper/README.md`.
- Renamed `free_gpu_memory_fraction` in `ModelRunnerCpp` to `kv_cache_free_gpu_memory_fraction`.
- Added more parameters to `ModelRunnerCpp`, including `max_tokens_in_paged_kv_cache`, `kv_cache_enable_block_reuse` and `enable_chunked_context`.
- Removed `enable_executor` from the `tensorrt_llm.LLM` API, as it is using the C++ `Executor` API now.
- Added support for `OutputConfig` in the `generate` API.
- Added `BuildConfig` to the `tensorrt_llm.LLM` API.
- In the `LLM` construction phase, removed most of the trivial logs.
- Added `SpeculativeDecodingMode.h` to choose between different speculative decoding techniques.
- Added `SpeculativeDecodingModule.h`, a base class for speculative decoding techniques.
- Removed `decodingMode.h`.
- Base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:24.04-py3`.
- Base Docker image for the TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:24.04-py3`.