Update TensorRT-LLM #1725

kaiyux · 2024-06-04T12:01:28Z

Model Support
- Support Grok-1, see examples/grok/README.md.
- Support Qwen1.5-110B with FP8 PTQ.
- Support Phi-3 small model with block sparse attention.
- Support InternLM2 7B/20B, thanks to the contribution from @RunningLeon in Support internlm2 #1392.
Features
- Add a fused GEMM-SwiGLU plugin for FP8 on SM90.
- Support automatic download of HuggingFace models for the Python high-level API.
- Support running FP8 LLaMA with FP16 LoRA checkpoints.
- Support very long context for LLaMA (see “Long context evaluation” section in examples/llama/README.md).
- Support encoder-decoder models in the C++ runtime with paged KV cache and inflight batching (enc-dec triton backend support #800).
Bug fixes
- Fix qkv_bias shape issue for Qwen1.5-32B (when converting the qwen 110b gptq checkpoint, the shape of qkv_bias is not divisible by 3 #1589), thanks to the contribution from @Tlntin in fix up qkv.bias error when use qwen1.5-32b-gptq-int4 #1637.
- Fix the error of Ada traits for fpA_intB, thanks to the contribution from @JamesTheZ in Fix the error of Ada traits for fpA_intB. #1583.
- Update examples/qwenvl/requirements.txt, thanks to the contribution from @ngoanpv in Update requirements.txt #1248.
- Fix rsLoRA scaling in lora_manager, thanks to the contribution from @TheCodeWrangler in Fixed rslora scaling in lora_manager #1669.
- Fix Qwen1.5 checkpoint conversion failure (convert_checkpoint qwen1.5 error #1675).
- Fix Medusa safetensors and AWQ conversion, thanks to the contribution from @Tushar-ml in Loading Medusa Safetensors + AWQ Conversion correction #1535.
Documentation
- Add HuggingFace model zoo from the community, thanks to the contribution from @matichon-vultureprime in Add Huggingface model zoo from community #1674.
- Fix benchmarks/cpp/README.md to address two issues: "gptManagerBenchmark seems to go into a dead loop with GPU usage 0%" (#1562) and "Cannot process new request: [TensorRT-LLM][ERROR] Assertion failed: LoRA task 0 not found in cache. Please send LoRA weights with request (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp:182)" (#1552).
- Fix dead links, thanks to the help from @DefTruth, @buvnswrn and @sunjiabin17 in [Docs] Fixed inference-request.md dead link (triton-inference-server/tensorrtllm_backend#478), Fixed README.md for broken links (triton-inference-server/tensorrtllm_backend#482), and FIX link reference in README.md (triton-inference-server/tensorrtllm_backend#449).

open source 736b3fc4259916d31211104b91e6b2b4db995b17

c57e8ec

Shixiaowei02 approved these changes Jun 4, 2024

kaiyux merged commit b777bd6 into main Jun 4, 2024

kaiyux deleted the preview/main branch June 4, 2024 12:26

symphonylyh mentioned this pull request Jun 4, 2024

enc-dec triton backend support #800

Closed

DreamGenX mentioned this pull request Jun 16, 2024

Repeated outputs for long input tasks on Llama 3 70B compared to vLLM and HF's transformers #1788

Closed
