Currently, model definitions in trt-llm are mainly built manually through TensorRT's API or plugins. While this provides flexibility, an optional tracing-based (mainly ONNX) path could help users who are less familiar with the trt-llm code structure, or who work with non-generic model architectures.
To illustrate, take llama2 as an example: the model can be roughly divided into two parts, batchful and batchless. The model parameters live mainly in the batchful part, whereas the batchless part consists of positional encoding and the parameter-free attention computation. By treating attention as an independent ONNX custom op, it is possible to export a standalone, batch-able ONNX graph. The Attention node can then be implemented as a TensorRT plugin, which essentially forwards the batchless part to a dedicated server. Scheduling (continuous batching / paged attention / vAttention) can thus be performed independently of the TensorRT side.
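For concreteness, below is a minimal sketch of how attention could be kept as a single opaque node during ONNX export using PyTorch's symbolic mechanism. The names `AttentionAsPlugin` and `trt_plugin::Attention` are assumptions for illustration only, not an existing trt-llm or TensorRT API:

```python
import torch

class AttentionAsPlugin(torch.autograd.Function):
    """Hypothetical wrapper: exported as ONE custom ONNX node instead of being
    decomposed, so a TensorRT plugin can later claim it."""

    @staticmethod
    def forward(ctx, q, k, v):
        # Reference math, only used when running the module eagerly.
        scale = q.shape[-1] ** -0.5
        return torch.softmax(q @ k.transpose(-1, -2) * scale, dim=-1) @ v

    @staticmethod
    def symbolic(g, q, k, v):
        # During torch.onnx.export this emits a single node in a custom
        # domain; the domain/op name "trt_plugin::Attention" is assumed here.
        return g.op("trt_plugin::Attention", q, k, v)


def attention(q, k, v):
    return AttentionAsPlugin.apply(q, k, v)
```

Exporting the batchful part with `torch.onnx.export` then yields a graph whose only batchless piece is this single Attention node, which the TensorRT plugin can redirect to the dedicated attention/scheduling server.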
This involves a substantial amount of work, which is beyond our capacity but might be worth consideration by the TRT-LLM community.
PS: Preliminary benchmarks indicate that performance comparable to vllm can be achieved at low sequence-length complexity, even with pure TensorRT and a not-yet-well-optimized configuration:
NVIDIA A10/llama2-7b fp16:
| Option | vllm (mean, ms) | trt+schedule (mean, ms) |
| --- | --- | --- |
| qps=2, num_samples=50 | TTFT 107, TPOT 44 | TTFT 121, TPOT 48 |
| qps=2, num_samples=500 | - | OOM |
This was achieved by utilizing a) continuous batching, b) stream-ordered memory reuse, c) a batchful model using 5 IOptimizationProfiles with shared memory and non-overlapping ranges, and d) four parallel batchless IOptimizationProfile instances.
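As a rough illustration of item c), here is a minimal sketch of building a TensorRT engine with several optimization profiles covering disjoint batch ranges. The ONNX file name, input tensor name, and batch ranges are assumptions, not the configuration actually benchmarked:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("llama2_batchful.onnx", "rb") as f:  # assumed file name
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
# One profile per non-overlapping batch range; at runtime each profile is
# bound to its own IExecutionContext, and the contexts can share device memory.
for lo, hi in [(1, 4), (5, 8), (9, 16), (17, 32), (33, 64)]:  # assumed ranges
    profile = builder.create_optimization_profile()
    profile.set_shape("input_ids",  # assumed input tensor name
                      min=(lo, 1), opt=(hi, 512), max=(hi, 2048))
    config.add_optimization_profile(profile)

engine_bytes = builder.build_serialized_network(network, config)
```

At run time, profiles can be selected without stalling via `IExecutionContext.set_optimization_profile_async`, and activation memory can be obtained through stream-ordered allocation (e.g. cudaMallocAsync) to approximate the reuse behavior in b).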