
[Feature Request] optional tracing(onnx etc.) based solution inside trt-llm #2519

Closed
tp-nan opened this issue Dec 2, 2024 · 1 comment
Labels: feature request (New feature or request), triaged (Issue has been triaged by maintainers)
tp-nan commented Dec 2, 2024

Currently, the model definition in trt-llm is mainly built manually through TensorRT's API or plugins. While this provides flexibility, an optional tracing-based (mainly ONNX) solution could make trt-llm accessible to users less familiar with its internals, and could support non-generic model structures.

To illustrate, taking llama2 as an example, the model can be roughly divided into two parts: batchful and batchless. The model parameters are mainly located in the batchful part, whereas the batchless part consists of positional encoding and parameter-free attention. By treating Attention as an independent ONNX custom op, it is possible to export a standalone, batchable ONNX graph. The Attention node can then be implemented as a TensorRT plugin that essentially forwards the batchless part to a dedicated server. Subsequently, scheduling (contiguous batching / paged attention / vAttention) can be performed independently of the TensorRT engine.
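The split described above can be sketched in plain Python. This is purely illustrative: all function names are hypothetical stand-ins, the "projection" and "attention" are toy arithmetic, and in the actual proposal the per-sequence call would be a TensorRT plugin dispatching to a dedicated attention server.

```python
# Hypothetical sketch of the batchful/batchless split (names are illustrative).
# The batchful part runs once over a stacked batch; the batchless Attention
# node is dispatched per sequence, so KV-cache scheduling (contiguous batching,
# paged attention) can live outside the TensorRT engine.

def batchful_project(batch):
    """Stand-in for the weight-heavy batchful graph (one batched execution)."""
    # e.g. QKV projection; here just a toy transform
    return [[2 * x for x in seq] for seq in batch]

def batchless_attention(seq, kv_cache):
    """Stand-in for the parameter-free attention node, executed per sequence.

    In the proposal this would be a TensorRT plugin forwarding the call to a
    dedicated server that owns the (paged) KV memory."""
    kv_cache.extend(seq)            # append this step's state to the cache
    total = sum(kv_cache)           # toy "attention" over the full history
    return [x + total for x in seq]

def forward(batch, kv_caches):
    projected = batchful_project(batch)
    # each sequence is handed off independently of the others
    return [batchless_attention(seq, kv)
            for seq, kv in zip(projected, kv_caches)]

caches = [[], []]
print(forward([[1, 2], [3]], caches))  # → [[8, 10], [12]]
```

The point of the separation is that the batchful engine only ever sees dense, batchable tensors, while all per-sequence state lives behind the batchless call.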

This involves a substantial amount of work, which is beyond our capacity, and might be considered by the TRT-LLM community.

PS: Preliminary benchmarks indicate that performance comparable to vllm can be achieved at low sequence lengths, even with pure TensorRT and a not particularly well-optimized configuration:

NVIDIA A10 / llama2-7b fp16 (mean, ms):

| Option | vllm | trt+schedule |
| --- | --- | --- |
| qps=2, num_samples=50 | TTFT 107, TPOT 44 | TTFT 121, TPOT 48 |
| qps=2, num_samples=500 | - | OOM |

This was achieved by utilizing a) contiguous batching, b) stream-ordered memory reuse, c) a batchful model employing 5 IOptimizationProfiles with shared memory and non-overlapping ranges, and d) four parallel batchless IOptimizationProfile instances.
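The contiguous-batching idea in (a) can be sketched as a small scheduler loop: rather than padding a static batch and waiting for every request to finish, a slot freed by a completed sequence is refilled from the queue on the next step. This is a toy model under assumed semantics, not the benchmarked implementation.

```python
from collections import deque

# Toy contiguous-batching loop (illustrative only): finished sequences are
# evicted each decode step and waiting requests are admitted immediately,
# keeping the batch slots full instead of draining a padded static batch.

def run_scheduler(requests, max_batch=2):
    """requests: list of (request_id, tokens_to_generate).
    Returns request ids in completion order."""
    waiting = deque(requests)
    running = {}            # request_id -> remaining tokens to generate
    completed = []
    while waiting or running:
        # admit queued requests into free batch slots
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        # one "decode step" for every running sequence
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]    # slot freed now, refilled next step
                completed.append(rid)
    return completed

print(run_scheduler([("a", 1), ("b", 3), ("c", 2)]))  # → ['a', 'b', 'c']
```

With a static padded batch, request "c" could not start until both "a" and "b" had drained; here it is admitted as soon as "a"'s slot frees up.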

@tp-nan tp-nan changed the title [Feature Request] optional tracing(onnx etc.) based solution [Feature Request] optional tracing(onnx etc.) based solution inside trt-llm Dec 2, 2024
@hello-11 hello-11 added triaged Issue has been triaged by maintainers feature request New feature or request labels Dec 2, 2024
@tp-nan tp-nan closed this as completed Jan 6, 2025