Currently, model definitions in trt-llm are mainly built manually through TensorRT's API or plugins. While this provides flexibility, an optional tracing-based (mainly ONNX) path could help users who are less familiar with the trt-llm code structure, or who work with non-generic model architectures.
To illustrate, take llama2 as an example: the model can be roughly divided into two parts, batchful and batchless. The model parameters live mainly in the batchful part, whereas the batchless part consists of positional encoding and the parameter-free attention computation. By treating attention as an independent ONNX custom op, it is possible to export a standalone, batch-able ONNX graph. The Attention node can then be implemented as a TensorRT plugin, which essentially forwards the batchless part to a dedicated server. Scheduling (continuous batching / paged attention / vAttention) can thus be performed independently of the TensorRT side.
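For concreteness, below is a minimal sketch of how attention could be kept as a single opaque node during ONNX export using PyTorch's symbolic mechanism. The names `AttentionAsPlugin` and `trt_plugin::Attention` are assumptions for illustration only, not an existing trt-llm or TensorRT API:

```python
import torch

class AttentionAsPlugin(torch.autograd.Function):
    """Hypothetical wrapper: exported as ONE custom ONNX node instead of being
    decomposed, so a TensorRT plugin can later claim it."""

    @staticmethod
    def forward(ctx, q, k, v):
        # Reference math, only used when running the module eagerly.
        scale = q.shape[-1] ** -0.5
        return torch.softmax(q @ k.transpose(-1, -2) * scale, dim=-1) @ v

    @staticmethod
    def symbolic(g, q, k, v):
        # During torch.onnx.export this emits a single node in a custom
        # domain; the domain/op name "trt_plugin::Attention" is assumed here.
        return g.op("trt_plugin::Attention", q, k, v)


def attention(q, k, v):
    return AttentionAsPlugin.apply(q, k, v)
```

Exporting the batchful part with `torch.onnx.export` then yields a graph whose only batchless piece is this single Attention node, which the TensorRT plugin can redirect to the dedicated attention/scheduling server.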
This involves a substantial amount of work, which is beyond our capacity but might be worth consideration by the TRT-LLM community.
PS: Preliminary benchmarks indicate that performance comparable to vllm can be achieved at low sequence-length complexity, even with pure TensorRT and a not-yet-well-optimized configuration:
NVIDIA A10/llama2-7b fp16:
| Option | vllm (mean, ms) | trt+schedule (mean, ms) |
| --- | --- | --- |
| qps=2, num_samples=50 | TTFT 107, TPOT 44 | TTFT 121, TPOT 48 |
| qps=2, num_samples=500 | - | OOM |
This was achieved by utilizing a) continuous batching, b) stream-ordered memory reuse, c) a batchful model using 5 IOptimizationProfiles with shared memory and non-overlapping ranges, and d) four parallel batchless IOptimizationProfile instances.
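As a rough illustration of item c), here is a minimal sketch of building a TensorRT engine with several optimization profiles covering disjoint batch ranges. The ONNX file name, input tensor name, and batch ranges are assumptions, not the configuration actually benchmarked:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("llama2_batchful.onnx", "rb") as f:  # assumed file name
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
# One profile per non-overlapping batch range; at runtime each profile is
# bound to its own IExecutionContext, and the contexts can share device memory.
for lo, hi in [(1, 4), (5, 8), (9, 16), (17, 32), (33, 64)]:  # assumed ranges
    profile = builder.create_optimization_profile()
    profile.set_shape("input_ids",  # assumed input tensor name
                      min=(lo, 1), opt=(hi, 512), max=(hi, 2048))
    config.add_optimization_profile(profile)

engine_bytes = builder.build_serialized_network(network, config)
```

At run time, profiles can be selected without stalling via `IExecutionContext.set_optimization_profile_async`, and activation memory can be obtained through stream-ordered allocation (e.g. cudaMallocAsync) to approximate the reuse behavior in b).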