Hi,

Thank you for adding support for the Qwen2-VL model. I am using TRT-LLM version 0.16.0.dev2024112600 from PyPI. I followed these exact steps to build the TRT engine files and run the TRT model for Qwen2-VL-2B-Instruct, and I used the run.py script to do the profiling.
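The invocation follows the multimodal example README, roughly as below (the `tmp/...` directories are placeholders for wherever the engines were built):

```bash
python3 run.py \
    --hf_model_dir tmp/hf_models/Qwen2-VL-2B-Instruct \
    --visual_engine_dir tmp/trt_engines/Qwen2-VL-2B-Instruct/vision_encoder \
    --llm_engine_dir tmp/trt_engines/Qwen2-VL-2B-Instruct/fp16/1-gpu \
    --run_profiling
```

These are the results I see: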
```
Latencies per batch (msec)
TRT vision encoder: 139.5
TRTLLM LLM generate: 532.6
Multimodal generate: 763.8
```
This means the end-to-end throughput is around 1.3 generations per second (1000 ms / 763.8 ms ≈ 1.31). Meanwhile, when I profile the Qwen2-VL-2B-Instruct model I downloaded from Hugging Face using PyTorch, I get these profiling results:
If I load the PyTorch model with `"attn_implementation": "flash_attention_2"` in the arguments (as suggested by the Qwen developers), the throughput numbers are even better:
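For reference, this is the gist of my PyTorch profiling setup (a minimal sketch: the loading call follows the Qwen2-VL model card, while the dummy image and `max_new_tokens=30` are placeholders for my actual test input):

```python
import time
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

MODEL_ID = "Qwen/Qwen2-VL-2B-Instruct"

# Load the HF checkpoint with FlashAttention-2 enabled (FA2 requires fp16/bf16).
model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# A dummy image stands in for the real test input here.
image = Image.new("RGB", (448, 448))
conversation = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]}
]
prompt = processor.apply_chat_template(
    conversation, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

# Warm up once so compilation/allocation overhead is not counted.
model.generate(**inputs, max_new_tokens=30)

# Time generate() with the usual CUDA-synchronize pattern.
torch.cuda.synchronize()
start = time.perf_counter()
model.generate(**inputs, max_new_tokens=30)
torch.cuda.synchronize()
print(f"generate latency: {(time.perf_counter() - start) * 1e3:.1f} ms")
```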
I am aware that the qwen2_vl implementation used for the TRT-LLM profiling runs with a batch size of 1, and that larger batch sizes may translate to higher throughput. Yet for the same batch size there is a huge gap between the throughput of the PyTorch model and the TRT model. I expected the TRT model to be faster, but it shows considerably higher latency. Can you please look into this?
PS: I am using a machine with an NVIDIA RTX 4000 Ada GPU and Ubuntu 22.04.