
Qwen2_VL profiling: TRT model low performance #2551

Open

nzarif opened this issue Dec 9, 2024 · 1 comment

nzarif commented Dec 9, 2024

Hi,

Thank you for adding support for the Qwen2_vl model. I am using TRT-LLM version 0.16.0.dev2024112600 from PyPI. I followed these exact steps to build the TRT engine files and run the TRT model for Qwen2-VL-2B-Instruct. I used the run.py script like this for profiling:

python3 run.py \
        --hf_model_dir tmp/hf_models/${MODEL_NAME} \
        --visual_engine_dir tmp/trt_engines/${MODEL_NAME}/vision_encoder \
        --llm_engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/ \
        --run_profiling  --profiling_iterations 100

These are the results I see:

Latencies per batch (msec)
TRT vision encoder: 139.5
TRTLLM LLM generate: 532.6
Multimodal generate: 763.8

This means the throughput is around 1.3 samples/sec (1000 / 763.8 ms per batch). Meanwhile, when I profile the Qwen2-VL-2B-Instruct I downloaded from Hugging Face using PyTorch, I get these profiling results:

batch_size 1: throughput=5.38, latency=0.19
batch_size 2: throughput=4.42, latency=0.45
batch_size 4: throughput=2.94, latency=1.36
batch_size 8: throughput=1.98, latency=4.04
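The PyTorch numbers above come from a simple timing loop along these lines (a minimal sketch, not my exact harness: the prompt/image batching, max_new_tokens, and iteration count are assumptions; throughput here is batch_size / latency, which matches the figures reported):

import time

import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="cuda"
)
processor = AutoProcessor.from_pretrained(model_id)

def profile(prompts, images, batch_size, iterations=100):
    # Prompts are assumed to already contain the image placeholder tokens
    # produced by processor.apply_chat_template().
    inputs = processor(
        text=prompts[:batch_size], images=images[:batch_size],
        return_tensors="pt", padding=True,
    ).to("cuda")
    model.generate(**inputs, max_new_tokens=32)  # warmup run, excluded from timing
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iterations):
        model.generate(**inputs, max_new_tokens=32)
    torch.cuda.synchronize()
    latency = (time.perf_counter() - start) / iterations
    # throughput in samples/sec, latency in sec per batch
    return batch_size / latency, latency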

If I load the PyTorch model with "attn_implementation": "flash_attention_2" in the args (as suggested by the Qwen developers), the throughput numbers are even better:

batch_size 1: throughput=7.24, latency=0.14
batch_size 2: throughput=7.23, latency=0.28
batch_size 4: throughput=7.85, latency=0.51
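For reference, this is roughly how the FlashAttention-2 variant is loaded (dtype and device_map shown are my assumptions; transformers supports attn_implementation="flash_attention_2" for Qwen2-VL, provided the flash-attn package is installed and the weights are in fp16/bf16):

import torch
from transformers import Qwen2VLForConditionalGeneration

# FlashAttention-2 requires the flash-attn package and fp16/bf16 weights.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="cuda",
)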

I am aware that the qwen2_vl implementation used for TRT-LLM profiling uses a batch size of 1, and that larger batch sizes may translate to higher throughput. Yet for the same batch size there is a huge gap between the throughput of the PyTorch model and the TRT model. I expected the TRT model to be faster, but it shows considerably higher latency. Can you please look into this?

PS: I am using a machine with an NVIDIA RTX 4000 Ada GPU and Ubuntu 22.04.

sunnyqgg (Collaborator) commented:

Hi @nzarif, this is fixed; please use the latest code on the main branch.
