
Qwen2_VL profiling: TRT model low performance #2551

Open

nzarif opened this issue Dec 9, 2024 · 1 comment

nzarif commented Dec 9, 2024

Hi,

Thank you for adding support for the Qwen2_vl model. I am using TRT-LLM version 0.16.0.dev2024112600 from PyPI. I followed these exact steps to build the TRT engine files and run the TRT model for Qwen2-VL-2B-Instruct. I used the run.py script like this for profiling:

python3 run.py \
        --hf_model_dir tmp/hf_models/${MODEL_NAME} \
        --visual_engine_dir tmp/trt_engines/${MODEL_NAME}/vision_encoder \
        --llm_engine_dir tmp/trt_engines/${MODEL_NAME}/fp16/1-gpu/ \
        --run_profiling  --profiling_iterations 100

These are the results I see:

Latencies per batch (msec)
TRT vision encoder: 139.5
TRTLLM LLM generate: 532.6
Multimodal generate: 763.8

This means the throughput is around 1.3 samples/sec (1000 / 763.8 ms per batch). Meanwhile, when I profile the Qwen2-VL-2B-Instruct I downloaded from Hugging Face using PyTorch, I get these profiling results:

batch_size 1: throughput=5.38, latency=0.19
batch_size 2: throughput=4.42, latency=0.45
batch_size 4: throughput=2.94, latency=1.36
batch_size 8: throughput=1.98, latency=4.04
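The PyTorch numbers above come from a simple timing loop along these lines (a minimal sketch, not my exact harness: the prompt/image batching, max_new_tokens, and iteration count are assumptions; throughput here is batch_size / latency, which matches the figures reported):

import time

import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="cuda"
)
processor = AutoProcessor.from_pretrained(model_id)

def profile(prompts, images, batch_size, iterations=100):
    # Prompts are assumed to already contain the image placeholder tokens
    # produced by processor.apply_chat_template().
    inputs = processor(
        text=prompts[:batch_size], images=images[:batch_size],
        return_tensors="pt", padding=True,
    ).to("cuda")
    model.generate(**inputs, max_new_tokens=32)  # warmup run, excluded from timing
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iterations):
        model.generate(**inputs, max_new_tokens=32)
    torch.cuda.synchronize()
    latency = (time.perf_counter() - start) / iterations
    # throughput in samples/sec, latency in sec per batch
    return batch_size / latency, latency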

If I load the PyTorch model with "attn_implementation": "flash_attention_2" in the args (as suggested by the Qwen developers), the throughput numbers are even better:

batch_size 1: throughput=7.24, latency=0.14
batch_size 2: throughput=7.23, latency=0.28
batch_size 4: throughput=7.85, latency=0.51
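For reference, this is roughly how the FlashAttention-2 variant is loaded (dtype and device_map shown are my assumptions; transformers supports attn_implementation="flash_attention_2" for Qwen2-VL, provided the flash-attn package is installed and the weights are in fp16/bf16):

import torch
from transformers import Qwen2VLForConditionalGeneration

# FlashAttention-2 requires the flash-attn package and fp16/bf16 weights.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="cuda",
)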

I am aware that the qwen2_vl implementation used for TRT-LLM profiling uses a batch size of 1, and that larger batch sizes may translate to higher throughput. Yet for the same batch size there is a huge gap between the throughput of the PyTorch model and the TRT model. I expected the TRT model to be faster, but it shows considerably higher latency. Can you please look into this?

PS: I am using a machine with an NVIDIA RTX 4000 Ada GPU and Ubuntu 22.04.

sunnyqgg (Collaborator) commented:

Hi @nzarif, this is fixed; please use the latest code on the main branch.
