Mllama ignores input image when deployed in triton #692

Open
mutkach opened this issue Feb 5, 2025 · 2 comments

Labels
bug Something isn't working

mutkach commented Feb 5, 2025

System Info

CPU: x86_64
Memory: 128G
GPU: H100 80G
Docker: tritonserver:24.12-trtllm-python-py3
CUDA: 12.6
Driver: 535.216.01
TensorRT: 10.7.0
TensorRT-LLM: v0.16.0

Who can help?

@kaiyux @byshiue

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Steps to reproduce:
Using the scripts for the Mllama build and deployment from multimodal.md, except:

  • use Llama-3.2-11B-Vision-Instruct instead of Llama-3.2-11B-Vision
  • set max_encoder_input_len to 6404 for the Vision-Instruct model, as indicated by the TensorRT-LLM guide
  • set batch size to 1 for testing purposes
  • check out the v0.16.0 tag of TensorRT-LLM (there are discrepancies when converting the checkpoint otherwise)
  • fill in cross_kv_cache_fraction = 0.5 in config.pbtxt (Triton won't start otherwise); see the snippet after the launch command below
  • start Triton manually with the command below
  • load the ensemble model (the e2e setup would not work otherwise)

The Triton launch command is:

tritonserver --model-repository=multimodal_ifb --model-control-mode=explicit --log-verbose=3 --load-model=tensorrt_llm --load-model=multimodal_encoders --load-model=ensemble --load-model=tensorrt_llm_bls --cuda-memory-pool-byte-size=0:300000000
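
For reference, this is a minimal sketch of the cross_kv_cache_fraction entry I added to the tensorrt_llm model's config.pbtxt (a standard Triton string parameter; the exact position in the file may differ):

parameters: {
  key: "cross_kv_cache_fraction"
  value: {
    string_value: "0.5"
  }
}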

Expected behavior

When tested with ...

python3 tensorrt_llm/examples/multimodal/run.py --visual_engine_dir /tmp/mllama/trt_engines/encoder/ \
                                   --visual_engine_name visual_encoder.engine \
                                   --llm_engine_dir /tmp/mllama/trt_engines/decoder/ \
                                   --hf_model_dir Llama-3.2-11B-Vision-Instruct/ \
                                   --image_path https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg \
                                   --input_text "<|image|><|begin_of_text|>If I had to write a haiku for this one" \
                                   --max_new_tokens 50 \
                                   --batch_size 1

output:

", it would be:.\\nA rabbit in a coat.\\nA charming and dapper fellow.\\nHe's a stylish chap indeed. <OCR/> ርርርርርር

Works as expected.

Actual behavior

When run with:

python3 tools/multimodal/client.py --model_type mllama --text "<|image|><|begin_of_text|>If I had to write a haiku for this one" --image https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg --top-p 0.7 --temperature 0.9 --top-k 40 --request-output-len 20

the result is:

[beam 0 ]:
<|image|><|begin_of_text|>If I had to write a haiku for this one, it would be:
“Golden sunsets fade
Gone, yet memories remain
Summer's

When shown a different image or given different runtime parameters, the pipeline similarly ignores the image content (different image -> same output).

Additional notes

I double-checked the tokenization output, verified that the image inputs are sent correctly (image_bytes), and confirmed that encoder_input_features and cross_attention_masks (the multimodal_encoder outputs) are in the same ballpark as (though by no means equal to) the ones produced by tensorrt_llm/examples/multimodal/run.py.

encoder_input_features in triton:

tensor([[[  8.1875,  12.3750,  -4.5938,  ..., -12.1875,  -4.4062,   5.1250],
         [ -1.0625,  13.5000,   7.4375,  ...,  -2.3125,  -3.0625, -13.2500],
         [-12.5000,   7.0625,   8.5625,  ...,   3.1875,  -0.1836,  -8.4375],
         ...,
         [ -3.8906,  -2.5625,  -6.0938,  ...,  -2.2812,  -8.1875,  -3.0312],
         [  2.7031,   7.0938,  -7.6875,  ...,  -8.5625,  -4.4062, -22.2500],
         [  4.2500,   1.2734,   1.5156,  ...,  -1.8359,  -2.5312,   1.5625]]],
       device='cuda:0', dtype=torch.bfloat16)

encoder_input_features in the tensorrt_llm runner:

tensor([[  8.1250,  12.3750,  -4.6875,  ..., -12.1875,  -4.3438,   5.2500],
        [ -1.1328,  13.3125,   7.4062,  ...,  -2.6250,  -2.9531, -13.1875],
        [-12.3750,   6.9688,   8.5625,  ...,   2.9688,  -0.2139,  -8.6250],
        ...,
        [ -5.4375,  -2.8125,  -6.9375,  ...,  -3.4375,  -7.8125,  -3.7969],
        [  1.1641,   6.9062,  -3.5000,  ...,  -3.0625,  -2.9688, -27.2500],
        [  4.6562,   1.3906,   1.6953,  ...,  -1.6484,  -2.9375,   1.3281]],
       device='cuda:0', dtype=torch.bfloat16)
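
To put a number on how far apart these two dumps are, I could use a rough comparison along these lines (feats_triton and feats_runner are hypothetical names for the two tensors above, captured for the same prompt and image):

import torch

def compare_features(feats_triton: torch.Tensor, feats_runner: torch.Tensor) -> None:
    # Flatten both dumps and compare them elementwise; bfloat16 is upcast to
    # float32 so the statistics are not dominated by rounding.
    a = feats_triton.reshape(-1).float()
    b = feats_runner.reshape(-1).float()
    max_abs_diff = (a - b).abs().max().item()
    cosine = torch.nn.functional.cosine_similarity(a, b, dim=0).item()
    print(f"max |a - b| = {max_abs_diff:.4f}, cosine similarity = {cosine:.6f}")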

If that difference is not OK, should I look into it?
Aside from that, the BLS setup is also not working. The LLM itself seems to be working fine and gives correct responses.


mutkach commented Feb 5, 2025

By the way, thanks for the great work! I would appreciate any helpful directions regarding this issue.


mutkach commented Feb 6, 2025

There seems to be a mistake in the multimodal encoder model: the line that sets skip_cross_attn_blocks unconditionally to torch.ones. Compare with the multimodal runner in TRT-LLM, which sets it to ones only if there is no image data. After changing it to torch.zeros, the pipeline seems to be working in my case.
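
For clarity, here is a minimal sketch of the conditional I have in mind (variable names and tensor shape are illustrative, not the exact code in the Triton multimodal encoder model):

import torch

def make_skip_cross_attn_blocks(batch_size: int, has_image: bool, device) -> torch.Tensor:
    # Skip the cross-attention blocks only when the request carries no image,
    # as the TRT-LLM multimodal runner does; returning ones unconditionally
    # makes the decoder ignore the encoder output entirely.
    return torch.full((batch_size, 1), not has_image, dtype=torch.bool, device=device)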
