🏠 LMMs-Lab Homepage | discord/lmms-eval | 🎓 Project Page | 📝 arXiv Paper | 🤗 Dataset
- [2025-2] 🎉🎉 We have updated the leaderboard with results for Qwen-2.5-VL-72B and mPLUG-Owl3-7B.
- [2025-1] 🎉🎉 We introduce VideoMMMU, a massive, multi-modal, multi-disciplinary video benchmark that evaluates how well models acquire knowledge from educational videos.
VideoMMMU is intended for academic research only; commercial use in any form is prohibited. The copyright of all videos belongs to their owners. Without prior approval, you may not distribute, publish, copy, disseminate, or modify VideoMMMU in whole or in part, and you must strictly comply with these restrictions. For further inquiries, please send an email to [email protected].
The evaluation of VideoMMMU is integrated into LMMs-Eval. Detailed instructions for running the evaluation are given below.
For standard usage, install the package from PyPI:
pip install lmms-eval
For development, install the package by cloning the repository and running the following commands:
git clone https://github.com/EvolvingLMMs-Lab/lmms-eval
cd lmms-eval
pip install -e .
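
You can optionally sanity-check the installation by listing the registered tasks and looking for the video_mmmu entries. This is only a sketch: the `--tasks list` flag is available in recent lmms-eval releases, so adjust the command if your version differs.

```bash
# Optional sanity check: the VideoMMMU tasks should appear in the task registry.
python3 -m lmms_eval --tasks list 2>&1 | grep video_mmmu
```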
If you want to evaluate LLaVA models, you will also need to clone the LLaVA-NeXT repository and install it:
git clone https://github.com/LLaVA-VL/LLaVA-NeXT
cd LLaVA-NeXT
pip install -e .
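
A quick, optional check that the editable install worked, assuming the package is importable as `llava` (which is how LLaVA-NeXT ships it):

```bash
# Should print the path of the freshly installed llava package.
python3 -c "import llava; print(llava.__file__)"
```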
We use LLaVA-OneVision-7B as an example in the following commands. You can change --model and --model_args to match the model you want to evaluate; an illustrative alternative is sketched after the command below.
Evaluation of LLaVA-OneVision on VideoMMMU (all 3 tracks)
accelerate launch --num_processes=1 --main_process_port 12345 -m lmms_eval \
--model llava_onevision \
--model_args pretrained=lmms-lab/llava-onevision-qwen2-7b-ov,conv_template=qwen_1_5,model_name=llava_qwen,max_frames_num=32,torch_dtype=bfloat16 \
--tasks video_mmmu \
--batch_size 1 \
--log_samples \
--log_samples_suffix debug \
--output_path ./logs/
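
As an illustrative sketch only, the same tasks can be run with a different backbone such as LLaVA-Video-7B. The model name and model_args below are assumptions drawn from other lmms-eval examples and may differ in your version; check the models registered under lmms_eval/models in your installation for the exact names and arguments.

```bash
# Illustrative: swap --model and --model_args to evaluate another model.
accelerate launch --num_processes=1 --main_process_port 12345 -m lmms_eval \
    --model llava_vid \
    --model_args pretrained=lmms-lab/LLaVA-Video-7B-Qwen2,conv_template=qwen_1_5,max_frames_num=32 \
    --tasks video_mmmu \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix debug \
    --output_path ./logs/
```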
Evaluate a single track of VideoMMMU
Perception track:
accelerate launch --num_processes=1 --main_process_port 12345 -m lmms_eval \
--model llava_onevision \
--model_args pretrained=lmms-lab/llava-onevision-qwen2-7b-ov,conv_template=qwen_1_5,model_name=llava_qwen,max_frames_num=32,torch_dtype=bfloat16 \
--tasks video_mmmu_perception \
--batch_size 1 \
--log_samples \
--log_samples_suffix debug \
--output_path ./logs/
Comprehension track:
accelerate launch --num_processes=1 --main_process_port 12345 -m lmms_eval \
--model llava_onevision \
--model_args pretrained=lmms-lab/llava-onevision-qwen2-7b-ov,conv_template=qwen_1_5,model_name=llava_qwen,max_frames_num=32,torch_dtype=bfloat16 \
--tasks video_mmmu_comprehension \
--batch_size 1 \
--log_samples \
--log_samples_suffix debug \
--output_path ./logs/
Adaptation track:
accelerate launch --num_processes=1 --main_process_port 12345 -m lmms_eval \
--model llava_onevision \
--model_args pretrained=lmms-lab/llava-onevision-qwen2-7b-ov,conv_template=qwen_1_5,model_name=llava_qwen,max_frames_num=32,torch_dtype=bfloat16 \
--tasks video_mmmu_adaptation \
--batch_size 1 \
--log_samples \
--log_samples_suffix debug \
--output_path ./logs/
Evaluate the question_only track of VideoMMMU (Knowledge Acquisition Experiment, Δknowledge)
The "question_only" track consists of a 2-second video that contains only the image associated with the Adaptation Track question.
To evaluate this setting, you can use the following command:
accelerate launch --num_processes=1 --main_process_port 12345 -m lmms_eval \
--model llava_onevision \
--model_args pretrained=lmms-lab/llava-onevision-qwen2-7b-ov,conv_template=qwen_1_5,model_name=llava_qwen,max_frames_num=1,torch_dtype=bfloat16 \
--tasks video_mmmu_adaptation_question_only \
--batch_size 1 \
--log_samples \
--log_samples_suffix debug \
--output_path ./logs/
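
Δknowledge reported in the leaderboard below is the normalized gain between the adaptation score obtained with the video and the question_only score obtained without it. A minimal sketch of that computation, assuming the normalized-gain formula from the paper and using placeholder scores:

```bash
# Δknowledge = (acc_with_video - acc_question_only) / (100 - acc_question_only) * 100
# Both inputs are placeholders; take them from your own video_mmmu_adaptation
# and video_mmmu_adaptation_question_only runs.
acc_with_video=55.67
acc_question_only=47.50
awk -v a="$acc_with_video" -v b="$acc_question_only" \
    'BEGIN { printf "Δknowledge = %+.1f%%\n", (a - b) / (100 - b) * 100 }'
```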
Adaptation Track setting
To ensure compatibility with LMMs-Eval, the image associated with the Adaptation Track question has been appended as the last frame of the video, and a prompt notifies the model that the question image is located in that final frame.
If you use an interleaved setting instead, you can manually insert the image (either the last frame of the video or "image 1" from the HF dataset) into the placeholder `<image 1>`; see the sketch below.
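
A minimal sketch for grabbing that final frame with ffmpeg so it can be supplied as image 1 in an interleaved prompt (file names are placeholders):

```bash
# Seek to one second before the end and keep overwriting the output image,
# so the last decoded frame is what remains on disk.
ffmpeg -sseof -1 -i adaptation_clip.mp4 -update 1 -q:v 1 question_image.jpg
```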
We evaluate various open-source and proprietary LMMs. The table below provides a detailed comparison. To submit your model results, please send an email to [email protected].
Model | Overall | Perception | Comprehension | Adaptation | Δknowledge |
---|---|---|---|---|---|
Human Expert | 74.44 | 84.33 | 78.67 | 60.33 | +33.1 |
Claude-3.5-Sonnet | 65.78 | 72.00 | 69.67 | 55.67 | +11.4 |
GPT-4o | 61.22 | 66.00 | 62.00 | 55.67 | +15.6 |
Qwen-2.5-VL-72B | 60.22 | 69.33 | 61.00 | 50.33 | +9.7 |
Gemini 1.5 Pro | 53.89 | 59.00 | 53.33 | 49.33 | +8.7 |
Aria | 50.78 | 65.67 | 46.67 | 40.00 | +3.2 |
Gemini 1.5 Flash | 49.78 | 57.33 | 49.00 | 43.00 | -3.3 |
LLaVA-Video-72B | 49.67 | 59.67 | 46.00 | 43.33 | +7.1 |
LLaVA-OneVision-72B | 48.33 | 59.67 | 42.33 | 43.00 | +6.6 |
Qwen-2.5-VL-7B | 47.44 | 58.33 | 44.33 | 39.67 | +2.2 |
mPLUG-Owl3-7B | 42.00 | 49.33 | 38.67 | 38.00 | +7.5 |
MAmmoTH-VL-8B | 41.78 | 51.67 | 40.00 | 33.67 | +1.5 |
InternVL2-8B | 37.44 | 47.33 | 33.33 | 31.67 | -8.5 |
LLaVA-Video-7B | 36.11 | 41.67 | 33.33 | 33.33 | -5.3 |
VILA1.5-40B | 34.00 | 38.67 | 30.67 | 32.67 | +9.4 |
LLaVA-OneVision-7B | 33.89 | 40.00 | 31.00 | 30.67 | -5.6 |
Llama-3.2-11B | 30.00 | 35.67 | 32.33 | 22.00 | - |
LongVA-7B | 23.98 | 24.00 | 24.33 | 23.67 | -7.0 |
VILA1.5-8B | 20.89 | 20.33 | 17.33 | 25.00 | +5.9 |
@article{hu2025videommmu,
  title={Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos},
  author={Kairui Hu and Penghao Wu and Fanyi Pu and Wang Xiao and Yuanhan Zhang and Xiang Yue and Bo Li and Ziwei Liu},
  journal={arXiv preprint arXiv:2501.13826},
  year={2025},
  url={https://arxiv.org/abs/2501.13826}
}