Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

🏠 LMMs-Lab Homepage | Discord (lmms-eval) | 🎓 Project Page | 📝 arXiv Paper | 🤗 Dataset


Announcement

  • [2025-2] 🎉🎉 We updated the leaderboard with results for Qwen-2.5-VL-72B and mPLUG-Owl3-7B.
  • [2025-1] 🎉🎉 We introduce VideoMMMU, a massive, multi-modal, multi-disciplinary video benchmark that evaluates the ability of models to acquire knowledge from educational videos.

License

VideoMMMU may be used for academic research only; commercial use in any form is prohibited. The copyright of all videos belongs to their respective owners. Without prior approval, you may not distribute, publish, copy, disseminate, or modify VideoMMMU in whole or in part, and you must strictly comply with these restrictions. For further inquiries, please send an email to [email protected].

Evaluation

The evaluation of VideoMMMU is integrated into LMMs-Eval. Detailed instructions for running the evaluation are given below.

Installation

For standard usage, you can install the package from PyPI with the following command:

pip install lmms-eval

For development, you can install the package by cloning the repository and installing it in editable mode:

git clone https://github.com/EvolvingLMMs-Lab/lmms-eval
cd lmms-eval
pip install -e .

If you want to evaluate LLaVA models, you will also need to clone the LLaVA-NeXT repository and install it:

git clone https://github.com/LLaVA-VL/LLaVA-NeXT
cd LLaVA-NeXT
pip install -e .

Running the Evaluation

We use LLaVA-OneVision-7B as an example in the commands below. You can change --model and --model_args to fit your requirements.

Evaluation of LLaVA-OneVision on VideoMMMU (all 3 tracks)

accelerate launch --num_processes=1 --main_process_port 12345 -m lmms_eval \
    --model llava_onevision \
    --model_args pretrained=lmms-lab/llava-onevision-qwen2-7b-ov,conv_template=qwen_1_5,model_name=llava_qwen,max_frames_num=32,torch_dype=bfloat16 \
    --tasks video_mmmu \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix debug \
    --output_path ./logs/

Evaluate a single track of VideoMMMU

Perception track:

accelerate launch --num_processes=1 --main_process_port 12345 -m lmms_eval \
    --model llava_onevision \
    --model_args pretrained=lmms-lab/llava-onevision-qwen2-7b-ov,conv_template=qwen_1_5,model_name=llava_qwen,max_frames_num=32,torch_dype=bfloat16 \
    --tasks video_mmmu_perception \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix debug \
    --output_path ./logs/

Comprehension track:

accelerate launch --num_processes=1 --main_process_port 12345 -m lmms_eval \
    --model llava_onevision \
    --model_args pretrained=lmms-lab/llava-onevision-qwen2-7b-ov,conv_template=qwen_1_5,model_name=llava_qwen,max_frames_num=32,torch_dype=bfloat16 \
    --tasks video_mmmu_comprehension \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix debug \
    --output_path ./logs/

Adaptation track:

accelerate launch --num_processes=1 --main_process_port 12345 -m lmms_eval \
    --model llava_onevision \
    --model_args pretrained=lmms-lab/llava-onevision-qwen2-7b-ov,conv_template=qwen_1_5,model_name=llava_qwen,max_frames_num=32,torch_dype=bfloat16 \
    --tasks video_mmmu_adaptation \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix debug \
    --output_path ./logs/

Evaluate the question_only track of VideoMMMU -- Knowledge Acquisition Experiment (∆knowledge)

The "question_only" track consists of a 2-second video that contains only the image associated with the Adaptation Track question.

To evaluate this setting, you can use the following command:

accelerate launch --num_processes=1 --main_process_port 12345 -m lmms_eval \
    --model llava_onevision \
    --model_args pretrained=lmms-lab/llava-onevision-qwen2-7b-ov,conv_template=qwen_1_5,model_name=llava_qwen,max_frames_num=1,torch_dype=bfloat16 \
    --tasks video_mmmu_adaptation_question_only \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix debug \
    --output_path ./logs/
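
Running both video_mmmu_adaptation and video_mmmu_adaptation_question_only lets you derive Δknowledge. Below is a minimal Python sketch, assuming the paper's definition of Δknowledge as a normalized accuracy gain; the accuracy values are placeholders and should be taken from your own lmms-eval logs.

# Δknowledge as a normalized accuracy gain between the two Adaptation runs.
# Both inputs are accuracies in percent; the numbers here are placeholders.
acc_with_video = 52.0      # video_mmmu_adaptation (lecture video provided)
acc_question_only = 40.0   # video_mmmu_adaptation_question_only (question image only)

delta_knowledge = 100.0 * (acc_with_video - acc_question_only) / (100.0 - acc_question_only)
print(f"Δknowledge = {delta_knowledge:+.1f}")   # -> Δknowledge = +20.0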

Adaptation Track setting

To ensure compatibility with LMMs-Eval, the image associated with each Adaptation Track question is appended as the last frame of the video, and a prompt notifies the model that the question image is located in that final frame.
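
If you want to sanity-check this setup, you can extract the appended question image yourself. A minimal Python sketch using OpenCV; the video path is a placeholder.

# Grab the last frame of an Adaptation-track video, which holds the question image.
# "adaptation_example.mp4" is a placeholder path.
import cv2

cap = cv2.VideoCapture("adaptation_example.mp4")
num_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
cap.set(cv2.CAP_PROP_POS_FRAMES, num_frames - 1)   # seek to the final frame
ok, last_frame = cap.read()
cap.release()
if ok:
    cv2.imwrite("question_image.png", last_frame)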

If you use an interleaved input format, you can instead manually insert the image (either the last frame of the video or the "image 1" field from the HF dataset) at the <image 1> placeholder, as sketched below.
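
A minimal Python sketch of such an interleaved prompt, assuming the Hugging Face dataset id lmms-lab/VideoMMMU with a "test" split and "question" / "image 1" field names; these are assumptions, so verify them against the dataset card.

# Build an interleaved [text, image, text] prompt for an Adaptation question.
# Dataset id, split, and field names below are assumptions; check the dataset card.
from datasets import load_dataset

ds = load_dataset("lmms-lab/VideoMMMU", split="test")
sample = ds[0]

question = sample["question"]        # contains the "<image 1>" placeholder (assumed field name)
question_image = sample["image 1"]   # PIL image of the question figure (assumed field name)

# Split the question text at the placeholder and interleave the image at that position.
before, _, after = question.partition("<image 1>")
interleaved_prompt = [before, question_image, after]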

Video-MMMU Leaderboard


We evaluate various open-source and proprietary LMMs. The table below provides a detailed comparison. To submit your model results, please send an email to [email protected].

Model Overall Perception Comprehension Adaptation Δknowledge
Human Expert 74.44 84.33 78.67 60.33 +33.1
Claude-3.5-Sonnet 65.78 72.00 69.67 55.67 +11.4
GPT-4o 61.22 66.00 62.00 55.67 +15.6
Qwen-2.5-VL-72B 60.22 69.33 61.00 50.33 +9.7
Gemini 1.5 Pro 53.89 59.00 53.33 49.33 +8.7
Aria 50.78 65.67 46.67 40.00 +3.2
Gemini 1.5 Flash 49.78 57.33 49.00 43.00 -3.3
LLaVA-Video-72B 49.67 59.67 46.00 43.33 +7.1
LLaVA-OneVision-72B 48.33 59.67 42.33 43.00 +6.6
Qwen-2.5-VL-7B 47.44 58.33 44.33 39.67 +2.2
mPLUG-Owl3-7B 42.00 49.33 38.67 38.00 +7.5
MAmmoTH-VL-8B 41.78 51.67 40.00 33.67 +1.5
InternVL2-8B 37.44 47.33 33.33 31.67 -8.5
LLaVA-Video-7B 36.11 41.67 33.33 33.33 -5.3
VILA1.5-40B 34.00 38.67 30.67 32.67 +9.4
LLaVA-OneVision-7B 33.89 40.00 31.00 30.67 -5.6
Llama-3.2-11B 30.00 35.67 32.33 22.00 -
LongVA-7B 23.98 24.00 24.33 23.67 -7.0
VILA1.5-8B 20.89 20.33 17.33 25.00 +5.9

Citation

@article{hu2025videommmu,
    title={Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos},
    author={Kairui Hu and Penghao Wu and Fanyi Pu and Wang Xiao and Yuanhan Zhang and Xiang Yue and Bo Li and Ziwei Liu},
    journal={arXiv preprint arXiv:2501.13826},
    year={2025},
    url={https://arxiv.org/abs/2501.13826}
}