- [2025-01] 🎉 Our arXiv paper TinyLLaVA-Video: A Simple Framework of Small-scale Large Multimodal Models for Video Understanding is released!
- [2024-12] 🔊 Our TinyLLaVA-Video-v1 repository has been established.
This is a framework of Small-scale Large Multimodal Models for video understanding based on TinyLLaVA_Factory.
- The model has no more than 4B parameters and processes video sequences in a simple manner, without the need for complex architectures, supporting both fps sampling and uniform frame sampling (a sketch of both sampling schemes follows this list).
- We validate the effectiveness of this framework through experiments; the best model achieves performance comparable to some existing 7B models on multiple video understanding benchmarks.
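For intuition, here is a minimal sketch of the two sampling schemes (uniform frame sampling and fps sampling). This is an illustration only, not the repository's implementation; the function names and defaults are assumptions.

```python
import numpy as np

# Illustrative sketch only; not the repository's actual sampling code.
def uniform_indices(total_frames: int, num_frames: int = 16) -> np.ndarray:
    """Uniform sampling: pick `num_frames` indices evenly spaced across the video."""
    return np.linspace(0, total_frames - 1, num=num_frames, dtype=int)

def fps_indices(total_frames: int, video_fps: float, sample_fps: float = 1.0) -> np.ndarray:
    """fps sampling: pick roughly `sample_fps` frames per second of video."""
    step = max(int(round(video_fps / sample_fps)), 1)
    return np.arange(0, total_frames, step)

print(uniform_indices(900, 16))     # 16 evenly spaced frames from a 900-frame clip
print(fps_indices(900, 30.0, 1.0))  # ~1 frame per second from a 30-fps clip
```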
- Clone this repository and navigate to the folder

```bash
git clone https://github.com/ZhangXJ199/TinyLLaVA-Video.git
cd TinyLLaVA-Video
```
- Create a conda environment, activate it, and install packages
```bash
conda create -n tinyllava_video python=3.10 -y
conda activate tinyllava_video
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```
- Install additional packages
```bash
pip install flash-attn --no-build-isolation
git pull
pip install -e .
```
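Optionally, you can sanity-check the environment after installation. This is a minimal, unofficial check (only `torch` and `flash_attn` are assumed to be installed by the steps above):

```python
# Minimal post-install sanity check (not an official step).
import torch
import flash_attn  # should import cleanly if flash-attn was built successfully

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)
```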
We combine partial data from two datasets: LLaVA-Video-178K and Valley.
| Stage | Source | #Sample |
|---|---|---|
| Pretrain | LLaVA-Video-178K + Valley | 397k |
| Finetune | LLaVA-Video-178K | 491k |
We use four subsets of LLaVA-Video-178K: `0_30_s_academic_v0_1`, `30_60_s_academic_v0_1`, `0_30_s_youtube_v0_1`, and `30_60_s_youtube_v0_1`, supplemented with the filtered Video-LLaVA data. The organized pretraining annotations can be downloaded from here.
We use four subsets of LLaVA-Video-178K: `0_30_s_academic_v0_1`, `30_60_s_academic_v0_1`, `0_30_s_youtube_v0_1`, and `30_60_s_youtube_v0_1`. The organized finetune annotations can be downloaded from here.
Organize the video files and annotation files as follows in `path/to/your/dataset`:
```
dataset
├── academic_source
├── liwei_youtube_videos
├── valley
├── text_files
│   ├── cleaned_video_caption.json
│   ├── cleaned_video_openqa.json
```
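The following is a hypothetical helper (not part of the repository) that checks whether the expected directories and annotation files are in place under your dataset root:

```python
from pathlib import Path

# Hypothetical layout check; set dataset_root to your actual path.
dataset_root = Path("path/to/your/dataset")
expected_dirs = ["academic_source", "liwei_youtube_videos", "valley", "text_files"]
expected_files = ["text_files/cleaned_video_caption.json", "text_files/cleaned_video_openqa.json"]

for d in expected_dirs:
    print(f"{d}: {'ok' if (dataset_root / d).is_dir() else 'MISSING'}")
for f in expected_files:
    print(f"{f}: {'ok' if (dataset_root / f).is_file() else 'MISSING'}")
```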
You can refer to TinyLLaVA_Factory to modify components such as `llm`, `vision_tower`, and `train_recipe`. Here is an example of training an LMM using Qwen2.5.
- Replace data paths with yours in `scripts/train/qwen2/train_qwen2_base_video.sh`.
- Replace `output_dir` with yours in `scripts/train/qwen2/pretrain_qwen2_video.sh`.
- Replace `pretrained_model_path` and `output_dir` with yours in `scripts/train/qwen2/finetune_qwen2_video.sh`.
- Adjust your GPU ids (localhost) and `per_device_train_batch_size` in `scripts/train/qwen2/pretrain_qwen2_video.sh` and `scripts/train/qwen2/finetune_qwen2_video.sh`.
```bash
bash scripts/train/qwen2/train_qwen2_base_video.sh
```
Important hyperparameters used in pretraining and finetuning are provided below.
| Training Stage | Global Batch Size | Learning Rate | conv_version |
|---|---|---|---|
| Pretraining | 128 | 1e-4 | pretrain |
| Finetuning | 64 | 2e-5 | qwen2_base |
Tips:

Global batch size = number of GPUs × `per_device_train_batch_size` × `gradient_accumulation_steps`. We recommend always keeping the global batch size and learning rate as above, except when LoRA-tuning your model.
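As a worked example of the formula above (the GPU count and per-device settings here are illustrative, not the official configuration):

```python
# Illustrative numbers only: 8 GPUs x 4 per-device batch x 4 accumulation steps.
num_gpus = 8
per_device_train_batch_size = 4
gradient_accumulation_steps = 4

global_batch_size = num_gpus * per_device_train_batch_size * gradient_accumulation_steps
print(global_batch_size)  # 128, matching the pretraining setting above
```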
We currently provide evaluations on four benchmarks: Video-MME, MVBench, LongVideoBench, and MLVU.
- Download Video-MME and put it under `path/to/your/dataset/eval/Video-MME`.
- Please change `MODEL_PATH`, `MODEL_NAME`, `EVAL_DIR`, `conv-mode`, and `duration` in `scripts/eval/videomme.sh`. There are three types of `duration` available for testing: `short`, `medium`, and `long`.
- Please use the following command for single-GPU inference.

```bash
CUDA_VISIBLE_DEVICES=0 bash scripts/eval/videomme.sh
```
- Download MVBench and put it under `path/to/your/dataset/eval/MVBench`.
- Please change `MODEL_PATH`, `MODEL_NAME`, `EVAL_DIR`, and `conv-mode` in `scripts/eval/mvbench.sh`.
- Please use the following command for single-GPU inference.

```bash
CUDA_VISIBLE_DEVICES=0 bash scripts/eval/mvbench.sh
```
- Download LongVideoBench and put it under `path/to/your/dataset/eval/LongVideoBench`.
- Please change `MODEL_PATH`, `MODEL_NAME`, `EVAL_DIR`, and `conv-mode` in `scripts/eval/lvbench.sh`.
- Please use the following command for single-GPU inference.

```bash
CUDA_VISIBLE_DEVICES=0 bash scripts/eval/lvbench.sh
```
- Download MLVU and put it under `path/to/your/dataset/eval/MLVU`.
- Please change `MODEL_PATH`, `MODEL_NAME`, `EVAL_DIR`, and `conv-mode` in `scripts/eval/mlvu.sh`.
- Please use the following command for single-GPU inference.

```bash
CUDA_VISIBLE_DEVICES=0 bash scripts/eval/mlvu.sh
```
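If you want to run all four benchmarks back to back, a simple unofficial wrapper like the one below works, assuming the per-benchmark variables in each script have already been set as described above:

```python
import os
import subprocess

# Unofficial convenience wrapper: run each evaluation script on GPU 0 in sequence.
scripts = [
    "scripts/eval/videomme.sh",
    "scripts/eval/mvbench.sh",
    "scripts/eval/lvbench.sh",
    "scripts/eval/mlvu.sh",
]
env = {**os.environ, "CUDA_VISIBLE_DEVICES": "0"}
for script in scripts:
    subprocess.run(["bash", script], env=env, check=True)
```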
Here, 16 means sampling 16 frames, and 512 means using 512 tokens (queries) to represent the video sequence.
| VT (HF Path) | LLM (HF Path) | #Frame/Query | Video-MME | MVBench | LongVideoBench | MLVU |
|---|---|---|---|---|---|---|
| google/siglip-so400m-patch14-384 | Qwen/Qwen2.5-3B | 16/512 | 44.7 | 42.5 | 37.6 | 48.1 |
| google/siglip-so400m-patch14-384 | Qwen/Qwen2.5-3B | 1fps/1024 | 44.6 | 40.4 | 35.3 | 45.9 |
| google/siglip-so400m-patch14-384 | microsoft/phi-2 | 16/512 | 42.7 | 42.0 | 42.2 | 46.5 |
| google/siglip-so400m-patch14-384 | Qwen/Qwen2.5-1.5B | 16/512 | 34.4 | 39.0 | 29.5 | 40.5 |
- Please change `model_path`, `prompt`, `video_file`, and `conv-mode` in `eval.py`.
- Please use the following command for single-GPU inference.

```bash
CUDA_VISIBLE_DEVICES=0 python eval.py
```
- This repository is based on the TinyLLaVA_Factory project.
- Our codebase is built upon the LLaVA project. Great work!