
TinyLLaVA-Video


🎉 News

📌 About

This is a framework of Small-scale Large Multimodal Models for video understanding based on TinyLLaVA_Factory.

[Figure: TinyLLaVA-Video architecture]

  • The model has no more than 4B parameters and processes video sequences in a simple manner, without complex architectures, supporting both fps sampling and uniform frame sampling (a minimal sketch of the two sampling strategies follows this list).
  • We validate the effectiveness of this framework through experiments; the best model achieves performance comparable to some existing 7B models on multiple video understanding benchmarks.
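
The two strategies differ only in how frame indices are chosen before frames are passed to the vision tower. The sketch below is illustrative and not taken from this repository; the frame counts and fps values are assumptions.

```python
# Illustrative sketch of uniform frame sampling vs. fps sampling.
# Not code from this repository; num_frames / fps values are example assumptions.
import numpy as np


def uniform_indices(total_frames: int, num_frames: int = 16) -> np.ndarray:
    """Pick `num_frames` indices spread evenly over the whole video."""
    return np.linspace(0, total_frames - 1, num=num_frames, dtype=int)


def fps_indices(total_frames: int, video_fps: float, target_fps: float = 1.0) -> np.ndarray:
    """Pick roughly one frame every 1/target_fps seconds of video."""
    step = max(int(round(video_fps / target_fps)), 1)
    return np.arange(0, total_frames, step)


if __name__ == "__main__":
    # A 30-second clip at 30 fps has 900 frames.
    print(uniform_indices(total_frames=900, num_frames=16))  # 16 evenly spaced indices
    print(fps_indices(total_frames=900, video_fps=30.0))     # ~1 frame per second -> 30 indices
```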

Installation and Requirements

  1. Clone this repository and navigate to the folder

    git clone https://github.com/ZhangXJ199/TinyLLaVA-Video.git
    cd TinyLLaVA-Video

  2. Create a conda environment, activate it, and install packages

    conda create -n tinyllava_video python=3.10 -y
    conda activate tinyllava_video
    pip install --upgrade pip  # enable PEP 660 support
    pip install -e .

  3. Install additional packages

    pip install flash-attn --no-build-isolation

To upgrade to the latest code base:

    git pull
    pip install -e .

Get Started

1. Data Preparation

We combine partial data from two datasets: LLaVA-Video-178K and Valley.

| Stage | Source | #Sample |
| --- | --- | --- |
| Pretrain | LLaVA-Video-178K + Valley | 397k |
| Finetune | LLaVA-Video-178K | 491k |

Pretrain Data

We use four subsets of LLaVA-Video-178K: 0_30_s_academic_v0_1, 30_60_s_academic_v0_1, 0_30_s_youtube_v0_1, and 30_60_s_youtube_v0_1, supplemented with the filtered Video-LLaVA. The organized pretraining annotations can be downloaded from here.

Finetune Data

We use four subsets of LLaVA-Video-178K: 0_30_s_academic_v0_1, 30_60_s_academic_v0_1, 0_30_s_youtube_v0_1, and 30_60_s_youtube_v0_1. The organized finetune annotations can be downloaded from here.

Organize Data

Organize the video files and annotation files as follows in path/to/your/dataset:

dataset
├── academic_source
├── liwei_youtube_videos
├── valley
├── text_files
│   ├── cleaned_video_caption.json
│   ├── cleaned_video_openqa.json
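
As a quick sanity check before training, you can verify that the layout above is in place. The helper below is optional and not part of the repository; path/to/your/dataset is a placeholder.

```python
# Optional sanity check for the dataset layout described above.
# Not part of the repository; replace dataset_root with your own path.
import os

dataset_root = "path/to/your/dataset"  # placeholder from the README

expected = [
    "academic_source",
    "liwei_youtube_videos",
    "valley",
    "text_files/cleaned_video_caption.json",
    "text_files/cleaned_video_openqa.json",
]

for rel_path in expected:
    full_path = os.path.join(dataset_root, rel_path)
    status = "ok" if os.path.exists(full_path) else "MISSING"
    print(f"{status:7s} {full_path}")
```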

2. Train

You can refer to TinyLLaVA_Factory to modify components such as "llm," "vision_tower," and "train_recipe."

Here is an example of training an LMM using Qwen2.5.

  • Replace data paths with yours in scripts/train/qwen2/train_qwen2_base_video.sh
  • Replace output_dir with yours in scripts/train/qwen2/pretrain_qwen2_video.sh
  • Replace pretrained_model_path and output_dir with yours in scripts/train/qwen2/finetune_qwen2_video.sh
  • Adjust your GPU ids (localhost) and per_device_train_batch_size in scripts/train/qwen2/pretrain_qwen2_video.sh and scripts/train/qwen2/finetune_qwen2_video.sh

Then launch training with:

    bash scripts/train/qwen2/train_qwen2_base_video.sh

Important hyperparameters used in pretraining and finetuning are provided below.

| Training Stage | Global Batch Size | Learning Rate | conv_version |
| --- | --- | --- | --- |
| Pretraining | 128 | 1e-4 | pretrain |
| Finetuning | 64 | 2e-5 | qwen2_base |

Tips:

Global batch size = number of GPUs * per_device_train_batch_size * gradient_accumulation_steps. We recommend keeping the global batch size and learning rate as above, except when LoRA-tuning your model.
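
For example, the pretraining global batch size of 128 can be reached with the combination below; the GPU count, per-device batch size, and accumulation steps are illustrative assumptions, not values prescribed by the repository.

```python
# Worked example of the global batch size formula above.
# The GPU count, per-device batch size, and accumulation steps are
# illustrative assumptions, not values prescribed by the repository.
num_gpus = 8
per_device_train_batch_size = 4
gradient_accumulation_steps = 4

global_batch_size = num_gpus * per_device_train_batch_size * gradient_accumulation_steps
print(global_batch_size)  # 128, matching the pretraining setting in the table above
```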

3. Evaluation

We currently provide evaluations on four benchmarks: Video-MME, MVBench, LongVideoBench, and MLVU.

Video-MME

  1. Download Video-MME and put it under path/to/your/dataset/eval/Video-MME.
  2. Please change MODEL_PATH, MODEL_NAME, EVAL_DIR, conv-mode and duration in scripts/eval/videomme.sh. There are three types of duration available for testing: short, medium, and long.
  3. Please use the following command for single-gpu inference.
    CUDA_VISIBLE_DEVICES=0 bash scripts/eval/videomme.sh

MVBench

  1. Download MVBench and put it under path/to/your/dataset/eval/MVBench.
  2. Please change MODEL_PATH, MODEL_NAME, EVAL_DIR and conv-mode in scripts/eval/mvbench.sh.
  3. Please use the following command for single-gpu inference.
    CUDA_VISIBLE_DEVICES=0 bash scripts/eval/mvbench.sh

LongVideoBench

  1. Download LongVideoBench and put it under path/to/your/dataset/eval/LongVideoBench.
  2. Please change MODEL_PATH, MODEL_NAME, EVAL_DIR and conv-mode in scripts/eval/lvbench.sh.
  3. Please use the following command for single-gpu inference.
    CUDA_VISIBLE_DEVICES=0 bash scripts/eval/lvbench.sh

MLVU

  1. Download MLVU and put it under path/to/your/dataset/eval/MLVU.
  2. Please change MODEL_PATH, MODEL_NAME, EVAL_DIR and conv-mode in scripts/eval/mlvu.sh.
  3. Please use the following command for single-gpu inference.
    CUDA_VISIBLE_DEVICES=0 bash scripts/eval/mlvu.sh

Model Zoo

Trained Models

Here, 16 means sampling 16 frames, and 512 means using 512 tokens (queries) to represent the video sequence.

Model Performance

| VT (HF Path) | LLM (HF Path) | #Frame/Query | Video-MME | MVBench | LongVideoBench | MLVU |
| --- | --- | --- | --- | --- | --- | --- |
| google/siglip-so400m-patch14-384 | Qwen/Qwen2.5-3B | 16/512 | 44.7 | 42.5 | 37.6 | 48.1 |
| google/siglip-so400m-patch14-384 | Qwen/Qwen2.5-3B | 1fps/1024 | 44.6 | 40.4 | 35.3 | 45.9 |
| google/siglip-so400m-patch14-384 | microsoft/phi-2 | 16/512 | 42.7 | 42.0 | 42.2 | 46.5 |
| google/siglip-so400m-patch14-384 | Qwen/Qwen2.5-1.5B | 16/512 | 34.4 | 39.0 | 29.5 | 40.5 |

Quick Inference Scripts

  1. Please change model_path, prompt, video_file and conv-mode in eval.py.
  2. Please use the following command for single-gpu inference.
    CUDA_VISIBLE_DEVICES=0 python eval.py
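
If you prefer launching this from Python (for example, from a notebook), the minimal wrapper below just sets the GPU and invokes eval.py. It is a convenience sketch, not part of the repository, and assumes model_path, prompt, video_file, and conv-mode have already been edited in eval.py as described in step 1.

```python
# Minimal wrapper around the quick-inference command above.
# Not part of the repository; eval.py must already be configured
# (model_path, prompt, video_file, conv-mode) as described in step 1.
import os
import subprocess

env = os.environ.copy()
env["CUDA_VISIBLE_DEVICES"] = "0"  # single-GPU inference, as in the README

subprocess.run(["python", "eval.py"], env=env, check=True)
```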

❤️ Community efforts

  • This repository is based on the TinyLLaVA_Factory project.
  • Our codebase is built upon the LLaVA project. Great work!
