ICONS: Influence Consensus for Vision-Language Data Selection
Under construction 🚧
Authors: Xindi Wu, Mengzhou Xia, Rulin Shao, Zhiwei Deng, Pang Wei Koh, Olga Russakovsky
We propose ICONS, a method for selecting vision-language data that optimizes training efficiency by identifying and prioritizing data samples that are consistently valuable across multiple tasks.
- [01/25] We have released the LLaVA-ICONS-133K dataset on Hugging Face for public use.
- [12/24] We have released the ICONS paper.
First, clone the repository and navigate to the project directory:
git clone https://github.com/princetonvisualai/icons.git
cd icons
To set up the environment for ICONS and LLaVA training (https://github.com/haotian-liu/LLaVA/), you can use the provided environment.yml file, or create a Conda environment manually as follows:
conda create -n icons python=3.10 -y
conda activate icons
pip install --upgrade pip
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
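After installation, you can sanity-check that the core dependencies import correctly. This is only a minimal check (the version attributes below assume standard PyTorch and flash-attn installs):
# quick_check.py -- verify that PyTorch and FlashAttention import and CUDA is visible
import torch
import flash_attn

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("flash-attn", flash_attn.__version__)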
The LLaVA-665K dataset is available at the Download Link.
The Cambrian-7M dataset is available at the Download Link.
Then follow the original dataset repositories to download the corresponding image data.
You can split the data into random chunks for parallel gradient computation using the provided SLURM scripts. For efficient processing, request as many CPUs as possible (e.g., 96), since the splitting operation is CPU-intensive and parallelizes well. For example, splitting the 7M-sample Cambrian dataset into 3000 chunks with 96 CPUs takes about 10-15 minutes.
# Split the LLaVA-665K dataset into chunks (request 32+ CPUs for faster processing)
python utils/split.py path/to/llava_v1_5_mix665k.json data/llava_665k_splits --num-splits 200
# Split the Cambrian-7M dataset into chunks (request 32+ CPUs for faster processing)
python utils/split.py path/to/Cambrian7M_withsystemprompt.jsonl data/cambrian_7m_splits --num-splits 3000
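For reference, the chunking itself is conceptually simple. The sketch below illustrates the idea for a JSON-list annotation file such as llava_v1_5_mix665k.json; it is not the actual utils/split.py, which also handles the JSONL format and parallel processing:
# split_sketch.py -- minimal illustration of splitting an annotation list into N chunks
import json
import math
import os
import sys

def split_json(input_path: str, output_dir: str, num_splits: int) -> None:
    with open(input_path) as f:
        data = json.load(f)  # a single JSON list of annotation records
    os.makedirs(output_dir, exist_ok=True)
    chunk_size = math.ceil(len(data) / num_splits)
    for i in range(num_splits):
        chunk = data[i * chunk_size : (i + 1) * chunk_size]
        with open(os.path.join(output_dir, f"chunk_{i}.json"), "w") as f:
            json.dump(chunk, f)

if __name__ == "__main__":
    # e.g. python split_sketch.py path/to/llava_v1_5_mix665k.json data/llava_665k_splits 200
    split_json(sys.argv[1], sys.argv[2], int(sys.argv[3]))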
The ICONS pipeline consists of two main stages: a specialist stage that computes task-specific influence scores over the training data, and a generalist stage that aggregates these scores into a cross-task consensus. Run the steps below in order; a conceptual sketch of the two stages follows the list.
- Compute Training Data Gradients
# Submit SLURM jobs for processing the training data chunks.
# Here we use the warmed-up model checkpoint trained for 500 steps; other checkpoints can be used.
sbatch './scripts/0_slurm_train_grads.sh' 500
- Merge Gradient Files
bash ./scripts/1_merge_train_gradient.sh
- Process Validation Data
bash ./scripts/2_get_val_data_grads_all.sh
Alternatively, if you have access to a SLURM-enabled system, you can run:
sbatch ./scripts/2_slurm_get_val_data_grads.sh
Remember to specify the model path and data path in the script.
- Compute Influence Matrices
bash ./scripts/3_specialist.sh
- Generate Consensus
bash ./scripts/4_generalist.sh
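Conceptually, the specialist stage (through 3_specialist.sh) scores each training sample for each target task by the similarity between the sample's gradient and the task's validation gradients, and the generalist stage (4_generalist.sh) selects samples that rank highly across many tasks by majority voting. The sketch below only illustrates this idea: the influence formulation, the top-k% voting threshold, and the data layout are assumptions, not the repo's exact implementation.
# Conceptual sketch of the two ICONS stages (illustrative only).
# Assumes precomputed gradient features:
#   train_grads: (N, d) array of per-sample training gradients (from the gradient steps above)
#   val_grads:   dict mapping task name -> (M_t, d) array of validation gradients
import numpy as np

def specialist_scores(train_grads: np.ndarray, task_val_grads: np.ndarray) -> np.ndarray:
    """Task-specific influence: mean cosine similarity between each training
    sample's gradient and the task's validation gradients. Returns shape (N,)."""
    t = train_grads / (np.linalg.norm(train_grads, axis=1, keepdims=True) + 1e-8)
    v = task_val_grads / (np.linalg.norm(task_val_grads, axis=1, keepdims=True) + 1e-8)
    return (t @ v.T).mean(axis=1)

def generalist_consensus(score_matrix: np.ndarray, top_percent: float = 0.2) -> np.ndarray:
    """Majority-vote consensus: every task votes for its top-k% samples, and samples
    are ranked by vote count (ties broken by mean influence). Returns sample indices,
    most consistently valuable first."""
    n_samples = score_matrix.shape[0]
    k = max(1, int(top_percent * n_samples))
    votes = np.zeros(n_samples)
    for task_scores in score_matrix.T:            # one column per task
        votes[np.argsort(task_scores)[-k:]] += 1  # vote for the task's top-k samples
    order = np.lexsort((score_matrix.mean(axis=1), votes))  # sort by votes, then mean score
    return order[::-1]

# Usage: build an (N, num_tasks) score matrix, then keep e.g. the top 20% by consensus.
# score_matrix = np.stack([specialist_scores(train_grads, val_grads[t]) for t in tasks], axis=1)
# selected = generalist_consensus(score_matrix)[: int(0.2 * score_matrix.shape[0])]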
We follow the training pipeline from LLaVA's official repository and use the selected data for training. The training scripts are located in ./llava_train_scripts/; for LLaVA-1.5-7B, use 7b_finetune_lora.sh (see below).
Before training, download the required checkpoint files:
⚠️ Note: We use the LLaVA model checkpoints from before the visual instruction tuning stage (i.e., before training on the 665K instruction data). These checkpoints only contain the pretrained vision-language alignment weights.
7B Vicuna Model + Projector Checkpoint Download
# Download the mm_projector.bin file for LLaVA-1.5-7B training
mkdir -p checkpoints/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5
wget https://huggingface.co/liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5/resolve/main/mm_projector.bin -P checkpoints/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5
# Download Vicuna-7B-v1.5 base model
git clone https://huggingface.co/lmsys/vicuna-7b-v1.5 checkpoints/vicuna-7b-v1.5
13B Vicuna Model + Projector Checkpoint Download
# Download the mm_projector.bin file for LLaVA-1.5-13B training
mkdir -p checkpoints/llava-v1.5-mlp2x-336px-pretrain-vicuna-13b-v1.5
wget https://huggingface.co/liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-13b-v1.5/resolve/main/mm_projector.bin -P checkpoints/llava-v1.5-mlp2x-336px-pretrain-vicuna-13b-v1.5
# Download Vicuna-13B-v1.5 base model
git clone https://huggingface.co/lmsys/vicuna-13b-v1.5 checkpoints/vicuna-13b-v1.5
8B Llama-3 Model + Projector Checkpoint Download
# Download the mm_projector.bin file for LLaVA-Llama-3-8B training
mkdir -p checkpoints/llava-llama-3-8b
# Download Llama-3-8B base model
git clone https://huggingface.co/xtuner/llava-llama-3-8b checkpoints/llava-llama-3-8b
To start training with the LLaVA-1.5-7B model:
sh llava_train_scripts/7b_finetune_lora.sh
Follow the instructions printed in the terminal to set data_path (the selected training data) and output_dir.
For inference after training with the selected data, you can choose one of the following two options:
- Use the standard evaluation pipeline from LLaVA's official evaluation scripts.
- Use lmms-eval for comprehensive evaluation; it supports evaluation on dozens of public datasets and allows new dataset onboarding.
If you find this repository useful for your research, please cite with the following BibTeX entry:
@article{wu2024icons,
  title={ICONS: Influence Consensus for Vision-Language Data Selection},
  author={Wu, Xindi and Xia, Mengzhou and Shao, Rulin and Deng, Zhiwei and Koh, Pang Wei and Russakovsky, Olga},
  journal={arXiv preprint arXiv:2501.00654},
  year={2024}
}