ICONS

ICONS: Influence Consensus for Vision-Language Data Selection

Under construction 🚧

[paper][website][dataset]

Authors: Xindi Wu, Mengzhou Xia, Rulin Shao, Zhiwei Deng, Pang Wei Koh, Olga Russakovsky

We propose ICONS, a method for selecting vision-language data that optimizes training efficiency by identifying and prioritizing data samples that are consistently valuable across multiple tasks.

News 🔥

  • [01/25] We have released the LLaVA-ICONS-133K dataset on Hugging Face for public use.
  • [12/24] We have released the paper ICONS.

Table of Contents

  • Installation
  • Dataset Download
  • Selection
  • Training
  • Inference
  • Citation

Installation

First, clone the repository and navigate to the project directory:

git clone https://github.com/princetonvisualai/icons.git
cd icons

To set up the environment for ICONS and LLaVA training (https://github.com/haotian-liu/LLaVA/), you can either use the provided environment.yml file or create a Conda environment manually and install the dependencies:

conda create -n icons python=3.10 -y
conda activate icons
pip install --upgrade pip
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
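
Optionally, you can sanity-check the environment before moving on. The short Python snippet below is a minimal check; it assumes the editable install above provides LLaVA's llava package and that a CUDA-capable GPU is visible:

# Minimal environment sanity check (assumes `pip install -e .` provides the
# `llava` package and that flash-attn was installed as shown above).
import torch
import flash_attn
import llava  # noqa: F401

print("torch:", torch.__version__)
print("flash-attn:", flash_attn.__version__)
print("CUDA available:", torch.cuda.is_available())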

Dataset Download

The LLaVA-665K dataset is available via the Download Link.

The Cambrian-7M dataset is available via the Download Link.

Then follow the instructions in the original repositories to download the corresponding image data.

You can split the data into random chunks for parallel gradient computation using the SLURM scripts. For efficient processing, request as many CPUs as possible (e.g., 96), since the splitting operation is CPU-intensive and parallelizes well: for example, splitting the 7M-sample Cambrian dataset into 3,000 chunks with 96 CPUs takes about 10-15 minutes. A conceptual sketch of the splitting utility follows the commands below.

# Split the LLaVA-665K dataset into chunks (request 32+ CPUs for faster processing)
python utils/split.py path/to/llava_v1_5_mix665k.json data/llava_665k_splits --num-splits 200

# Split the Cambrian-7M dataset into chunks (request 32+ CPUs for faster processing)
python utils/split.py path/to/Cambrian7M_withsystemprompt.jsonl data/cambrian_7m_splits --num-splits 3000
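
For reference, the splitting step conceptually just shards the annotation file into roughly equally sized random pieces. The sketch below illustrates the idea; it is not the actual utils/split.py, and the argument names and chunk file naming are hypothetical:

# Sketch: shard a LLaVA/Cambrian-style annotation file into N random chunks.
# Illustrative stand-in for utils/split.py; argument names are hypothetical.
import argparse
import json
import random
from pathlib import Path

def load_records(path: Path):
    # .json files hold one list of records; .jsonl files hold one record per line.
    if path.suffix == ".jsonl":
        with path.open() as f:
            return [json.loads(line) for line in f if line.strip()]
    with path.open() as f:
        return json.load(f)

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("input_file", type=Path)
    parser.add_argument("output_dir", type=Path)
    parser.add_argument("--num-splits", type=int, default=200)
    args = parser.parse_args()

    records = load_records(args.input_file)
    random.seed(0)
    random.shuffle(records)  # "random chunks", reproducibly

    args.output_dir.mkdir(parents=True, exist_ok=True)
    for idx in range(args.num_splits):
        chunk = records[idx :: args.num_splits]  # near-equal chunk sizes
        out_path = args.output_dir / f"chunk_{idx:04d}.json"
        out_path.write_text(json.dumps(chunk))

if __name__ == "__main__":
    main()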

Selection

The ICONS pipeline consists of two main stages:

Stage 1: Specialist (Computing Task-Specific Influence)

  1. Compute Training Data Gradients

    # Submit SLURM jobs for processing training data chunks
    sbatch './scripts/0_slurm_train_grads.sh' 500  # or use another checkpoint; here we use the warmed-up model trained for 500 steps
  2. Merge Gradient Files

    bash ./scripts/1_merge_train_gradient.sh
  3. Process Validation Data

    bash ./scripts/2_get_val_data_grads_all.sh

    Alternatively, if you have access to a SLURM-enabled system, you can run:

    sbatch ./scripts/2_slurm_get_val_data_grads.sh 

    Remember to specify the model path and data path.

  4. Compute Influence Matrices (a conceptual sketch of this computation follows the list)

    bash ./scripts/3_specialist.sh
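
At a high level, the specialist stage scores every training example against each target task by comparing gradient features, e.g., via the similarity between a training example's (projected) gradient and the task's aggregated validation gradient. The sketch below is a conceptual illustration of that step, not the exact logic of ./scripts/3_specialist.sh; the file names are hypothetical and assume the gradients were already saved as 2-D tensors:

# Sketch: task-specific influence scores from stored gradient features.
# Assumes merged training gradients (num_train x proj_dim) and per-task
# validation gradients (num_tasks x proj_dim) were saved in earlier steps.
import torch
import torch.nn.functional as F

def influence_matrix(train_grads: torch.Tensor, val_grads: torch.Tensor) -> torch.Tensor:
    # Cosine similarity between every training gradient and every task gradient.
    train = F.normalize(train_grads.float(), dim=-1)
    val = F.normalize(val_grads.float(), dim=-1)
    return train @ val.T  # shape: (num_train_examples, num_tasks)

if __name__ == "__main__":
    train_grads = torch.load("data/merged_train_grads.pt")  # hypothetical path
    val_grads = torch.load("data/val_task_grads.pt")        # hypothetical path
    torch.save(influence_matrix(train_grads, val_grads), "data/influence_matrix.pt")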

Stage 2: Generalist (Influence Consensus)

  1. Generate Consensus (see the conceptual sketch after this step)
    bash ./scripts/4_generalist.sh
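
The generalist stage aggregates the task-specific influence scores into a single selection. One simple way to realize influence consensus is to let each task vote for its top-ranked training examples and keep the examples with the most votes; the sketch below illustrates that idea and is not the exact logic of ./scripts/4_generalist.sh (the top-k value, budget, and file names are illustrative):

# Sketch: consensus selection by voting across task-specific rankings.
# Each example gets one vote per task where it lands in that task's top-k;
# the `budget` examples with the most votes are kept.
import torch

def consensus_select(scores: torch.Tensor, top_k: int, budget: int) -> torch.Tensor:
    # scores: (num_train_examples, num_tasks) influence matrix from Stage 1.
    num_examples, num_tasks = scores.shape
    votes = torch.zeros(num_examples)
    for t in range(num_tasks):
        votes[scores[:, t].topk(top_k).indices] += 1
    return votes.topk(budget).indices

if __name__ == "__main__":
    scores = torch.load("data/influence_matrix.pt")  # from the Stage 1 sketch
    selected = consensus_select(scores, top_k=20_000, budget=133_000)
    torch.save(selected, "data/selected_indices.pt")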

Training

We follow the training pipeline from LLaVA's official repository and use the selected data for training. The training script for LLaVA-1.5-7B is located in ./llava_train_scripts/finetune_lora.py.

Before training, download the required checkpoint files:

⚠️ Note: We use the LLaVA model checkpoints from before the visual instruction tuning stage (i.e., before training on the 665K instruction data). These checkpoints only contain the pretrained vision-language alignment weights.

7B Vicuna Model + Projector Checkpoint Download

# Download the mm_projector.bin file for LLaVA-1.5-7B training
mkdir -p checkpoints/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5

wget https://huggingface.co/liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5/resolve/main/mm_projector.bin -P checkpoints/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5

# Download Vicuna-7B-v1.5 base model
git clone https://huggingface.co/lmsys/vicuna-7b-v1.5 checkpoints/vicuna-7b-v1.5

13B Vicuna Model + Projector Checkpoint Download

# Download the mm_projector.bin file for LLaVA-1.5-13B training
mkdir -p checkpoints/llava-v1.5-mlp2x-336px-pretrain-vicuna-13b-v1.5

wget https://huggingface.co/liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-13b-v1.5/resolve/main/mm_projector.bin -P checkpoints/llava-v1.5-mlp2x-336px-pretrain-vicuna-13b-v1.5

# Download Vicuna-13B-v1.5 base model
git clone https://huggingface.co/lmsys/vicuna-13b-v1.5 checkpoints/vicuna-13b-v1.5

8B Llama-3 Model + Projector Checkpoint Download

# Create the checkpoint directory for LLaVA-Llama-3-8B training
mkdir -p checkpoints/llava-llama-3-8b

# Download the LLaVA-Llama-3-8B model from xtuner
git clone https://huggingface.co/xtuner/llava-llama-3-8b checkpoints/llava-llama-3-8b

To start training with the LLaVA-1.5-7B model:

sh llava_train_scripts/7b_finetune_lora.sh

Follow the instructions in the terminal to set the data_path and output_dir.

Inference

For inference after training with selected data, you can choose one of the following two options:

  1. Use the standard evaluation pipeline from LLaVA's official evaluation script.

  2. Use lmms-eval for comprehensive evaluation; it supports evaluation on dozens of public datasets and allows new datasets to be onboarded easily.

Citation

If you find this repository useful for your research, please cite with the following BibTeX entry:

@article{wu2024icons,
  title={ICONS: Influence Consensus for Vision-Language Data Selection},
  author={Wu, Xindi and Xia, Mengzhou and Shao, Rulin and Deng, Zhiwei and Koh, Pang Wei and Russakovsky, Olga},
  journal={arXiv preprint arXiv:2501.00654},
  year={2024}
}
