International Conference on Learning Representations (ICLR) 2025
Welcome to the official repository for our paper: Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark. Explore our project page here and the benchmark toolkits here!
Authors: Tsung-Han Wu, Giscard Biamby, Jerome Quenum, Ritwik Gupta, Joseph E. Gonzalez, Trevor Darrell, David M. Chan at UC Berkeley.
Visual Haystacks (VHs) Benchmark Dataset: π€ tsunghanwu/visual_haystacks, π Github Repo
Model Checkpoint: π€tsunghanwu/mirage-llama3.1-8.3B
This paper addresses the challenge of answering questions across tens of thousands of images. Through extensive experiments through our Visual Haystacks (VHs) benchmark, we demonstrated that existing Large Multimodal Models (LMMs) struggle with inputs exceeding 100 images due to API limitations, context overflow, or hardware constraints on 4 A100 GPUs. These models often face issues such as visual distractions, cross-image reasoning difficulties, and positional biases. To overcome these challenges, we developed MIRAGE (8.3B), a pioneering, open-source visual-RAG baseline model based on LMMs capable of handling tens of thousands of images. In brief, MIRAGE integrates a compressor module that reduces image tokens by 18x, a dynamic query-aware retriever to filter irrelevant images, and a custom-trained LMM that can do multi-image reasoning. MIRAGE sets a new standard in open-source performance on the Visual Haystacks (VHs) benchmark and delivers solid results on both single- and multi-image question answering tasks.
- Clone this repository and navigate to mirage folder
git clone https://github.com/visual-haystacks/mirage.git
cd mirage
- Install Package
conda create -n mirage python=3.10 -y
conda activate mirage
pip install --upgrade pip # enable PEP 660 support
pip install -e .
- Install additional packages for training cases
pip install -e ".[train]"
pip install flash-attn --no-build-isolation --no-cache-dir
- Model Checkpoint: π€tsunghanwu/mirage-llama3.1-8.3B
- Demo code (single test case):
CUDA_VISIBLE_DEVICES=X python3 demo.py --model-path [huggingface model id or local path] --image-folder [local image folder] --prompt [prompt path]
- Hereβs a sample output from MIRAGE using some photos I took on my iPhone. (Feel free to give it a star if you think my cat is adorable! πΊβ¨)
- For Visual Haystacks (VHs), download the data from π€ tsunghanwu/visual_haystacks and place it in
playground/data/eval/visual_haystacks
. - For single-image QA, download all data according to LLaVA's instructions.
- For multi-image QA, download RETVQA's test set and place it in
playground/data/eval/retvqa
. For evaluation, refer to RETVQA's GitHub Repo.
In summary, the data structure of playground/data/eval
should look like this:
Show Data Structure
playground/data/eval/ βββ gqa β βββ answers β βββ data # directory β βββ llava_gqa_testdev_balanced.jsonl βββ mmbench β βββ answers β βββ answers_upload β βββ mmbench_dev_20230712.tsv βββ mmbench_cn β βββ answers β βββ answers_upload β βββ mmbench_dev_cn_20231003.tsv βββ mm-vet β βββ answers β βββ images # directory β βββ llava-mm-vet.jsonl β βββ results βββ pope β βββ answers β βββ coco # directory (point to COCO2014) β βββ llava_pope_test.jsonl βββ retvqa β βββ answers β βββ vg # directory (point to Visual Genome directory) β βββ retvqa_test_mirage.json βββ textvqa β βββ answers β βββ llava_textvqa_val_v051_ocr.jsonl β βββ TextVQA_0.5.1_val.json β βββ train_images # directory (download from their website) βββ visual_haystacks β βββ coco # directory (point to COCO2017) β βββ VHs_qa # directory (download from VHs' huggingface) βββ vizwiz β βββ answers β βββ answers_upload β βββ llava_test.jsonl β βββ test # directory (download from their website) βββ vqav2 βββ answers βββ answers_upload βββ llava_vqav2_mscoco_test2015.jsonl βββ llava_vqav2_mscoco_test-dev2015.jsonl βββ test2015 # directory (download from their website)
# Visual Haystacks
CUDA_VISIBLE_DEVICES=0 bash scripts/eval/{vhs_single,vhs_multi}.sh
# VQAv2, GQA, RetVQA
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/eval/{vqav2,gqa,retvqa}.sh
# Vizwiz, TextVQA, POPE, MMBench, MMBench-CN, MM-Vet
CUDA_VISIBLE_DEVICES=0 bash scripts/eval/{vizwiz,textvqa,pope,mmbench,mmbench_cn,mmvet}.sh
Checkpoint | VQAv2 | GQA | VizWiz | TextVQA | POPE | MM-Bench | MM-Bench-CN | MM-Vet |
---|---|---|---|---|---|---|---|---|
π€ tsunghanwu/mirage-llama3.1-8.3B | 76.56 | 59.13 | 40.52 | 56.24 | 85.40 | 69.24 | 66.92 | 33.4 |
- Please download the dataset from π€ tsunghanwu/MIRAGE-training-set.
- For stage-1 pre-training (training Q-Former and MLP projector), download datasets such as CC-12M, LAION-400M, and COCO.
- For stages 2 and 3 pre-training, which involve training Q-Former/MLP projector with high-quality captions and training the downstream retriever module with augmented LLaVA data, download SAM, VG, COCO, TextVQA, OCR_VQA, and GQA to
playground/data
. - For instruction tuning, download SAM, VG, COCO, TextVQA, OCR_VQA, GQA, slidevqa, and webqa to
playground/data
.
Below is the expected data structure for playground/data/eval
:
Show Data Structure
playground/data/ βββ coco β βββ annotations β βββ test2017 β βββ train2017 β βββ val2017 βββ gqa β βββ images βββ ocr_vqa β βββ images βββ sam β βββ images βββ share_textvqa β βββ images βββ slidevqa β βββ images (download from https://drive.google.com/file/d/11bsX48cPpzCfPBnYJgSesvT7rWc84LpH/view) βββ textvqa β βββ train_images βββ vg β βββ VG_100K β βββ VG_100K_2 βββ webqa βββ webqa_images (download from https://drive.google.com/drive/folders/1ApfD-RzvJ79b-sLeBx1OaiPNUYauZdAZ and convert them to .jpg format)
Run the following script with minor modifications as needed. Note: During the finetuning, we found that freezing the downstream retriever but only updating Q-Former/LLM leads to better performance on LLama-3.1-8b, whereas unfreezing the retriever yields better results on vicuna-v1.5-7b.
# Stage 1-3 Pretraining
bash scripts/pretrain_stage{1,2,3}.sh
# Instruction Finetuning
bash scripts/finetune_qformer_lora.sh
Please merge LoRA weights back to the original checkpoint using the following code:
python scripts/merge_lora_weights.py \
--model-path checkpoints/mirage_qformer_ft \
--model-base meta-llama/Meta-Llama-3.1-8B-Instruct \
--save-model-path your_output_path
We are grateful for the foundational code provided by LLaVA and LLaVA-More. Utilizing their resources implies agreement with their respective licenses. Our project benefits greatly from these contributions, and we acknowledge their significant impact on our work.
If you use our work or our implementation in this repo or find them helpful, please consider giving a citation.
@article{wu2024visual,
title={Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark},
author={Wu, Tsung-Han and Biamby, Giscard and and Quenum, Jerome and Gupta, Ritwik and Gonzalez, Joseph E and Darrell, Trevor and Chan, David M},
journal={International Conference on Learning Representations},
year={2025},
url={https://arxiv.org/abs/2407.13766}
}