This repo contains the evaluation code for the INQUIRE benchmark

INQUIRE

INQUIRE teaser figure

🌐 Homepage | 🖼️ Dataset | 🤗 HuggingFace | 📖 Paper

INQUIRE is an expert-level text-to-image retrieval benchmark designed to challenge multi-modal models.

This dataset aims to emulate real-world image retrieval and analysis problems faced by scientists working with large-scale image collections. We hope that INQUIRE will both encourage and track advancements in the real scientific utility of AI systems.

🔔 News

  • 🚀 [2024-12-17] INQUIRE was spotlighted by MIT News! Check out the article here.
  • 🚀 [2024-11-06] The paper for INQUIRE is up on arXiv! Check it out here.
  • 🚀 [2024-10-08] INQUIRE was accepted to NeurIPS 2024 (Datasets and Benchmarks Track)!
  • 🚀 [2024-06-07] INQUIRE is up!

🌟 Key Features

  • Large (5 million images) and exhaustively annotated (1-1.5k relevant images per query)
  • Queries come from experts (e.g., ecologists, biologists, ornithologists, entomologists, oceanographers)
  • Supports two-stage retrieval with CLIP models and reranking with large multimodal models.
  • Includes pre-computed embeddings and model outputs for faster evaluation.

Download

The INQUIRE benchmark and the iNaturalist 2024 dataset (iNat24) are available for public download. Please find information and download links here.

Setup

Clone the repository and navigate into it:

git clone https://github.com/inquire-benchmark/INQUIRE.git
cd INQUIRE

If you'd like, you can create a new environment in which to set up the repo:

conda create -n inquire python=3.10
conda activate inquire

Then, install the dependencies:

pip install -r requirements.txt

Our evaluations use pre-computed CLIP embeddings over iNat24. If you'd like to replicate our evaluations or just work with these embeddings, please download them here.

INQUIRE-Fullrank Evaluation

INQUIRE-Fullrank is the full-dataset retrieval task, starting from all 5 million images of iNat24. We evaluate one-stage retrieval, using similarity search with CLIP-style models, and two-stage retrieval, where after the initial retrieval, a large multi-modal model is used to rerank the images.

One-stage retrieval with CLIP-style models

To evaluate full-dataset retrieval with different CLIP-style models, you don't need the 5 million images themselves, only their embeddings. You can download our pre-computed embeddings for a variety of models here. Then, use the following command to evaluate CLIP retrieval:

python src/eval_fullrank.py --split test --k 50
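As a rough illustration of what similarity search over pre-computed embeddings involves, here is a minimal sketch using NumPy. It is not the code in src/eval_fullrank.py; the function name `top_k_by_cosine`, the random toy embeddings, and the embedding dimension are all assumptions made for the example.

```python
import numpy as np

def top_k_by_cosine(query_emb: np.ndarray, image_embs: np.ndarray, k: int = 50) -> np.ndarray:
    """Return the indices of the k images most similar to the query, best first."""
    # L2-normalize so that a dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = imgs @ q
    k = min(k, scores.shape[0])
    # argpartition selects the top k without sorting everything; then sort only those k.
    top = np.argpartition(-scores, k - 1)[:k]
    return top[np.argsort(-scores[top])]

# Toy data standing in for real image and text embeddings (hypothetical values).
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(1000, 16)).astype(np.float32)
# A "query" that is almost identical to image 42, so image 42 should rank first.
query_emb = image_embs[42] + 0.01 * rng.normal(size=16).astype(np.float32)

top = top_k_by_cosine(query_emb, image_embs, k=50)
print(top[:5])
```

With real data, `image_embs` would be the downloaded embedding matrix and `query_emb` the text embedding of a query produced by the same CLIP-style model.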

Two-stage retrieval

After the first stage, we can use large multi-modal models to rerank the top-k retrievals and improve results. This stage requires access to the iNat24 images, which you can download here. To run second-stage retrieval, use the following command:

python src/eval_fullrank_two_stage.py --split test --k 50 --from_k 50

The from_k parameter sets how many of the top CLIP retrievals are reranked with the large multi-modal model; after reranking, only the top k (50 in the command above) are kept for final evaluation. In our paper, we use from_k values of 50 and 100.
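The interplay between from_k and k can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `two_stage_rerank` and `toy_score` are hypothetical names, and the scoring function stands in for whatever relevance score a large multi-modal model would produce.

```python
from typing import Callable, Sequence

def two_stage_rerank(
    first_stage_ranking: Sequence[int],
    rerank_score: Callable[[int], float],
    from_k: int = 100,
    k: int = 50,
) -> list[int]:
    """Rerank the top from_k first-stage results and keep the top k."""
    # Only the first from_k candidates are ever shown to the reranker.
    candidates = list(first_stage_ranking[:from_k])
    # sorted() is stable, so candidates with equal scores keep their first-stage order.
    reranked = sorted(candidates, key=rerank_score, reverse=True)
    return reranked[:k]

# Toy first-stage ranking: image ids 0..199, best first (hypothetical).
first_stage = list(range(200))

def toy_score(image_id: int) -> float:
    # Hypothetical reranker: pretend ids ending in 9 are judged relevant.
    return 1.0 if image_id % 10 == 9 else 0.0

final = two_stage_rerank(first_stage, toy_score, from_k=100, k=50)
print(final[:12])
```

Note that an image ranked below from_k in the first stage (id 109 in the toy data) can never appear in the final list, no matter how the reranker would have scored it.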

INQUIRE-Rerank Evaluation

We recommend starting here, as INQUIRE-Rerank is much smaller and easier to work with. INQUIRE-Rerank is available on 🤗 HuggingFace!

INQUIRE-Rerank evaluates reranking performance by fixing an initial retrieval of 100 images for each query (from OpenCLIP's CLIP ViT-H-14-378). For each query (e.g. "A mongoose standing upright alert"), your task is to reorder the 100 images so that more of the relevant images are at the "top" of the reranked order.
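To make "more relevant images at the top" concrete, the sketch below scores an ordering with average precision, a common ranking metric. This section does not state which metric the evaluation scripts compute, so treat the choice of metric, the normalization (by the number of relevant images in the list), and the function name `average_precision` as assumptions.

```python
def average_precision(ranked_relevance: list[bool]) -> float:
    """Average precision of a ranked list of relevance labels (best rank first)."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision at this rank: relevant items seen so far / items seen so far.
            precision_sum += hits / rank
    # Normalize by the number of relevant items present in the list.
    return precision_sum / hits if hits else 0.0

# Toy example with 4 images instead of 100 (hypothetical labels).
initial_order = [False, True, False, True]   # relevant images at ranks 2 and 4
reranked_order = [True, True, False, False]  # same images, relevant ones moved up

ap_before = average_precision(initial_order)
ap_after = average_precision(reranked_order)
print(ap_before, ap_after)
```

A reranker that moves relevant images toward the front raises the score, which is the behavior the task rewards.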

Requirements

There are no extra requirements for evaluating INQUIRE-Rerank! The data will automatically download from HuggingFace if you don't already have it.

Reranking with embedding models like CLIP

Evaluate reranking performance with CLIP models:

python src/eval_rerank_with_clip.py --split test

Reranking with large multi-modal models

Evaluate reranking performance with large multi-modal models such as LLaVA-34B:

python src/eval_rerank_with_llm.py --split test

Since inference can take a long time, we've pre-computed the outputs for all large multi-modal models we work with! You can download these here.

Citation

If you use INQUIRE or find our work helpful, please consider starring our repo and citing our paper. Thanks!

@article{vendrow2024inquire,
  title={INQUIRE: A Natural World Text-to-Image Retrieval Benchmark}, 
  author={Vendrow, Edward and Pantazis, Omiros and Shepard, Alexander and Brostow, Gabriel and Jones, Kate E and Mac Aodha, Oisin and Beery, Sara and Van Horn, Grant},
  journal={NeurIPS},
  year={2024},
}
