FLAIR: VLM with Fine-grained Language-informed Image Representations

Authors: Rui Xiao, Sanghwan Kim, Mariana-Iuliana Georgescu, Zeynep Akata, Stephan Alaniz

Abstract

CLIP has shown impressive results in aligning images and texts at scale. However, its ability to capture detailed visual features remains limited because CLIP matches images and texts at a global level. To address this issue, we propose FLAIR, Fine-grained Language-informed Image Representations, an approach that utilizes long and detailed image descriptions to learn localized image embeddings. By sampling diverse sub-captions that describe fine-grained details about an image, we train our vision-language model to produce not only global embeddings but also text-specific image representations. Our model introduces text-conditioned attention pooling on top of local image tokens to produce fine-grained image representations that excel at retrieving detailed image content. We achieve state-of-the-art performance on both existing multimodal retrieval benchmarks and our newly introduced fine-grained retrieval task, which evaluates vision-language models' ability to retrieve partial image content. Furthermore, our experiments demonstrate the effectiveness of FLAIR, trained on 30M image-text pairs, in capturing fine-grained visual information, including zero-shot semantic segmentation, outperforming models trained on billions of pairs.

Methodology

Pre-trained Models

We have released the pre-trained FLAIR models on Hugging Face. The pre-trained models and their corresponding pre-training datasets are listed below:

Checkpoint             Pre-training Dataset
flair-cc3m-recap       CC3M-recap
flair-cc12m-recap      CC12M-recap
flair-yfcc15m-recap    YFCC15M-recap
flair-merged30m        Merged30M
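
If you want to fetch a checkpoint ahead of time, one option is the huggingface-cli tool that ships with the huggingface_hub package. The repo id and file name below are the ones listed in this README; the checkpoints/ target directory is just an example. Note that the inference script described below can also pull checkpoints directly via its --huggingface-repo-name and --huggingface-model-name flags, so this manual download is optional.

pip install huggingface_hub   # provides the huggingface-cli tool
huggingface-cli download xiaorui638/flair flair-cc3m-recap.pt --local-dir checkpoints/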

Dependencies

The following short tutorial helps you set up a simple Python virtual environment to run our code. Since our main dependency is OpenCLIP, which is still updated frequently, you can always check their repo for detailed instructions on creating an environment best suited to your system. A conda environment with the same Python and PyTorch versions also works.

1. Create a Virtual Environment

First, navigate to the project’s root directory flair/ and create a virtual environment using Python 3.12:

cd flair/
python3.12 -m venv flair_env

2. Activate and Navigate to src/

Activate the virtual environment and navigate to src/:

source flair_env/bin/activate
cd src/

3. Install Dependencies

Our dependencies mainly consist of open_clip_torch and open_clip_torch[training]:

pip install --upgrade pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt

The code was tested with Python 3.12, PyTorch 2.5.1, and CUDA 12.4. Since OpenCLIP is not picky about dependency versions, other up-to-date versions should also work.
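
As an optional sanity check (not part of the original instructions), you can confirm that PyTorch sees CUDA and that the open_clip package imports correctly:

python -c "import torch, open_clip; print(torch.__version__, torch.version.cuda, torch.cuda.is_available(), open_clip.__version__)"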

Dataset Preparation

Check EVAL_DATASETS.md to prepare all the inference datasets. For clarity, we provide an example datasets folder with annotation files in datasets/. However, the datasets do not all have to be stored in the same directory; you can place them wherever you like and point to them by changing the arguments in src/inference.sh.

Inference with FLAIR

To reproduce the retrieval results in the FLAIR paper, we provide an example inference bash script: src/inference.sh. Below are detailed explanations of important flags:

  • --huggingface-repo-name: Name of the Huggingface repo where the pre-trained models are stored. Should be fixed as 'xiaorui638/flair'.
  • --huggingface-model-name: Name of the pretrained models. Options include:
    • flair-cc3m-recap.pt
    • flair-cc12m-recap.pt
    • flair-yfcc15m-recap.pt
    • flair-merged30m.pt
  • --inference-with-flair: Enable this flag when using the FLAIR model.
  • --precision: Fixed as amp in our paper.
  • --workers: Adjustable according to your system.

Retrieval Tasks

Enable the following flags in src/inference.sh for the different retrieval tasks; a combined example command is sketched after the list.

  1. Standard Retrieval
    • --coco-data-root-dir: Root directory of the COCO dataset.
    • --flickr-data-root-dir: Root directory of the Flickr30k dataset.
    • --retrieval-coco: Activate the COCO retrieval task.
    • --retrieval-flickr: Activate the Flickr retrieval task.
  2. Fine-grained Retrieval
    • --iiw-retrieval-dir: Root directory of the Image-in-Words dataset.
    • --docci-retrieval-dir: Root directory of the DOCCI dataset.
    • --retrieval-iiw: Activate the Image-in-Words retrieval task.
    • --retrieval-docci: Activate the DOCCI retrieval task.
  3. Long Retrieval
    • --dci-retrieval-dir: Root directory of the DCI dataset.
    • --urban-1k-retrieval-dir: Root directory of the Urban-1K dataset.
    • --sharegpt4v-retrieval-dir: Root directory of the ShareGPT4V dataset.
    • --retrieval-dci: Activate the DCI retrieval task.
    • --retrieval-urban-1k: Activate the Urban1K retrieval task.
    • --retrieval-sharegpt4v-1k: Activate the ShareGPT4V-1K retrieval task.
    • --retrieval-sharegpt4v-10k: Activate the ShareGPT4V-10K retrieval task.
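
As a rough sketch of how these flags fit together, a standard-retrieval run on COCO and Flickr30k might look like the command below. The entry point name main is only a placeholder for whatever src/inference.sh actually invokes, and the dataset paths are placeholders as well; fine-grained and long retrieval follow the same pattern with their respective directory and --retrieval-* flags.

# Sketch only: replace "main" with the entry point used in src/inference.sh
# and the /path/to/... directories with your local dataset roots.
python -m main \
    --huggingface-repo-name 'xiaorui638/flair' \
    --huggingface-model-name flair-merged30m.pt \
    --inference-with-flair \
    --precision amp \
    --workers 8 \
    --coco-data-root-dir /path/to/coco \
    --retrieval-coco \
    --flickr-data-root-dir /path/to/flickr30k \
    --retrieval-flickr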

Acknowledgements

We thank OpenCLIP for providing the amazing code base. We also acknowledge DreamLIP and PixelProse for providing various pre-training datasets with captions from MLLMs, and we are grateful to LoTLIP for providing the detailed scheme for the long image-text retrieval task.

Citations

If you find our work useful, please cite:

@article{xiao2024flair,
  title={FLAIR: VLM with Fine-grained Language-informed Image Representations},
  author={Xiao, Rui and Kim, Sanghwan and Georgescu, Mariana-Iuliana and Akata, Zeynep and Alaniz, Stephan},
  journal={arXiv preprint arXiv:2412.03561},
  year={2024}
}
