Official implementation for the paper "ClipCap: CLIP Prefix for Image Captioning"
Image captioning is a complicated task that usually relies on a pretrained detection network, which requires additional supervision in the form of object annotations. We present a new approach that does not require additional information (i.e., it requires only images and captions) and can therefore be applied to any data. In addition, our model trains much faster than similar methods while achieving results comparable to the state of the art, even on the Conceptual Captions dataset, which contains over 3M images.
In our work, we use the CLIP model, which was already trained on an extremely large number of images and is therefore capable of generating semantic encodings for arbitrary images without additional supervision. To produce meaningful sentences, we fine-tune a pretrained language model, an approach that has proven successful for other natural language tasks. The key idea is to use the CLIP encoding as a prefix to the textual captions by applying a simple mapping network to the raw encoding, and then fine-tune our language model to generate a valid caption. In addition, we present another variant, where we use a transformer architecture for the mapping network and avoid fine-tuning GPT-2. Still, our light model achieves results comparable to the state of the art on the nocaps dataset.
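The prefix idea can be illustrated with a small, dependency-free sketch. It is not the repository's implementation: the layer sizes, the tanh activation, the random weights, and the helper names (`init_linear`, `linear`, `map_to_prefix`) are assumptions chosen for illustration. It only shows how one image encoding is mapped to a fixed number of prefix vectors that are then placed in front of the caption token embeddings.

```python
import math
import random


def init_linear(in_dim, out_dim, rng):
    """Create random weights and zero biases for a toy linear layer."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(x, weights, bias):
    """Compute y = W x + b for a single input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def map_to_prefix(clip_embedding, layers, prefix_length, token_dim):
    """Map one image encoding to `prefix_length` vectors of width `token_dim`."""
    h = clip_embedding
    for i, (weights, bias) in enumerate(layers):
        h = linear(h, weights, bias)
        if i < len(layers) - 1:  # non-linearity between layers (assumed: tanh)
            h = [math.tanh(v) for v in h]
    # Split the flat output into prefix_length chunks, one per prefix position.
    return [h[i * token_dim:(i + 1) * token_dim] for i in range(prefix_length)]


# Toy sizes, chosen only to keep the example small (assumptions).
rng = random.Random(0)
clip_dim, token_dim, prefix_length = 8, 4, 3
hidden_dim = (token_dim * prefix_length) // 2
layers = [
    init_linear(clip_dim, hidden_dim, rng),
    init_linear(hidden_dim, token_dim * prefix_length, rng),
]

clip_embedding = [rng.gauss(0.0, 1.0) for _ in range(clip_dim)]
prefix = map_to_prefix(clip_embedding, layers, prefix_length, token_dim)

# Stand-in for the embedded tokens of a 5-token caption.
caption_embeddings = [[0.0] * token_dim for _ in range(5)]

# The language model would receive the prefix followed by the caption.
lm_input = prefix + caption_embeddings
print(len(lm_input), len(lm_input[0]))  # 8 4
```

In the fine-tuned variant both the mapping network and the language model would be updated during training; in the transformer variant only the mapping network would be trained.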
A couple of people standing next to an elephant. | A wooden table sitting in front of a window. | A bunch of bananas sitting on top of a table. |
A woman holding a plate with a piece of cake in front of her face. | A wooden table topped with lots of wooden utensils. | A red motorcycle parked on top of a dirt field. |
3D render of a man holding a globe. | Students enjoing the cherry blossoms | Green leaf of lettuce on a white plate. |
The hotel and casino on the waterfront. | The triangle is a symbol of the soul. | Cartoon boy in the bath. |
To help visualize the results, we provide a Colab notebook found in notebooks/clip_prefix_captioning_inference.ipynb.
The notebook will download the pretrained models and run inference on sample images or
on images of your choosing. It is recommended to run this in Google Colab.
An inference notebook for the transformer mapping network (without fine-tuning GPT-2) is available for the COCO model in notebooks/transformer_inference.ipynb.
Both COCO and Conceptual Captions pretrained models are available for the MLP mapping network. For the transformer mapping network (without fine-tuning GPT-2), we provide a COCO pretrained model.
- Run it in the browser using the replicate.ai UI.
- Integrated into Hugging Face Spaces with Gradio. See the demo (beam search is currently not supported).
Clone, create environment and install dependencies:
git clone https://github.com/rmokady/CLIP_prefix_caption && cd CLIP_prefix_caption
conda env create -f environment.yml
conda activate clip_prefix_caption
Download train_captions to data/coco/annotations.
Download the training images and validation images and unzip them (we use the Karpathy et al. split).
Extract CLIP features using (output is data/coco/oscar_split_ViT-B_32_train.pkl):
python parse_coco.py --clip_model_type ViT-B/32
Train with fine-tuning of GPT2:
python train.py --data ./data/coco/oscar_split_ViT-B_32_train.pkl --out_dir ./coco_train/
Train only transformer mapping network:
python train.py --only_prefix --data ./data/coco/oscar_split_ViT-B_32_train.pkl --out_dir ./coco_train/ --mapping_type transformer --num_layers 8 --prefix_length 40 --prefix_length_clip 40
If you wish to use ResNet-based CLIP:
python parse_coco.py --clip_model_type RN50x4
python train.py --only_prefix --data ./data/coco/oscar_split_RN50x4_train.pkl --out_dir ./coco_train/ --mapping_type transformer --num_layers 8 --prefix_length 40 --prefix_length_clip 40 --is_rn
Download the .TSV train/val files from Conceptual Captions and place them under the <data_root> directory.
Download the images and extract CLIP features using (outputs are <data_root>/conceptual_clip_ViT-B_32_train.pkl and <data_root>/conceptual_clip_ViT-B_32_val.pkl):
python parse_conceptual.py --clip_model_type ViT-B/32 --data_root <data_root> --num_threads 16
Note that downloading the images might take a few days.
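The feature file names above follow a visible pattern: the CLIP model type appears in the name with "/" replaced by "_". The helper below is a hypothetical convenience, not part of the repository, that reproduces the names listed in this README so they can be passed to `--data` without retyping. The patterns are inferred from the examples here; names for combinations not shown in this README (for instance a COCO val file) are unconfirmed.

```python
from pathlib import PurePosixPath

# File-name patterns as they appear in this README (inferred, not an official API).
PATTERNS = {
    "coco": "oscar_split_{model}_{split}.pkl",
    "conceptual": "conceptual_clip_{model}_{split}.pkl",
}


def feature_path(data_root, dataset, clip_model_type, split="train"):
    """Build the expected path of an extracted CLIP feature file."""
    # "ViT-B/32" becomes "ViT-B_32"; "RN50x4" is unchanged.
    model = clip_model_type.replace("/", "_")
    file_name = PATTERNS[dataset].format(model=model, split=split)
    return PurePosixPath(data_root) / file_name


print(feature_path("data/coco", "coco", "ViT-B/32"))
print(feature_path("data/coco", "coco", "RN50x4"))
print(feature_path("<data_root>", "conceptual", "ViT-B/32", split="val"))
```

Replace the `<data_root>` placeholder with your own directory before using the returned path.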
Train with fine-tuning of GPT2:
python train.py --data <data_root>/conceptual_clip_ViT-B_32_train.pkl --out_dir ./conceptual_train/
As with COCO training, you can train a transformer mapping network and/or parse the images using a ResNet-based CLIP.
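As a rough illustration of combining these options for Conceptual Captions, the sketch below assembles a train.py command line from the flags that appear in the COCO examples. `build_train_command` is a hypothetical helper, not part of the repository; it only uses flags shown in this README and leaves the number of transformer layers at the script's default.

```python
def build_train_command(data, out_dir, only_prefix=False, mapping_type=None,
                        prefix_length=None, prefix_length_clip=None, is_rn=False):
    """Assemble an argument list for train.py from options shown in this README."""
    cmd = ["python", "train.py"]
    if only_prefix:  # train only the mapping network, keep GPT-2 frozen
        cmd.append("--only_prefix")
    cmd += ["--data", data, "--out_dir", out_dir]
    if mapping_type is not None:
        cmd += ["--mapping_type", mapping_type]
    if prefix_length is not None:
        cmd += ["--prefix_length", str(prefix_length)]
    if prefix_length_clip is not None:
        cmd += ["--prefix_length_clip", str(prefix_length_clip)]
    if is_rn:  # features were extracted with a ResNet-based CLIP
        cmd.append("--is_rn")
    return cmd


# Transformer mapping network on Conceptual Captions features.
cmd = build_train_command(
    data="<data_root>/conceptual_clip_ViT-B_32_train.pkl",
    out_dir="./conceptual_train/",
    only_prefix=True,
    mapping_type="transformer",
    prefix_length=40,
    prefix_length_clip=40,
)
print(" ".join(cmd))
```

After replacing `<data_root>` with a real directory, the list could be passed to `subprocess.run(cmd, check=True)` or the printed line pasted into a shell.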
If you use this code for your research, please cite:
@article{mokady2021clipcap,
title={ClipCap: CLIP Prefix for Image Captioning},
author={Mokady, Ron and Hertz, Amir and Bermano, Amit H},
journal={arXiv preprint arXiv:2111.09734},
year={2021}
}
This repository is heavily based on the CLIP and Hugging Face repositories. For training we used data from the COCO dataset and Conceptual Captions.
For any inquiry please contact us at our email addresses: [email protected] or [email protected].