(Samples from the COCO Caption dataset. Image credit: https://arxiv.org/pdf/1504.00325.pdf)
The Microsoft COCO Captions dataset contains over one and a half million captions describing over 330,000 images. For the training and validation images, five independent human-generated captions are provided for each image.
Cross-modal retrieval covers two tasks: (1) image-to-text retrieval: given an image as the query, retrieve matching texts from a gallery; (2) text-to-image retrieval: given a text as the query, retrieve matching images from a gallery.
The common metric is Recall@K, which denotes the fraction of queries for which a correct match appears among the top-K retrieved results.
We use TR to denote the image-to-text retrieval recall score and IR to denote the text-to-image retrieval recall score.
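As a concrete illustration, the sketch below computes Recall@K from a precomputed query-gallery similarity matrix. The function name, toy similarity scores, and ground-truth mapping are made up for the example; on COCO, each image query has five matching captions while each caption query has exactly one matching image.

```python
import numpy as np

def recall_at_k(sim, gt, k):
    """Fraction of queries whose ground truth appears in the top-k results.

    sim: (num_queries, num_gallery) similarity matrix.
    gt:  list of sets; gt[i] holds the gallery indices matching query i
         (5 captions per image query, 1 image per caption query on COCO).
    """
    topk = np.argsort(-sim, axis=1)[:, :k]              # indices of the k most similar gallery items
    hits = [len(gt[i] & set(topk[i])) > 0 for i in range(sim.shape[0])]
    return float(np.mean(hits))

# Toy example: 2 image queries, 4 captions in the gallery.
sim = np.array([[0.9, 0.1, 0.3, 0.2],
                [0.2, 0.8, 0.1, 0.7]])
gt = [{0, 2}, {1, 3}]                                   # matching caption indices per image
print(recall_at_k(sim, gt, k=1))                        # TR@1 on the toy data
```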
(Ranked by TR@1.)
Rank | Model | TR@1 | TR@5 | TR@10 | IR@1 | IR@5 | IR@10 | Resources |
---|---|---|---|---|---|---|---|---|
1 | BLIP | 82.4 | 95.4 | 97.9 | 65.1 | 86.3 | 91.8 | paper, code, demo, blog |
2 | X-VLM | 81.2 | 95.6 | 98.2 | 63.4 | 85.8 | 91.5 | paper, code |
3 | ALBEF | 77.6 | 94.3 | 97.2 | 60.7 | 84.3 | 90.5 | paper, code, blog |
4 | ALIGN | 77.0 | 93.5 | 96.9 | 59.9 | 83.3 | 89.8 | paper |
5 | VinVL | 75.4 | 92.9 | 96.2 | 58.8 | 83.5 | 90.3 | paper, code |
6 | OSCAR | 73.5 | 92.2 | 96.0 | 57.5 | 82.8 | 89.8 | paper, code |
7 | UNITER | 65.7 | 88.6 | 93.8 | 52.9 | 79.9 | 88.0 | paper, code |
To download the COCO images and caption annotations with LAVIS, run the provided download script:

```bash
cd lavis/datasets/download_scripts && python download_coco.py
```
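After downloading, the splits can be loaded through LAVIS's dataset builders. The snippet below is a minimal sketch assuming the `load_dataset` helper exposed by `lavis.datasets.builders` and the `coco_caption` dataset key; the exact split names and sample fields may differ across LAVIS versions.

```python
from lavis.datasets.builders import load_dataset

# Load the COCO Caption splits registered with LAVIS (assumes the download
# script above has already placed images and annotations in the LAVIS cache).
coco_dataset = load_dataset("coco_caption")

print(coco_dataset.keys())         # available splits, e.g. train / val / test
sample = coco_dataset["train"][0]  # one image-caption pair
print(sample)
```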
"Microsoft COCO Captions: Data Collection and Evaluation Server", Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, C. Lawrence Zitnick