This repository contains code for evaluating compositional generalization in image captioning models. It also contains code for two state-of-the-art image captioning models and for BUTR, a joint model for caption generation and image-sentence ranking. All models are implemented in PyTorch.
It accompanies the following CoNLL 2019 paper:
Compositional Generalization in Image Captioning
Mitja Nikolaus, Mostafa Abdou, Matthew Lamm, Rahul Aralikatte and Desmond Elliott
An implementation of the Show, Attend and Tell model (adapted from a PyTorch Tutorial to Image Captioning).
An implementation of the decoder of the Bottom-Up and Top-Down Attention model. Precomputed features from the bottom-up attention model for the COCO dataset can be found on that project's GitHub page.
An implementation of BUTR, the joint model for caption generation and image-sentence ranking, which builds on the Bottom-Up and Top-Down Attention and VSE++ models.
A model has to be trained and evaluated on each of the four dataset splits. Afterwards, the resulting evaluation JSON files should be merged into a single JSON file containing the results for all 24 held-out concept pairs (a rough sketch of this merge step is shown below). The average recall@5 and other statistics can then be visualized using plot_recall_results.py.
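The exact format and names of the evaluation files depend on how the evaluation script is invoked; the following is only a hypothetical sketch of the merge step, assuming each per-split file maps a held-out concept pair to its metrics (the file names and the pair-to-metrics structure are assumptions, not the repository's actual format):

import json

# Hypothetical per-split evaluation outputs (names are placeholders).
SPLIT_RESULT_FILES = [
    "eval_heldout_pairs_split_1.json",
    "eval_heldout_pairs_split_2.json",
    "eval_heldout_pairs_split_3.json",
    "eval_heldout_pairs_split_4.json",
]

MERGED_RESULTS_FILE = "eval_heldout_pairs_all.json"


def merge_evaluation_results(result_files, output_file):
    """Merge per-split evaluation JSONs (each assumed to map a held-out
    concept pair to its metrics) into a single JSON covering all 24 pairs."""
    merged = {}
    for path in result_files:
        with open(path) as f:
            split_results = json.load(f)
        # Each concept pair is held out in exactly one split, so a plain
        # dictionary update collects all pairs without collisions.
        merged.update(split_results)
    with open(output_file, "w") as f:
        json.dump(merged, f, indent=2)
    return merged


if __name__ == "__main__":
    results = merge_evaluation_results(SPLIT_RESULT_FILES, MERGED_RESULTS_FILE)
    print(f"Merged results for {len(results)} held-out concept pairs.")

The merged file can then be passed to plot_recall_results.py to visualize the average recall@5 and the other statistics.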