TAB: Transformer Attention Bottlenecks enable User Intervention and Debugging in Vision-Language Models
by Pooyan Rahmanzadehgervi1, Hung Huy Nguyen1, Rosanne Liu2,3, Long Mai4, Anh Totti Nguyen1
1Auburn University, 2Google DeepMind, 3ML Collective, 4Adobe Research
This repository contains the official implementation of the paper TAB: Transformer Attention Bottlenecks enable User Intervention and Debugging in Vision-Language Models.
Multi-head self-attention (MHSA) is a key component of Transformers, a widely popular architecture in both language and vision. Multiple heads intuitively enable different parallel processes over the same input. Yet, they also obscure the causal contributions of each input patch to the output of a model. We propose a novel 1-head Transformer Attention Bottleneck (TAB) layer, inserted after the traditional MHSA architecture, to serve as an attention bottleneck for interpretability and intervention. Unlike standard self-attention, TAB constrains the total attention over all patches to lie in $[0, 1]$. That is, when the total attention is 0, no visual information is propagated further into the network and the vision-language model (VLM) defaults to a generic, image-independent response. To demonstrate the advantages of TAB, we train VLMs with TAB to perform image difference captioning. Over three datasets, our models perform similarly to baseline VLMs in captioning, but the bottleneck is **superior in localizing** changes and in identifying when no changes occur. TAB is the **first architecture to enable users to intervene** and edit attention maps, often causing VLMs to produce the expected outputs.
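To make the bottleneck constraint concrete, below is a minimal PyTorch sketch of a single-head attention layer whose total attention over image patches lies in [0, 1]. This is only an illustration of one possible way to realize the constraint (a learned null slot absorbs leftover attention mass and contributes a zero value vector); it is not the official TAB layer, whose implementation in this repository may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionBottleneckSketch(nn.Module):
    """Single-head attention whose weights over the N patches sum to a value
    in [0, 1] rather than exactly 1 (illustrative sketch, not the official TAB)."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # Learned "null" key: any softmax mass it absorbs is paired with a
        # zero value vector, so that mass propagates no visual information.
        self.null_key = nn.Parameter(torch.zeros(1, 1, dim))
        self.scale = dim ** -0.5

    def forward(self, query, patches):
        # query: (B, 1, D); patches: (B, N, D)
        q = self.q_proj(query)
        k = self.k_proj(patches)
        v = self.v_proj(patches)
        k = torch.cat([k, self.null_key.expand(k.size(0), -1, -1)], dim=1)  # (B, N+1, D)
        v = torch.cat([v, torch.zeros_like(v[:, :1])], dim=1)               # (B, N+1, D)
        attn = F.softmax((q @ k.transpose(-2, -1)) * self.scale, dim=-1)    # (B, 1, N+1)
        patch_attn = attn[:, :, :-1]   # attention over real patches, sums to <= 1
        out = attn @ v                 # equals patch_attn @ patch values (null value is 0)
        return out, patch_attn
```

In this sketch, a user intervention amounts to editing `patch_attn` directly; setting it to all zeros corresponds to the "no visual information propagated" case described in the abstract.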
conda env create -f environment.yml
conda activate tab
For CLEVR-Change
The official data can be found at the Google Drive link provided by Robust Change Captioning (ICCV 2019).
Extracting this file will create the data directory:
tar -xzvf clevr_change.tar.gz
For convenience, you can also download the three JSON files from this link.
You should get the following structure:
your_data_path
|–– clevr_change/
| |–– data/
| | |–– images/
| | |–– nsc_images/
| | |–– sc_images/
| | |–– change_captions.json
| | |–– no_change_captions.json
| | |–– splits.json
| | |–– type_mapping.json
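If you want to confirm the layout before training, a small sanity check such as the one below can help. This snippet is hypothetical and not part of the repository; adjust data_root to your own path.

```python
from pathlib import Path

# Hypothetical sanity check: verify the CLEVR-Change layout shown above.
data_root = Path("your_data_path/clevr_change/data")  # adjust to your path
expected = [
    "images", "nsc_images", "sc_images",
    "change_captions.json", "no_change_captions.json",
    "splits.json", "type_mapping.json",
]
missing = [name for name in expected if not (data_root / name).exists()]
print("CLEVR-Change layout OK" if not missing else f"Missing entries: {missing}")
```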
For STD
The image pairs and captions can be downloaded by following the instructions here, provided by Learning to Describe Differences Between Pairs of Similar Images (EMNLP 2018).
You should get the following structure:
your_data_path
|–– std/
| |–– resized_images/
| |–– annotations/
Download the CLIP (ViT-B/32 and ViT-B/16) weights:
wget -P ./modules https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt
wget -P ./modules https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt
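As a quick sanity check (not part of the repository's scripts), these OpenAI CLIP checkpoints are distributed as TorchScript archives and should load with torch.jit.load if the download is intact:

```python
import torch

# Quick check that the downloaded CLIP checkpoints are readable.
for name in ["ViT-B-32.pt", "ViT-B-16.pt"]:
    model = torch.jit.load(f"./modules/{name}", map_location="cpu")
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: loaded, {n_params / 1e6:.1f}M parameters")
```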
We provide the bash scripts needed to run the 2-stage training in the scripts folder.
This repository borrows a significant part of its implementation from CLIP4IDC by Guo et al. We greatly appreciate their work, which provided a strong foundation for this project.