TAB: Transformer Attention Bottlenecks enable User Intervention and Debugging in Vision-Language Models
by Pooyan Rahmanzadehgervi1, Hung Huy Nguyen1, Rosanne Liu2,3, Long Mai4, Anh Totti Nguyen1
1Auburn University, 2Google DeepMind, 3ML Collective, 4Adobe Research
This repository contains the official implementation of the paper TAB: Transformer Attention Bottlenecks enable User Intervention and Debugging in Vision-Language Models.
Multi-head self-attention (MHSA) is a key component of Transformers, a widely popular architecture in both language and vision. Multiple heads intuitively enable different parallel processes over the same input. Yet, they also obscure the causal contributions of each input patch to the output of a model. We propose a novel 1-head Transformer Attention Bottleneck (TAB) layer, inserted after the traditional MHSA architecture, to serve as an attention bottleneck for interpretability and intervention. Unlike standard self-attention, TAB constrains the total attention over all patches to lie in $[0, 1]$. That is, when the total attention is 0, no visual information is propagated further into the network and the vision-language model (VLM) defaults to a generic, image-independent response. To demonstrate the advantages of TAB, we train VLMs with TAB to perform image difference captioning. Over three datasets, our models perform similarly to baseline VLMs in captioning, but the bottleneck is **superior in localizing** changes and in identifying when no changes occur. TAB is the **first architecture to enable users to intervene** and edit attention maps, often causing VLMs to produce the expected outputs.
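To make the bottleneck constraint concrete, below is a minimal PyTorch sketch of a single-head attention layer whose total attention over image patches lies in [0, 1]. This is only an illustration of one possible way to realize the constraint (a learned null slot absorbs leftover attention mass and contributes a zero value vector); it is not the official TAB layer, whose implementation in this repository may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionBottleneckSketch(nn.Module):
    """Single-head attention whose weights over the N patches sum to a value
    in [0, 1] rather than exactly 1 (illustrative sketch, not the official TAB)."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # Learned "null" key: any softmax mass it absorbs is paired with a
        # zero value vector, so that mass propagates no visual information.
        self.null_key = nn.Parameter(torch.zeros(1, 1, dim))
        self.scale = dim ** -0.5

    def forward(self, query, patches):
        # query: (B, 1, D); patches: (B, N, D)
        q = self.q_proj(query)
        k = self.k_proj(patches)
        v = self.v_proj(patches)
        k = torch.cat([k, self.null_key.expand(k.size(0), -1, -1)], dim=1)  # (B, N+1, D)
        v = torch.cat([v, torch.zeros_like(v[:, :1])], dim=1)               # (B, N+1, D)
        attn = F.softmax((q @ k.transpose(-2, -1)) * self.scale, dim=-1)    # (B, 1, N+1)
        patch_attn = attn[:, :, :-1]   # attention over real patches, sums to <= 1
        out = attn @ v                 # equals patch_attn @ patch values (null value is 0)
        return out, patch_attn
```

In this sketch, a user intervention amounts to editing `patch_attn` directly; setting it to all zeros corresponds to the "no visual information propagated" case described in the abstract.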
conda env create -f environment.yml
conda activate tab
For CLEVR-Change
The official data can be found at the Google Drive link provided by Robust Change Captioning (ICCV 2019).
Extracting this file will create the data directory:
tar -xzvf clevr_change.tar.gz
For convenience, you can also download the three JSON files from this link.
You should get the following structure:
your_data_path
|–– clevr_change/
| |–– data/
| | |–– images/
| | |–– nsc_images/
| | |–– sc_images/
| | |–– change_captions.json
| | |–– no_change_captions.json
| | |–– splits.json
| | |–– type_mapping.json
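If you want to confirm the layout before training, a small sanity check such as the one below can help. This snippet is hypothetical and not part of the repository; adjust data_root to your own path.

```python
from pathlib import Path

# Hypothetical sanity check: verify the CLEVR-Change layout shown above.
data_root = Path("your_data_path/clevr_change/data")  # adjust to your path
expected = [
    "images", "nsc_images", "sc_images",
    "change_captions.json", "no_change_captions.json",
    "splits.json", "type_mapping.json",
]
missing = [name for name in expected if not (data_root / name).exists()]
print("CLEVR-Change layout OK" if not missing else f"Missing entries: {missing}")
```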
For STD
The image pairs and captions can be downloaded by following the instructions here, provided by Learning to Describe Differences Between Pairs of Similar Images (EMNLP 2018).
You should get the following structure:
your_data_path
|–– std/
| |–– resized_images/
| |–– annotations/
Download the CLIP (ViT-B/32 and ViT-B/16) weights:
wget -P ./modules https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt
wget -P ./modules https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt
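As a quick sanity check (not part of the repository's scripts), these OpenAI CLIP checkpoints are distributed as TorchScript archives and should load with torch.jit.load if the download is intact:

```python
import torch

# Quick check that the downloaded CLIP checkpoints are readable.
for name in ["ViT-B-32.pt", "ViT-B-16.pt"]:
    model = torch.jit.load(f"./modules/{name}", map_location="cpu")
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: loaded, {n_params / 1e6:.1f}M parameters")
```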
We provide the bash scripts needed to run the 2-stage training in the scripts folder.
This repository borrows a significant part of its implementation from CLIP4IDC by Guo et al. We greatly appreciate their work, which provided a strong foundation for this project.