TAB: Transformer Attention Bottlenecks enable User Intervention and Debugging in Vision-Language Models

by Pooyan Rahmanzadehgervi¹, Hung Huy Nguyen¹, Rosanne Liu²,³, Long Mai⁴, Anh Totti Nguyen¹

¹Auburn University, ²Google DeepMind, ³ML Collective, ⁴Adobe Research

This repository contains the official implementation of the paper TAB: Transformer Attention Bottlenecks enable User Intervention and Debugging in Vision-Language Models.

Abstract

Multi-head self-attention (MHSA) is a key component of Transformers, a widely popular architecture in both language and vision. Multiple heads intuitively enable different parallel processes over the same input. Yet, they also obscure the causal contributions of each input patch to the output of a model. We propose a novel 1-head Transformer Attention Bottleneck (TAB) layer, inserted after the traditional MHSA architecture, to serve as an attention bottleneck for interpretability and intervention. Unlike standard self-attention, TAB constrains the total attention over all patches to lie in [0, 1]. That is, when the total attention is 0, no visual information is propagated further into the network and the vision-language model (VLM) defaults to a generic, image-independent response. To demonstrate the advantages of TAB, we train VLMs with TAB to perform image difference captioning. Over three datasets, our models perform similarly to baseline VLMs in captioning, but the bottleneck is superior in localizing changes and in identifying when no changes occur. TAB is the first architecture to enable users to intervene and edit attention maps, often leading VLMs to produce the expected outputs.
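
For intuition, the sketch below illustrates one way such a bottleneck constraint can be realized: a single attention head with a learnable "null" token that absorbs attention mass, so the total attention over real patches always lies in [0, 1]. This is a hedged illustration only, not necessarily the exact layer implemented in this repository.

```python
# Minimal sketch of a 1-head attention bottleneck whose total attention over
# image patches lies in [0, 1]. Illustration only: the null-token trick is one
# way to realize the constraint and is NOT necessarily the TAB formulation
# used in this repository.
import torch
import torch.nn as nn


class AttentionBottleneckSketch(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # Learnable "null" key: mass assigned to it is attention spent on nothing.
        self.null_key = nn.Parameter(torch.zeros(1, 1, dim))
        self.scale = dim ** -0.5

    def forward(self, query: torch.Tensor, patches: torch.Tensor):
        # query:   (B, Tq, D) query tokens
        # patches: (B, Np, D) visual patch features from the MHSA backbone
        q = self.q_proj(query)
        k = self.k_proj(patches)
        v = self.v_proj(patches)
        null_k = self.null_key.expand(k.size(0), -1, -1)
        # Softmax over [patches, null]; weights over real patches sum to <= 1.
        logits = torch.einsum("bqd,bkd->bqk", q, torch.cat([k, null_k], dim=1)) * self.scale
        weights = logits.softmax(dim=-1)
        patch_attn = weights[..., :-1]          # (B, Tq, Np), total in [0, 1]
        out = torch.einsum("bqk,bkd->bqd", patch_attn, v)
        # If patch_attn sums to ~0, `out` carries (almost) no visual information.
        return out, patch_attn
```

The softmax mass captured by the null token is attention spent on nothing, so the remaining mass over real patches sums to a value in [0, 1] and can reach 0, at which point no visual information passes through the bottleneck.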

Requirements

conda env create -f environment.yml
conda activate tab

Data Preparation

For CLEVR-Change

The official data can be found here: a Google Drive link provided by Robust Change Captioning (ICCV 2019).

Extracting this file will create the data directory.

tar -xzvf clevr_change.tar.gz

For convenience, you can also download the three JSON files from this link.

You should get the following directory structure (a quick sanity check is sketched after the tree):

your_data_path
|–– clevr_change/
|   |–– data/
|   |   |–– images/
|   |   |–– nsc_images/
|   |   |–– sc_images/
|   |   |–– change_captions.json
|   |   |–– no_change_captions.json
|   |   |–– splits.json
|   |   |–– type_mapping.json
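
After downloading, you can optionally verify the layout with a short script such as the one below (your_data_path is a placeholder for your own data root):

```python
# Optional sanity check for the CLEVR-Change layout above.
# "your_data_path" is a placeholder; point it at your own data root.
from pathlib import Path

root = Path("your_data_path") / "clevr_change" / "data"
expected_dirs = ["images", "nsc_images", "sc_images"]
expected_files = ["change_captions.json", "no_change_captions.json",
                  "splits.json", "type_mapping.json"]

missing = [d for d in expected_dirs if not (root / d).is_dir()]
missing += [f for f in expected_files if not (root / f).is_file()]
print("All CLEVR-Change files found." if not missing else f"Missing: {missing}")
```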

For STD

The image pairs and captions can be downloaded using the instructions here, provided by Learning to Describe Differences Between Pairs of Similar Images (EMNLP 2018).

You should get the following directory structure:

your_data_path
|–– std/
|   |–– resized_images/
|   |–– annotations/

Download the CLIP (ViT-B/32 and ViT-B/16) weights:

wget -P ./modules https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt
wget -P ./modules https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt
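
To confirm the downloads are intact, you can try loading them. This is a hedged check, assuming the files are the TorchScript (.pt) archives distributed by OpenAI at the URLs above:

```python
# Quick check that the downloaded CLIP weights load correctly.
# Assumes the files are the TorchScript (.pt) archives from the URLs above.
import torch

for name in ("ViT-B-32.pt", "ViT-B-16.pt"):
    torch.jit.load(f"./modules/{name}", map_location="cpu")
    print(name, "loaded OK")
```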

Training

We provide the bash scripts needed to run the two-stage training in the scripts folder.

Acknowledgment

This repository borrows a significant part of its implementation from CLIP4IDC by Guo et al. We greatly appreciate their work, which provided a strong foundation for this project.
