An implementation of Global Self-Attention Network, which proposes an all-attention vision backbone that achieves better results than convolutions with less parameters and compute.
They use a previously discovered linear attention variant with a small modification for further gains (no normalization of the queries), paired with relative positional attention, computed axially for efficiency.
The result is an extremely simple circuit composed of 8 einsums, 1 softmax, and normalization.
$ pip install gsa-pytorch
import torch
from gsa_pytorch import GSA
gsa = GSA(
dim = 3,
dim_out = 64,
dim_key = 32,
heads = 8,
rel_pos_length = 256 # in paper, set to max(height, width). you can also turn this off by omitting this line
)
x = torch.randn(1, 3, 256, 256)
gsa(x) # (1, 64, 256, 256)
@inproceedings{
anonymous2021global,
title={Global Self-Attention Networks},
author={Anonymous},
booktitle={Submitted to International Conference on Learning Representations},
year={2021},
url={https://openreview.net/forum?id=KiFeuZu24k},
note={under review}
}