Transformer plays a central role in many fundamental deep learning models, e.g., ViT in computer vision and BERT and GPT in natural language processing. Its effectiveness is mainly attributed to the multi-head attention (MHA) mechanism. In this study, we propose a simple and novel channel-wise sample permutation (CSP) operator, achieving a new structured MHA with fewer parameters and lower complexity. Given an input matrix, CSP sorts grouped samples of each channel and then circularly shifts the sorted samples of different channels with various steps. This operator is equivalent to implicitly implementing cross-channel attention maps as permutation matrices, which achieves linear complexity and suppresses the risk of rank collapse when representing data. We replace the MHA of some representative models with CSP and test the CSP-based models on several discriminative tasks, including image classification and long sequence analysis. Experiments show that the CSP-based models achieve comparable or better performance with fewer parameters and lower computational costs than the classic Transformer and its state-of-the-art variants.
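The operator described above is easy to state concretely. Below is a minimal NumPy sketch of CSP under simplifying assumptions: the group size and the shift schedule (c steps for channel c) are illustrative choices, not necessarily the paper's exact configuration.

```python
import numpy as np

def csp(x, num_groups=1):
    """Illustrative sketch of channel-wise sample permutation (CSP).

    x: (n, d) matrix of n samples with d channels.  Within each group,
    every channel's samples are sorted, and the sorted samples of
    channel c are then circularly shifted by c steps.  The shift
    schedule used here (c steps for channel c) is an assumption.
    """
    n, d = x.shape
    assert n % num_groups == 0, "n must be divisible by num_groups"
    g = n // num_groups  # samples per group
    out = np.empty_like(x)
    for c in range(d):
        col = x[:, c].reshape(num_groups, g)
        col = np.sort(col, axis=1)           # sort samples within each group
        col = np.roll(col, shift=c, axis=1)  # channel-dependent circular shift
        out[:, c] = col.reshape(n)
    return out
```

Each output column is a permutation of the corresponding input column, so CSP implicitly applies a full-rank permutation matrix per channel in O(n log n) time, without any learned attention parameters.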
The CIFAR-10 and CIFAR-100 datasets can be downloaded from https://www.cs.toronto.edu/~kriz/cifar.html. The ImageNet-1k dataset can be downloaded from https://www.image-net.org/.
To run image classification experiments, modify config.py and execute one_expe.py.
cd vit
python one_expe.py
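The keys available in config.py are not listed here; as a rough illustration, an edit might set the dataset and basic training hyperparameters. All field names below are hypothetical, so check the actual file for the real options.

```python
# Hypothetical excerpt of config.py; every key name here is illustrative
# only, see the repository's actual config.py for the real settings.
dataset = "cifar10"    # e.g., cifar10, cifar100, or imagenet-1k
batch_size = 128       # per-device training batch size
epochs = 100           # number of training epochs
learning_rate = 1e-3   # optimizer learning rate
```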
This repository requires Python 3.8+ and PyTorch 1.11+.
cd lra/mega_csp
pip install -e .
Download the processed data. The original data is from the LRA repo.
To train MEGA using CSP on LRA, modify and run train_lra.sh.
bash train_lra.sh
model_name: The model to be trained; one of transformer, mega, lstm, and flash.
dataset_name: One of the six LRA tasks: listops, imdb-4000, aan, cifar10, pathfinder, and path-x.
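For example, to train MEGA with CSP on the ListOps task, the two variables could be set as follows before running the script. The exact variable layout inside train_lra.sh is an assumption; the value sets come from the options listed above.

```shell
# Hypothetical edit of train_lra.sh; variable names follow the options above.
model_name=mega        # one of: transformer, mega, lstm, flash
dataset_name=listops   # one of: listops, imdb-4000, aan, cifar10, pathfinder, path-x

bash train_lra.sh
```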