The intention of this repo is to build a full Vision Transformer in PyTorch from scratch, based on the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (https://arxiv.org/abs/2010.11929).
This is solely for my own education, to learn about transformers in computer vision, but I share the repo in case someone else benefits from it.
mini-ViT.py is a standalone, ~350-line implementation of a transformer for MNIST handwriting recognition. It uses 16 layers of multi-head self-attention with 14 heads and a head dimension of 56. The embedding size is 784, which is the flattened size of MNIST's 28x28 images. With this embedding size (= image size), no projection layer is needed and the images are passed directly to the first transformer block.
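For orientation, here is a minimal sketch of that configuration using PyTorch's built-in transformer layers. The single-token sequence (the whole flattened image as one embedding), the class names and the classifier head are assumptions for illustration, not necessarily how mini-ViT.py implements it.

```python
import torch
import torch.nn as nn

class MiniViTSketch(nn.Module):
    def __init__(self, img_size=28, n_layer=16, n_head=14, n_classes=10):
        super().__init__()
        n_embd = img_size * img_size          # 784, head dimension = 784 / 14 = 56
        layer = nn.TransformerEncoderLayer(
            d_model=n_embd, nhead=n_head, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layer)
        self.head = nn.Linear(n_embd, n_classes)

    def forward(self, x):                     # x: (B, 1, 28, 28)
        tokens = x.flatten(1).unsqueeze(1)    # (B, 1, 784): no projection layer
        out = self.blocks(tokens)             # (B, 1, 784)
        return self.head(out[:, 0])           # (B, 10) class logits
```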
This network architecture is overkill at ~100M parameters, but it reaches ~98.5% accuracy on the test set and doesn't overfit, so I'll just leave it at that. For anything serious this should be drastically optimized and should work with a fraction of the parameters.
Experiment where each transformer block is fed a concatenation of the original (flattened) input image with the previous block's output (or, for the first transformer block, with the output of a linear layer).
This is roughly the same as mini-ViT.py, but due to the concatenation with the original input the embedding size is 2x img_size (2*784=1568) instead. To match the larger embedding size, the number of heads is increased to 16 and the head dimension to 98 (16*98=1568). Additionally, each transformer block is followed by a linear projection layer with img_size dimension (784) instead of n_embd (1568), so the original input can be concatenated with the block's output again.
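A minimal sketch of this concatenation scheme, again using nn.TransformerEncoderLayer as a stand-in for the repo's own blocks; the module names, the single-token sequence and the exact wiring of the first linear layer are assumptions.

```python
import torch
import torch.nn as nn

class ConcatViTSketch(nn.Module):
    def __init__(self, img_size=28, n_layer=16, n_head=16, n_classes=10):
        super().__init__()
        d_img = img_size * img_size              # 784
        n_embd = 2 * d_img                       # 1568, head dimension = 1568 / 16 = 98
        self.first_proj = nn.Linear(d_img, d_img)   # provides the "previous output" for block 1
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model=n_embd, nhead=n_head,
                                        batch_first=True) for _ in range(n_layer)])
        # project each block's 1568-dim output back to 784 so it can be
        # concatenated with the original image again
        self.down = nn.ModuleList(
            [nn.Linear(n_embd, d_img) for _ in range(n_layer)])
        self.head = nn.Linear(d_img, n_classes)

    def forward(self, x):                        # x: (B, 1, 28, 28)
        img = x.flatten(1)                       # (B, 784) original flattened image
        h = self.first_proj(img)                 # fed to the first block
        for block, down in zip(self.blocks, self.down):
            t = torch.cat([img, h], dim=-1).unsqueeze(1)   # (B, 1, 1568)
            h = down(block(t))[:, 0]                       # back to (B, 784)
        return self.head(h)
```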
Results didn't improve, though.
This experiment uses the same architecture as mini-ViT but adds data augmentation. The augmentation is done in a DataLoader with multiple workers to keep the GPU busy.
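Something along these lines; the concrete transforms are an assumption (random affine jitter is a common choice for MNIST), only the multi-worker DataLoader is stated above.

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_tf = transforms.Compose([
    transforms.RandomAffine(degrees=10, translate=(0.1, 0.1), scale=(0.9, 1.1)),
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),   # standard MNIST statistics
])

train_set = datasets.MNIST("data", train=True, download=True, transform=train_tf)
# several workers apply the augmentation in parallel so the GPU stays busy
train_loader = DataLoader(train_set, batch_size=128, shuffle=True,
                          num_workers=4, pin_memory=True)
```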
This reaches a decent 99.3% accuracy.
This experiment tests an encoder-decoder architecture with cross-attention, like in the original Transformer paper. Each decoder layer has a block of self-attention and a block of cross-attention. Each cross-attention block has its own encoder block with self-attention that is fed the original image. This is based on mini-ViT_augment and uses the same data augmentation strategy.
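A hedged sketch of this layout, built from PyTorch's nn.TransformerEncoderLayer and nn.TransformerDecoderLayer (the latter already combines self-attention and cross-attention) as stand-ins for the repo's custom blocks; the names and the single-token sequence are assumptions.

```python
import torch
import torch.nn as nn

class CrossAttnViTSketch(nn.Module):
    def __init__(self, img_size=28, n_layer=16, n_head=14, n_classes=10):
        super().__init__()
        n_embd = img_size * img_size             # 784
        self.encoders = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model=n_embd, nhead=n_head,
                                        batch_first=True) for _ in range(n_layer)])
        self.decoders = nn.ModuleList(
            [nn.TransformerDecoderLayer(d_model=n_embd, nhead=n_head,
                                        batch_first=True) for _ in range(n_layer)])
        self.head = nn.Linear(n_embd, n_classes)

    def forward(self, x):                        # x: (B, 1, 28, 28)
        img = x.flatten(1).unsqueeze(1)          # (B, 1, 784) original image as one token
        h = img
        for enc, dec in zip(self.encoders, self.decoders):
            memory = enc(img)                    # each layer's own encoder sees the raw image
            h = dec(h, memory)                   # self-attention + cross-attention on memory
        return self.head(h[:, 0])
```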
At 99.3% accuracy, it didn't improve over the pure self-attention variant.