
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (2020), Alexey Dosovitskiy et al.

contributors: @GitYCC

[paper] [code]


Introduction

  • In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place.

    • We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks.
  • When trained on mid-sized datasets such as ImageNet, such models yield modest accuracies of a few percentage points below ResNets of comparable size. This seemingly discouraging outcome may be expected: Transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data.

  • However, the picture changes if the models are trained on larger datasets (14M-300M images).

  • Vision Transformer (ViT)

    • We split an image into patches and provide the sequence of linear embeddings of these patches as an input to a Transformer. Image patches are treated the same way as tokens (words) in an NLP application.
    • We train the model on image classification in a supervised fashion (both pre-training and fine-tuning).

Method: Vision Transformer (ViT)

  • patches: To handle 2D images, we reshape the image $x ∈ R^{H×W×C}$ into a sequence of flattened 2D patches $x_p ∈ R^{N×(P^2·C)}$, where $(P, P)$ is the patch resolution and $N = HW/P^2$ is the number of patches.
  • linear projection of flattened patches: The Transformer uses a constant latent vector size $D$ through all of its layers, so we flatten the patches and map them to $D$ dimensions with a trainable linear projection (see the sketch after this list).
  • position embeddings: We use standard learnable 1D position embeddings, since we have not observed significant performance gains from using more advanced 2D-aware position embeddings.
  • Hybrid Architecture: as an alternative to raw image patches, the patch embedding projection $E$ (Eq. 1) is applied to patches extracted from a CNN feature map; with spatial size $1×1$, this amounts to a $1×1$ convolution over the feature map.
  • Pre-training/Fine-tuning Fashion: we pre-train ViT on large datasets and fine-tune it on (smaller) downstream tasks.
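
Below is a minimal PyTorch sketch of the patch-embedding step (Eq. 1), assuming ViT-Base-like hyperparameters (224×224 images, 16×16 patches, $D = 768$); the module name and initialization are illustrative, not the authors' code. The trainable projection is written as a Conv2d whose kernel and stride equal the patch size, which is equivalent to flattening each patch and applying a shared linear map.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into P x P patches, project each to D dims, prepend the
    [class] token, and add learnable 1D position embeddings (Eq. 1)."""
    def __init__(self, image_size=224, patch_size=16, in_channels=3, dim=768):
        super().__init__()
        assert image_size % patch_size == 0
        self.num_patches = (image_size // patch_size) ** 2       # N = HW / P^2
        # Conv2d with kernel_size = stride = P is equivalent to flattening each
        # P x P patch and applying the trainable linear projection E.
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.pos_embed = nn.Parameter(torch.randn(1, self.num_patches + 1, dim) * 0.02)

    def forward(self, x):                         # x: (B, C, H, W)
        z = self.proj(x)                          # (B, D, H/P, W/P)
        z = z.flatten(2).transpose(1, 2)          # (B, N, D)
        cls = self.cls_token.expand(z.size(0), -1, -1)
        z = torch.cat([cls, z], dim=1)            # prepend the [class] token
        return z + self.pos_embed                 # add 1D position embeddings

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))   # (2, 197, 768)
```

The resulting token sequence is fed to a standard Transformer encoder, and the encoder output at the [class] token position is used for classification.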

Experiments

  • Setup

    • dataset
      • ILSVRC-2012 ImageNet dataset with 1k classes and 1.3M images
      • ImageNet-21k with 21k classes and 14M images
      • JFT with 18k classes and 303M high-resolution images
    • benchmark models
      • Big Transfer (BiT)
        • performs supervised transfer learning with large ResNets
        • replaces the Batch Normalization layers (Ioffe & Szegedy, 2015) with Group Normalization (Wu & He, 2018) and uses standardized convolutions (Qiao et al., 2019)
      • Noisy Student (Xie et al., 2020)
        • is a large EfficientNet trained using semi-supervised learning on ImageNet and JFT-300M with the labels removed
  • Comparison to State of the Art

  • Pre-training Data Requirements

  • Inspecting Vision Transformer

    • the patch embedding projection $E$ (Left)

      • The first layer of the Vision Transformer linearly projects the flattened patches into a lower-dimensional space (Eq. 1). Figure 7 (left) shows the top principal components of the learned embedding filters. The components resemble plausible basis functions for a low-dimensional representation of the fine structure within each patch. (A PCA sketch follows this list.)
    • similarity of position embeddings (Center)

      • closer patches tend to have more similar position embeddings
      • patches in the same row/column have similar embeddings (a cosine-similarity sketch follows this list)
    • mean attention distance (Right)

      • We compute the average distance in image space across which information is integrated, weighted by the attention weights; this “attention distance” is analogous to receptive field size in CNNs. (A sketch of the computation follows this list.)
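
The analyses of Figure 7 can be reproduced with short sketches. First, a hedged illustration of the principal-component analysis of the embedding filters $E$: `filters` is a random stand-in with the shape of ViT-Base's projection weights ($D = 768$ filters of length $P^2·C = 16·16·3 = 768$); in practice, the weights of a trained model would be used.

```python
import torch

# Stand-in for the learned patch-embedding weights E of a trained ViT-Base:
# 768 filters, each of length 16 * 16 * 3 = 768 (use a trained model's weights in practice).
filters = torch.randn(768, 768)

# Top 28 principal components of the filters (Figure 7, left);
# torch.pca_lowrank centers the data internally.
_, _, v = torch.pca_lowrank(filters, q=28)

# Each component has length P^2 * C and can be viewed as a 16 x 16 RGB patch.
components = v.T.reshape(28, 3, 16, 16)
```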
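
Next, a sketch of the position-embedding similarity analysis (Figure 7, center). Here `pos_embed` is a random stand-in for the learned 1D position embeddings of the $N = 14×14$ patches (the [class] token slot is omitted); with a trained model's embeddings, spatially close patches show higher similarity.

```python
import torch
import torch.nn.functional as F

# Stand-in for the learned 1D position embeddings of the 14 x 14 = 196 patches.
pos_embed = torch.randn(196, 768)

# Pairwise cosine similarity between the position embeddings of all patches.
sim = F.cosine_similarity(pos_embed.unsqueeze(1), pos_embed.unsqueeze(0), dim=-1)   # (196, 196)

# Figure 7 (center) shows each row reshaped onto the 14 x 14 patch grid.
sim_grid = sim.reshape(196, 14, 14)
```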
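
Finally, a sketch of the mean attention distance computation (Figure 7, right), assuming access to one layer's attention weights over the $14×14$ patch grid (the [class] token is ignored for simplicity); the function name and the random weights below are illustrative only.

```python
import torch

def mean_attention_distance(attn, grid=14, patch_size=16):
    """attn: (num_heads, N, N) attention weights over N = grid * grid patches.
    Returns the attention-weighted mean query-to-key distance (in pixels) per head."""
    coords = torch.stack(torch.meshgrid(torch.arange(grid), torch.arange(grid),
                                        indexing="ij"), dim=-1).reshape(-1, 2).float()
    dist = torch.cdist(coords, coords) * patch_size      # (N, N) pairwise patch-center distances
    # Weight each query-key distance by its attention weight, sum over keys, average over queries.
    return (attn * dist).sum(dim=-1).mean(dim=-1)        # (num_heads,)

# Example with random attention weights; a trained model's weights are used in Figure 7 (right).
attn = torch.softmax(torch.randn(12, 196, 196), dim=-1)
print(mean_attention_distance(attn))
```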