The position of each token in a sequence is encoded using the following formula and then added on top of the token's embedding vector:

    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

A special property of this positional encoding method is that, for any fixed offset k, PE(pos + k) can be represented as a linear function of PE(pos), which [1] hypothesizes helps the model learn to attend by relative positions.
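A minimal sketch of how this encoding table can be precomputed (illustrative code, not the repository's implementation; it assumes an even `d_model`):

```python
import math

import torch


def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_len, d_model) table of sinusoidal positional encodings."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)        # (max_len, 1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )                                                                         # (d_model / 2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions use sine
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions use cosine
    return pe


# The table is added on top of the token embeddings, e.g.:
# x = token_embeddings + sinusoidal_positional_encoding(max_len, d_model)[:seq_len]
```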
In a multi-head attention sublayer, the input queries, keys, and values are each projected into `num_heads` vectors of size `d_model / num_heads`. Then, `num_heads` scaled dot-product attention operations are performed in parallel, and their outputs are concatenated and projected back to size `d_model`.
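The sketch below illustrates this flow; the class and variable names are my own for illustration and are not taken from the repository:

```python
import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    """Illustrative multi-head attention (not the repository's implementation)."""

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # One projection each for queries, keys, and values, plus the output projection.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        batch, seq_len, _ = q.shape

        # Project, then split into num_heads vectors of size d_model / num_heads.
        def split(x):
            return x.view(batch, -1, self.num_heads, self.d_head).transpose(1, 2)

        q, k, v = split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v))

        # Scaled dot-product attention, computed for all heads in parallel.
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        out = torch.softmax(scores, dim=-1) @ v

        # Concatenate the heads and project back to size d_model.
        out = out.transpose(1, 2).contiguous().view(batch, seq_len, -1)
        return self.w_o(out)
```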
Overall, this implementation almost exactly follows the architecture and parameters described in [1].
However, due to limited resources, I trained on the smaller Multi30k machine translation dataset rather than the full WMT 2014 data used in [1].
The learning rate schedule used in [1] is shown below:
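```
lrate = d_model^(-0.5) * min(step_num^(-0.5), step_num * warmup_steps^(-1.5))
```

(In [1], `warmup_steps` is set to 4000.)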
However, during my experiments, models trained using this schedule failed to achieve BLEU scores above 0.01.
Instead, I used PyTorch's `ReduceLROnPlateau` scheduler, which decreases the learning rate by `factor=0.5` every time the validation loss plateaus:
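A minimal sketch of that setup is shown below; the stand-in model and the `patience` value are assumptions for illustration, not necessarily the repository's settings:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # stand-in for the Transformer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=2  # patience=2 is an assumed value
)

for epoch in range(10):
    # ...run one epoch of training, then compute the validation loss...
    val_loss = 1.0  # placeholder; pass the real validation loss here
    scheduler.step(val_loss)  # halves the learning rate once val_loss stops improving
```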
For English-to-German translation (EN-DE), my implementation achieved a maximum BLEU score of 27.0 on the test set, which is comparable to the score of 27.3 reported in [1].
The trained model weights can be found here.
Transformers are trained using a technique called "teacher forcing", which is also used to train recurrent neural networks.
During training, the model is actually given the ground truth `tokens[:n]` as input and asked to predict the `n`-th token.
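As a toy illustration (the token IDs are made up, and this is not the repository's data pipeline), teacher forcing amounts to shifting the ground-truth sequence by one position:

```python
import torch

# Toy target sequence of token IDs: <bos>, three word tokens, <eos>.
target = torch.tensor([1, 11, 23, 7, 2])

decoder_input = target[:-1]  # at step n, the decoder has seen the ground truth tokens[:n]
labels        = target[1:]   # ...and is trained to predict the n-th ground-truth token

# With a causal mask, every position is trained in parallel: position n attends only to
# the ground truth tokens[:n], never to the model's own earlier predictions.
```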
- Install requirements

  ```
  python -m pip install -r requirements.txt
  ```

- Download spacy language pipelines

  ```
  python -m spacy download en_core_web_sm
  python -m spacy download de_core_news_sm
  ```
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. Attention Is All You Need. arXiv:1706.03762 [cs.CL]