The position of each token in a sequence is encoded using the following formula and then added on top of the token's embedding vector:

    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

A special property of this positional encoding method is that, for any fixed offset k, PE(pos + k) can be represented as a linear function of PE(pos), which [1] hypothesizes helps the model learn to attend by relative positions.
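A minimal sketch of how this encoding table can be precomputed (illustrative code, not the repository's implementation; it assumes an even `d_model`):

```python
import math

import torch


def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_len, d_model) table of sinusoidal positional encodings."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)        # (max_len, 1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )                                                                         # (d_model / 2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions use sine
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions use cosine
    return pe


# The table is added on top of the token embeddings, e.g.:
# x = token_embeddings + sinusoidal_positional_encoding(max_len, d_model)[:seq_len]
```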
In a multi-head attention sublayer, the input queries, keys, and values are each projected into `num_heads` vectors of size `d_model / num_heads`. Then, `num_heads` scaled dot-product attention operations are performed in parallel, and their outputs are concatenated and projected back to size `d_model`.
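The sketch below illustrates this flow; the class and variable names are my own for illustration and are not taken from the repository:

```python
import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    """Illustrative multi-head attention (not the repository's implementation)."""

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # One projection each for queries, keys, and values, plus the output projection.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        batch, seq_len, _ = q.shape

        # Project, then split into num_heads vectors of size d_model / num_heads.
        def split(x):
            return x.view(batch, -1, self.num_heads, self.d_head).transpose(1, 2)

        q, k, v = split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v))

        # Scaled dot-product attention, computed for all heads in parallel.
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        out = torch.softmax(scores, dim=-1) @ v

        # Concatenate the heads and project back to size d_model.
        out = out.transpose(1, 2).contiguous().view(batch, seq_len, -1)
        return self.w_o(out)
```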
Overall, this implementation almost exactly follows the architecture and parameters described in [1].
However, due to limited resources, I trained on the smaller Multi30k machine translation dataset rather than the full WMT 2014 data used in [1].
The learning rate schedule used in [1] is shown below:
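```
lrate = d_model^(-0.5) * min(step_num^(-0.5), step_num * warmup_steps^(-1.5))
```

(In [1], `warmup_steps` is set to 4000.)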
However, during my experiments, models trained using this schedule failed to achieve BLEU scores above 0.01.
Instead, I used PyTorch's `ReduceLROnPlateau` scheduler, which decreases the learning rate by `factor=0.5` every time the validation loss plateaus:
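A minimal sketch of that setup is shown below; the stand-in model and the `patience` value are assumptions for illustration, not necessarily the repository's settings:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # stand-in for the Transformer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=2  # patience=2 is an assumed value
)

for epoch in range(10):
    # ...run one epoch of training, then compute the validation loss...
    val_loss = 1.0  # placeholder; pass the real validation loss here
    scheduler.step(val_loss)  # halves the learning rate once val_loss stops improving
```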
For English-to-German translation (EN-DE), my implementation achieved a maximum BLEU score of 27.0 on the test set, which is comparable to the score of 27.3 reported in [1].
The trained model weights can be found here.
Transformers are trained using a technique called "teacher forcing", which is also used to train recurrent neural networks.
During training, the model is actually given the ground truth `tokens[:n]` as input and asked to predict the `n`-th token.
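As a toy illustration (the token IDs are made up, and this is not the repository's data pipeline), teacher forcing amounts to shifting the ground-truth sequence by one position:

```python
import torch

# Toy target sequence of token IDs: <bos>, three word tokens, <eos>.
target = torch.tensor([1, 11, 23, 7, 2])

decoder_input = target[:-1]  # at step n, the decoder has seen the ground truth tokens[:n]
labels        = target[1:]   # ...and is trained to predict the n-th ground-truth token

# With a causal mask, every position is trained in parallel: position n attends only to
# the ground truth tokens[:n], never to the model's own earlier predictions.
```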
- Install requirements

  ```
  python -m pip install -r requirements.txt
  ```

- Download spacy language pipelines

  ```
  python -m spacy download en_core_web_sm
  python -m spacy download de_core_news_sm
  ```
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. Attention Is All You Need. arXiv:1706.03762 [cs.CL]