- tensorflow (1.5)
- pyyaml
- sentencepiece
Please follow the instructions to install and build SentencePiece. Once it is installed, remember to update the `SP_PATH` variable in the scripts.
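Before editing the scripts, it can help to confirm that the SentencePiece build produced the expected command-line tools. The following is a minimal sketch, not part of the provided scripts: the helper name `check_sp_path` is hypothetical, and it assumes that `SP_PATH` should point to a directory containing the `spm_train`, `spm_encode`, and `spm_decode` executables.

```shell
# Hypothetical helper: verify that a candidate SP_PATH directory contains
# the SentencePiece command-line tools (assumed set: train, encode, decode).
check_sp_path() {
  local sp_path="$1"
  local tool
  for tool in spm_train spm_encode spm_decode; do
    if [ ! -x "$sp_path/$tool" ]; then
      echo "missing or not executable: $sp_path/$tool" >&2
      return 1
    fi
  done
  return 0
}

# Example usage (the path below is a placeholder, adjust it to your build):
# check_sp_path "$HOME/sentencepiece/build/src" && echo "SP_PATH looks usable"
```

If the check fails, the printed path shows which tool is missing from the directory you intend to use as `SP_PATH`.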
Before running the script, review the links used to download the datasets. Depending on the task, you may need to change the file names and folder paths.
cd scripts/wmt
./prepare_data.sh
The script trains a single SentencePiece model with a vocabulary shared between source and target, then tokenizes the dataset and prepares the train/valid/test files.
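To illustrate what a shared vocabulary means in practice, here is a rough sketch of the two SentencePiece steps involved. It is not the content of `prepare_data.sh`: the function names, file names, model prefix, and vocabulary size are assumptions chosen for the example. Only the `spm_train` and `spm_encode` flags shown are standard SentencePiece options.

```shell
# Sketch only: train ONE model on both languages so that source and target
# share the same subword vocabulary.
train_shared_model() {
  local sp_path="$1"     # directory containing the SentencePiece tools
  local src="$2"         # source-language training text
  local tgt="$3"         # target-language training text
  local prefix="$4"      # output prefix: writes <prefix>.model and <prefix>.vocab
  local vocab_size="$5"  # size of the shared vocabulary
  "$sp_path/spm_train" \
    --input="$src,$tgt" \
    --model_prefix="$prefix" \
    --vocab_size="$vocab_size" \
    --character_coverage=1
}

# Sketch only: tokenize one file into subword pieces with the trained model.
encode_file() {
  local sp_path="$1"
  local prefix="$2"
  local in_file="$3"
  local out_file="$4"
  "$sp_path/spm_encode" --model="$prefix.model" --output_format=piece \
    < "$in_file" > "$out_file"
}

# Example usage (all values are placeholders):
# train_shared_model "$SP_PATH" train.en train.de wmtende 32000
# encode_file "$SP_PATH" wmtende train.en train.en.sp
```

Passing both training files to a single `--input` list is what produces one vocabulary for both languages. The same model is then used to encode the source and target sides.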
cd scripts/wmt
./run_wmt_ende.sh
By default, training runs on 4 GPUs for 200,000 steps. Both settings can be changed in `wmt-ende.yml`.
cd scripts/wmt
./eval_wmt_ende.sh
- Pre-tokenized SentencePiece dataset
- Pre-trained averaged model:
This model achieved the following scores:
Test set | NIST BLEU |
---|---|
newstest2014 | 26.9 |
newstest2017 | 28.0 |