This page contains pointers to pre-trained models as well as instructions on how to train new models for the paper.

- WMT En-De: see the "Training a new model on WMT'16 En-De" section in examples/scaling_nmt
- IWSLT De-En & WMT En-Fr: see examples/translation
We provide pre-trained models for testing.
Description | Dataset | Model | Test set(s) |
---|---|---|---|
Prime | IWSLT14 German-English | deen35.7-prime_checkpoint70.pt, deen36.2-prime_avg70.pt | IWSLT14 test: download (.tar.bz2) |
Prime | WMT16 English-German | ende_prime_avg.pt | newstest2014 (shared vocab): download (.tar.bz2) |
Prime | WMT14 English-French | enfr_prime_single_check.pt | newstest2014: download (.tar.bz2) |
Place the models into the checkpoint directory and evaluate them:
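For example, assuming the IWSLT14 archive from the table above has been saved as prime_iwslt14_deen.tar.bz2 (the archive name here is hypothetical; use the file behind the download link), it can be unpacked into checkpoint/ like this:

mkdir -p checkpoint results
# Hypothetical archive name; substitute the file obtained from the download link above.
tar -xjf prime_iwslt14_deen.tar.bz2 -C checkpoint/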
export CUDA_VISIBLE_DEVICES=0
python3 generate.py data-bin/wmt14_en_fr --path checkpoint/enfr_prime_single_check.pt --batch-size 64 --beam 5 \
--remove-bpe --lenpen 0.8 --gen-subset test --quiet > results/enfr_prime_single_check.txt
python3 generate.py data-bin/wmt14_en_fr --path checkpoint/enfr_prime_single_check.pt --batch-size 64 --beam 5 \
--remove-bpe --lenpen 0.8 --gen-subset valid --quiet > results/enfr_prime_single_check_valid.txt
export CUDA_VISIBLE_DEVICES=0
python3 generate.py data-bin/wmt16_en_de_bpe32k --path checkpoint/ende_prime_avg.pt --batch-size 128 --beam 4 --remove-bpe --lenpen 0.6 --gen-subset test --quiet > results/ende_prime_avg_test.txt
export CUDA_VISIBLE_DEVICES=0
# expect 35.7
python3 generate.py data-bin/iwslt14.tokenized.de-en --path checkpoint/prime_checkpoint70.pt --batch-size 128 --beam 5 --remove-bpe --gen-subset test --quiet > results/iwslt_deen_prime_checkpoint70_test.txt
# expect 36.2
python3 generate.py data-bin/iwslt14.tokenized.de-en --path checkpoint/prime_avg70.pt --batch-size 128 --beam 5 --remove-bpe --gen-subset test --quiet > results/iwslt_deen_prime_avg70_test.txt
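With --quiet, generate.py writes only the summary lines to the results file, so the scores can be read back directly (assuming the standard fairseq "BLEU4 = ..." summary line):

grep BLEU4 results/iwslt_deen_prime_checkpoint70_test.txt
grep BLEU4 results/iwslt_deen_prime_avg70_test.txt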
Please follow the instructions in examples/translation/README.md
to preprocess the data.
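For reference, a minimal sketch of that recipe for IWSLT14 De-En, assuming this repository keeps fairseq's examples/translation/prepare-iwslt14.sh script and top-level preprocess.py:

# Download and tokenize IWSLT14 De-En
cd examples/translation/
bash prepare-iwslt14.sh
cd ../..
# Binarize into data-bin/iwslt14.tokenized.de-en
TEXT=examples/translation/iwslt14.tokenized.de-en
python3 preprocess.py --source-lang de --target-lang en \
--trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
--destdir data-bin/iwslt14.tokenized.de-en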
For best BLEU results, the length penalty (--lenpen), the beam size, and the set of checkpoints to average may need to be tuned manually.
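One simple way to tune the length penalty, for example, is to sweep it on the validation set and keep the best value for the test set (the value range below is only illustrative):

for lenpen in 0.6 0.8 1.0 1.2; do
python3 generate.py data-bin/wmt14_en_fr --path checkpoint/enfr_prime_single_check.pt --batch-size 64 --beam 5 \
--remove-bpe --lenpen ${lenpen} --gen-subset valid --quiet > results/enfr_prime_lenpen${lenpen}_valid.txt
done
grep BLEU4 results/enfr_prime_lenpen*_valid.txt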
The training log is in examples/parallel_intersected_multi-scale_attention(Prime)/logs/enfr_muse.txt.
Training and evaluating Prime on WMT14 En-Fr using the cosine scheduler on one machine with 8 RTX 2080 Ti GPUs (11 GB):
# Training
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
save=enfr_prime_fp
blocks=12
dim=768
inner_dim=$((4*$dim))
cur_save=${save}
attn_dynamic_cat=1
attn_dynamic_type=2
kernel_size=0
python3 train.py data-bin/wmt14_en_fr \
--arch transformer_vaswani_wmt_en_fr_big --share-all-embeddings --ddp-backend=no_c10d \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-shrink 1 --max-lr 0.0007 --lr 1e-7 --min-lr 1e-9 --lr-scheduler cosine --warmup-init-lr 1e-7 \
--warmup-updates 10000 --t-mult 1 --lr-period-updates 70000 \
--dropout 0.1 --attention-dropout 0.1 --weight-dropout 0.1 --input_dropout 0.1 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-tokens 1280 --max-epoch 39 \
--combine 1 --encoder-layers ${blocks} --decoder-layers ${blocks} --encoder-embed-dim ${dim} --encoder-ffn-embed-dim ${inner_dim} --decoder-embed-dim ${dim} --decoder-ffn-embed-dim ${inner_dim} \
--attn_dynamic_type ${attn_dynamic_type} --kernel_size ${kernel_size} --attn_dynamic_cat ${attn_dynamic_cat} --attn_wide_kernels [3,15] --dynamic_gate 1 \
--update-freq 76 --fp16 --tensorboard-logdir checkpoint/${cur_save} --log-format json --save-dir checkpoint/${cur_save} 2>&1 | tee checkpoint/${cur_save}.txt
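For reference, --max-tokens is a per-GPU limit and --update-freq accumulates gradients, so each optimizer step here processes roughly 1280 × 8 × 76 ≈ 778k tokens; the 4-GPU recipe below uses a larger --max-tokens and a smaller --update-freq to reach a comparable effective batch (5120 × 4 × 32 ≈ 655k tokens per step).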
# Evaluation
export CUDA_VISIBLE_DEVICES=0
python3 generate.py data-bin/wmt14_en_fr --path checkpoint/${cur_save}/checkpoint_best.pt --batch-size 64 --beam 5 \
--remove-bpe --lenpen 0.8 --gen-subset test --quiet > results/${cur_save}.txt
python3 generate.py data-bin/wmt14_en_fr --path checkpoint/${cur_save}/checkpoint_best.pt --batch-size 64 --beam 5 \
--remove-bpe --lenpen 0.8 --gen-subset valid --quiet > results/${cur_save}_valid.txt
Training and evaluating Prime on WMT14 En-Fr using the cosine scheduler on one machine with 4 RTX TITAN GPUs (24 GB):
# Training
export CUDA_VISIBLE_DEVICES=0,1,2,3
save=enfr_prime_fp
blocks=12
dim=768
inner_dim=$((4*$dim))
cur_save=${save}
attn_dynamic_cat=1
attn_dynamic_type=2
kernel_size=0
python3 train.py data-bin/wmt14_en_fr \
--arch transformer_vaswani_wmt_en_fr_big --share-all-embeddings --ddp-backend=no_c10d \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-shrink 1 --max-lr 0.0007 --lr 1e-7 --min-lr 1e-9 --lr-scheduler cosine --warmup-init-lr 1e-7 \
--warmup-updates 10000 --t-mult 1 --lr-period-updates 70000 \
--dropout 0.1 --attention-dropout 0.1 --weight-dropout 0.1 --input_dropout 0.1 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-tokens 5120 --max-epoch 39 \
--combine 1 --encoder-layers ${blocks} --decoder-layers ${blocks} --encoder-embed-dim ${dim} --encoder-ffn-embed-dim ${inner_dim} --decoder-embed-dim ${dim} --decoder-ffn-embed-dim ${inner_dim} \
--attn_dynamic_type ${attn_dynamic_type} --kernel_size ${kernel_size} --attn_dynamic_cat ${attn_dynamic_cat} --attn_wide_kernels [3,15] --dynamic_gate 1 \
--update-freq 32 --fp16 --tensorboard-logdir checkpoint/${cur_save} --log-format json --save-dir checkpoint/${cur_save} 2>&1 | tee checkpoint/${cur_save}.txt
# Evaluation
export CUDA_VISIBLE_DEVICES=0
python3 generate.py data-bin/wmt14_en_fr --path checkpoint/${cur_save}/checkpoint_best.pt --batch-size 128 --beam 5 \
--remove-bpe --lenpen 0.8 --gen-subset valid --quiet > results/${cur_save}_valid.txt
python3 generate.py data-bin/wmt14_en_fr --path checkpoint/${cur_save}/checkpoint_best.pt --batch-size 128 --beam 5 \
--remove-bpe --lenpen 0.8 --gen-subset test --quiet > results/${cur_save}.txt
Training and evaluating Prime on WMT16 En-De using the cosine scheduler on one machine with 4 TITAN RTX GPUs:
# Training
export CUDA_VISIBLE_DEVICES=0,1,2,3
save=ende_prime_fp
blocks=12
dim=768
inner_dim=$((4*$dim))
cur_save=${save}
attn_dynamic_cat=1
attn_dynamic_type=2
kernel_size=0
python3 train.py data-bin/wmt16_en_de_bpe32k \
--arch transformer_vaswani_wmt_en_de_big --share-all-embeddings --ddp-backend=no_c10d \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-tokens 3584 \
--lr-scheduler cosine --warmup-updates 10000 --lr-shrink 1 --max-lr 0.001 --lr 1e-7 --min-lr 1e-9 --warmup-init-lr 1e-07 --t-mult 1 --lr-period-updates 20000 \
--attention-dropout 0.1 --weight-dropout 0.1 --input_dropout 0.1 \
--fp16 \
--combine 1 --encoder-layers ${blocks} --decoder-layers ${blocks} --encoder-embed-dim ${dim} --encoder-ffn-embed-dim ${inner_dim} --decoder-embed-dim ${dim} --decoder-ffn-embed-dim ${inner_dim} \
--attn_dynamic_type ${attn_dynamic_type} --kernel_size ${kernel_size} --attn_dynamic_cat ${attn_dynamic_cat} --attn_wide_kernels [3,15] --dynamic_gate 1 \
--update-freq 32 --tensorboard-logdir checkpoint/${cur_save} --log-format json --save-dir checkpoint/${cur_save} 2>&1 | tee checkpoint/${cur_save}.txt
# Evaluation
python3 generate.py data-bin/wmt16_en_de_bpe32k --path checkpoint/${cur_save}/checkpoint_best.pt --batch-size 128 --beam 5 --remove-bpe --lenpen 0.6 --gen-subset test > results/${cur_save}_checkpoint_best.txt
# Average, generate and compound splitting
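A sketch of that averaging and scoring step, assuming this repository keeps fairseq's scripts/compound_split_bleu.sh (the helper script name is taken from the fairseq scaling_nmt recipe and may differ here):

python3 average_checkpoints.py --inputs checkpoint/${cur_save} --num-epoch-checkpoints 10 --output checkpoint/${cur_save}/avg_final.pt
python3 generate.py data-bin/wmt16_en_de_bpe32k --path checkpoint/${cur_save}/avg_final.pt --batch-size 128 --beam 5 --remove-bpe --lenpen 0.6 --gen-subset test > results/${cur_save}_avg_final.txt
# Compound-split BLEU; script name assumed from fairseq's scaling_nmt example
bash scripts/compound_split_bleu.sh results/${cur_save}_avg_final.txt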
Training and evaluating Prime-simple on WMT16 En-De using the inverse_sqrt scheduler on one machine with 4 TITAN RTX GPUs:
# Training
export CUDA_VISIBLE_DEVICES=0,1,2,3
save=ende_prime_simple_fp
blocks=12
dim=768
inner_dim=$((4*$dim))
lr=0.001
cur_save=${save}
python3 train.py data-bin/wmt16_en_de_bpe32k \
--arch transformer_vaswani_wmt_en_de_big --share-all-embeddings --ddp-backend=no_c10d \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
--lr ${lr} --min-lr 1e-09 \
--dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-tokens 3584 --max-epoch 75 \
--combine 1 --encoder-layers ${blocks} --decoder-layers ${blocks} --encoder-embed-dim ${dim} --encoder-ffn-embed-dim ${inner_dim} --decoder-embed-dim ${dim} --decoder-ffn-embed-dim ${inner_dim} \
--fp16 --update-freq 40 --tensorboard-logdir checkpoint/${cur_save} --log-format json --save-dir checkpoint/${cur_save} 2>&1 | tee checkpoint/${cur_save}.txt
# Evaluation
python3 generate.py data-bin/wmt16_en_de_bpe32k --path checkpoint/${cur_save}/checkpoint_best.pt --batch-size 128 --beam 5 --remove-bpe --lenpen 0.6 --gen-subset test > results/${cur_save}_checkpoint_best.txt
# Average, generate and compound splitting
The training log is in examples/parallel_intersected_multi-scale_attention(Prime)/logs/iwslt14_de-en_log.txt, and the prediction output is in examples/parallel_intersected_multi-scale_attention(Prime)/logs/iwslt14_de-en_log.txt. The expected perplexity is around 4.6, and the BLEU score for the best checkpoint should be around 35.7.
Training and evaluating Prime on a GPU:
# Training
export CUDA_VISIBLE_DEVICES=0
save=deen_prime_fp
blocks=12
dim=384
inner_dim=$((2*$dim))
attn_dynamic_cat=1
attn_dynamic_type=2
kernel_size=0
cur_save=${save}
for seed in 1
do
python3 train.py data-bin/iwslt14.tokenized.de-en -a transformer_iwslt_de_en --optimizer adam --lr 0.001 -s de -t en --label-smoothing 0.1 --dropout 0.4 --max-tokens 4000 \
--min-lr '1e-09' --lr-scheduler inverse_sqrt --weight-decay 0.0001 \
--criterion label_smoothed_cross_entropy --max-update 20000 \
--warmup-updates 4000 --warmup-init-lr '1e-07' --update-freq 4 --keep-last-epochs 30 \
--adam-betas '(0.9, 0.98)' --save-dir checkpoint/${cur_save} \
--attn_dynamic_type ${attn_dynamic_type} --kernel_size ${kernel_size} --attn_dynamic_cat ${attn_dynamic_cat} --attn_wide_kernels [3,15] --fp16 --dynamic_gate 1 \
--combine 1 --encoder-layers ${blocks} --decoder-layers ${blocks} \
--encoder-embed-dim ${dim} --encoder-ffn-embed-dim ${inner_dim} --decoder-embed-dim ${dim} --decoder-ffn-embed-dim ${inner_dim} --seed ${seed} \
--log-format json --tensorboard-logdir checkpoint/${cur_save} 2>&1 | tee checkpoint/${cur_save}.txt
# Evaluation
python3 generate.py data-bin/iwslt14.tokenized.de-en --path checkpoint/${cur_save}/checkpoint_best.pt --batch-size 128 --beam 5 --remove-bpe > results/${cur_save}_checkpoint_best.txt
python3 average_checkpoints.py --inputs checkpoint/${cur_save} --num-epoch-checkpoints 10 --output checkpoint/${cur_save}/avg_final.pt
python3 generate.py data-bin/iwslt14.tokenized.de-en --path checkpoint/${cur_save}/avg_final.pt --batch-size 128 --beam 5 --remove-bpe > results/${cur_save}_avg_final.txt
done
Training and evaluating Prime-simple on a GPU:
# Training
export CUDA_VISIBLE_DEVICES=0
save=deen_prime_simple_fp
blocks=12
dim=384
inner_dim=$((2*$dim))
seed=1 # required by --seed below
cur_save=${save}
python3 train.py data-bin/iwslt14.tokenized.de-en -a transformer_iwslt_de_en --optimizer adam --lr 0.001 -s de -t en --label-smoothing 0.1 --dropout 0.4 --max-tokens 4000 \
--min-lr '1e-09' --lr-scheduler inverse_sqrt --weight-decay 0.0001 \
--criterion label_smoothed_cross_entropy --max-update 20000 \
--warmup-updates 4000 --warmup-init-lr '1e-07' --update-freq 4 --keep-last-epochs 30 \
--adam-betas '(0.9, 0.98)' --save-dir checkpoint/${cur_save} --fp16 \
--combine 1 --encoder-layers ${blocks} --decoder-layers ${blocks} \
--encoder-embed-dim ${dim} --encoder-ffn-embed-dim ${inner_dim} --decoder-embed-dim ${dim} --decoder-ffn-embed-dim ${inner_dim} --seed ${seed} \
--log-format json --tensorboard-logdir checkpoint/${cur_save} 2>&1 | tee checkpoint/${cur_save}.txt
# Evaluation
python3 generate.py data-bin/iwslt14.tokenized.de-en --path checkpoint/${cur_save}/checkpoint_best.pt --batch-size 128 --beam 5 --remove-bpe > results/${cur_save}_checkpoint_best.txt
python3 average_checkpoints.py --inputs checkpoint/${cur_save} --num-epoch-checkpoints 10 --output checkpoint/${cur_save}/avg_final.pt
python3 generate.py data-bin/iwslt14.tokenized.de-en --path checkpoint/${cur_save}/avg_final.pt --batch-size 128 --beam 5 --remove-bpe > results/${cur_save}_avg_final.txt
# Additional run with the bm options (outputs written under results/bm_speed)
save=deen_prime_simple_fp
export CUDA_VISIBLE_DEVICES=0
blocks=6
dim=512
inner_dim=$((2*$dim))
results_name=bm_speed
mkdir -p results/${results_name}
for seed in 1
do
cur_save=${save}_bm_s${seed}
python3 train.py data-bin/iwslt14.tokenized.de-en -a transformer_iwslt_de_en --optimizer adam --lr 0.001 -s de -t en --label-smoothing 0.1 --dropout 0.4 --max-tokens 4000 \
--min-lr '1e-09' --lr-scheduler inverse_sqrt --weight-decay 0.0001 \
--criterion label_smoothed_cross_entropy --max-update 20000 \
--warmup-updates 4000 --warmup-init-lr '1e-07' --update-freq 4 --keep-last-epochs 30 \
--adam-betas '(0.9, 0.98)' --save-dir checkpoint/${cur_save} --seed ${seed} --fp16 \
--bm 1 --bm_in_a 3 --bm_out_a 0 \
--encoder-layers ${blocks} --decoder-layers ${blocks} \
--encoder-embed-dim ${dim} --encoder-ffn-embed-dim ${inner_dim} --decoder-embed-dim ${dim} --decoder-ffn-embed-dim ${inner_dim} \
--log-format json --tensorboard-logdir checkpoint/${cur_save} 2>&1 | tee checkpoint/${cur_save}.txt
python3 generate.py data-bin/iwslt14.tokenized.de-en --path checkpoint/${cur_save}/checkpoint_best.pt --batch-size 128 --beam 5 --remove-bpe > results/${cur_save}_checkpoint_best.txt
python3 average_checkpoints.py --inputs checkpoint/$cur_save --num-epoch-checkpoints 10 --output checkpoint/$cur_save/avg_final.pt
python3 generate.py data-bin/iwslt14.tokenized.de-en --path checkpoint/$cur_save/avg_final.pt --batch-size 1 --beam 5 --remove-bpe --quiet > results/${results_name}/${cur_save}_test.txt
done