A Template-based Method for Constrained Neural Machine Translation (EMNLP 2022)

Introduction

We propose a template-based framework that can handle both lexical and structural constraints for neural machine translation. We implement our method on top of fairseq at commit d5f7b50 (Mon Sep 21 13:45:35 2020 -0700). Our main modification is to generate.py, which we update to support prefix-based inference.
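For intuition, prefix-based inference means the decoder is first forced to emit a given target-side prefix (for the lexical setting, the constraint part of the target up to [SEP]) and only afterwards generates freely. The sketch below illustrates the idea with greedy decoding and a dummy scoring function; it is a conceptual illustration, not our modified generate.py, and next_token_scores is a hypothetical stand-in for the real decoder.

# Conceptual sketch of prefix-forced decoding (not the modified generate.py).
# next_token_scores is a hypothetical stand-in for the real decoder.

def next_token_scores(prefix_tokens):
    """Return a {token: score} map for the next position (dummy model)."""
    return {"<eos>": 0.0}

def decode_with_prefix(forced_prefix, max_len=50, eos="<eos>"):
    hyp = []
    for step in range(max_len):
        if step < len(forced_prefix):
            # Inside the forced prefix: copy the given token regardless of
            # what the model itself would prefer.
            token = forced_prefix[step]
        else:
            # After the prefix has been consumed, generate freely.
            scores = next_token_scores(hyp)
            token = max(scores, key=scores.get)
        hyp.append(token)
        if step >= len(forced_prefix) and token == eos:
            break
    return hyp

# At inference time the target file only contains the constraint prefix,
# e.g. "[C0] 意大利 [C1] 方式 [C2] 增加 [SEP]", which acts as forced_prefix.
print(decode_with_prefix("[C0] 意大利 [C1] 方式 [C2] 增加 [SEP]".split()))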

Lexically Constrained Translation

Data

We use WMT17 En-Zh and WMT20 En-De as our training sets, and the En-Zh and En-De alignment test sets for evaluation. We follow Chen et al. (2021) to extract the constraints.
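The constraint files store aligned source-target pairs joined by " ||| " (see const.en-zh in the example below). As a rough illustration of how such a line can be produced from a word-aligned sentence pair, here is a minimal sketch; the actual extraction procedure of Chen et al. (2021), e.g. how pairs are sampled and filtered, may differ.

# Hypothetical sketch: build a " ||| "-separated constraint line from a
# word-aligned sentence pair. The actual procedure of Chen et al. (2021)
# may sample and filter pairs differently.
import random

def extract_constraints(src_tokens, tgt_tokens, alignment, max_constraints=3):
    """alignment: list of (src_idx, tgt_idx) pairs, as in Pharaoh "i-j" format."""
    pairs = [(src_tokens[i], tgt_tokens[j]) for i, j in alignment]
    sampled = random.sample(pairs, min(max_constraints, len(pairs)))
    # e.g. "approach ||| 方式 ||| italy ||| 意大利 |||"
    return " ".join(f"{s} ||| {t} |||" for s, t in sampled)

src = "this approach , also adopted by italy".split()
tgt = "这种 方式 也 被 意大利 所 采用".split()
alignment = [(1, 1), (6, 4)]   # approach-方式, italy-意大利 (toy alignment)
print(extract_constraints(src, tgt, alignment))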

Build the templates

We use the script prepare_template.py to convert natural language sentences into template-based ones.

Prepare the template-based training data:

python prepare_template.py en zh train.bpe train.bpe.const.en-zh t2s

Prepare the template-based inference data:

python prepare_template.py en zh test.bpe test.bpe.const.en-zh t2s.infer

Here is an example:

Filename Example
en this approach , also adopted by italy , contributes to reducing costs and increasing the overall volume of funds for beneficiaries .
zh 这种 方式 也 被 意大利 所 采用 。 它 有助于 降低成本 , 增加 受益人 收到 的 汇款 总额 。
const.en-zh italy ||| 意大利 ||| approach ||| 方式 ||| increasing ||| 增加 |||
t2s.en [C0] italy [C1] approach [C2] increasing [SEP] [S0] [C1] [S1] [C0] [S2] [C2] [S3] [SEP] [S0] this [S1] , also adopted by [S2] , contributes to reducing costs and [S3] the overall volume of funds for beneficiaries .
t2s.zh [C0] 意大利 [C1] 方式 [C2] 增加 [SEP] [T0] [C1] [T1] [C0] [T2] [C2] [T3] [SEP] [T0] 这种 [T1] 也 被 [T2] 所 采用 。 它 有助于 降低成本 , [T3] 受益人 收到 的 汇款 总额 。
t2s.infer.en [C0] italy [C1] approach [C2] increasing [SEP] [S0] [C1] [S1] [C0] [S2] [C2] [S3] [SEP] [S0] this [S1] , also adopted by [S2] , contributes to reducing costs and [S3] the overall volume of funds for beneficiaries .
t2s.infer.zh [C0] 意大利 [C1] 方式 [C2] 增加 [SEP]
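To make the format concrete, the following minimal sketch assembles a t2s source line (constraint part, template part, filled part) from a tokenized sentence and its source-side constraints. It is an illustration only, assuming single-token constraints that each occur once (matched at the leftmost position); the actual prepare_template.py additionally handles BPE, multi-token constraints, and the target side.

# Minimal sketch of building a t2s source line (illustration only; the real
# prepare_template.py may differ in tokenization and edge-case handling).

def build_t2s_source(tokens, constraints):
    """tokens: source tokens; constraints: source-side constraint words,
    indexed as [C0], [C1], ... in the order they appear in the const file."""
    # Locate each constraint in the sentence and sort hits by position.
    hits = sorted((tokens.index(c), k) for k, c in enumerate(constraints))

    const_part = " ".join(f"[C{k}] {c}" for k, c in enumerate(constraints))

    template, filled, prev = [], [], 0
    for slot, (pos, k) in enumerate(hits):
        template += [f"[S{slot}]", f"[C{k}]"]
        filled += [f"[S{slot}]"] + tokens[prev:pos]
        prev = pos + 1
    last = len(hits)
    template.append(f"[S{last}]")
    filled += [f"[S{last}]"] + tokens[prev:]

    return f"{const_part} [SEP] {' '.join(template)} [SEP] {' '.join(filled)}"

sent = ("this approach , also adopted by italy , contributes to reducing "
        "costs and increasing the overall volume of funds for beneficiaries .")
print(build_t2s_source(sent.split(), ["italy", "approach", "increasing"]))
# -> reproduces the t2s.en line shown above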

Training

Step 1: Binarize the training corpus

fairseq-preprocess \
    -s $src -t $tgt \
    --joined-dictionary --srcdict dict.${src}-${tgt}.txt \
    --trainpref train.bpe.t2s \
    --validpref valid.bpe.t2s \
    --destdir train-const-bin \
    --workers 128

Step 2: Model training

En-Zh & Zh-En
CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train train-const-bin \
    --fp16 --seed 32 --ddp-backend no_c10d \
    -s $src -t $tgt \
    --lr-scheduler inverse_sqrt --lr 0.0007 \
    --warmup-init-lr 1e-07 --warmup-updates 4000 \
    --max-update 200000 \
    --weight-decay 0.0 --clip-norm 0.0 --dropout 0.1 \
    --max-tokens 8192 --update-freq 1 \
    --arch transformer --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --save-dir ckpts \
    --tensorboard-logdir logs \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --no-progress-bar --log-format simple --log-interval 10 \
    --no-epoch-checkpoints \
    --save-interval-updates 1000 --keep-interval-updates 5 \
    |& tee -a logs/train.log

En-De & De-En
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 fairseq-train train-const-bin \
    --fp16 --seed 32 --ddp-backend no_c10d \
    -s $src -t $tgt \
    --lr-scheduler inverse_sqrt --lr 0.0005 \
    --warmup-init-lr 1e-07 --warmup-updates 4000 \
    --max-update 300000 \
    --weight-decay 0.0 --clip-norm 0.0 --dropout 0.1 \
    --max-tokens 4096 --update-freq 1 \
    --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --save-dir ckpts \
    --tensorboard-logdir logs \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --no-progress-bar --log-format simple --log-interval 10 \
    --no-epoch-checkpoints \
    --save-interval-updates 1000 --keep-interval-updates 5 \
    |& tee -a logs/train.log

Inference

Step 1: Binarize the inference data

fairseq-preprocess \
    -s $src -t $tgt \
    --joined-dictionary --srcdict dict.${src}-${tgt}.txt \
    --testpref test.bpe.t2s.infer \
    --destdir infer-const-bin

Step 2: Model inference

genout=test
CUDA_VISIBLE_DEVICES=0 fairseq-generate infer-const-bin \
    -s $src -t $tgt \
    --gen-subset test \
    --path ${src}-${tgt}.pt \
    --lenpen 1.0 --beam 4 \
    --batch-size 128 \
    --prefix-size 1024 \
    > $genout

Step 3: Post-process

We use the script restore_template.py to convert the template-based model output into natural language sentences.

grep ^H $genout |\
    sed 's/H-//g' | sort -k 1 -n -t ' ' | awk -F'\t' '{print $3}' > $genout.sort

python restore_template.py test.bpe.t2s.infer.$tgt $genout.sort $genout.bpe

cat $genout.bpe |\
    sed -r 's/(@@ )|(@@ ?$)//g' | tee $genout.tok |\
    perl mosesdecoder/scripts/tokenizer/detokenizer.perl -l $tgt > $genout.detok
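Conceptually, restoration reverses the template construction: the hypothesis consists of three [SEP]-separated parts (the [C*]-to-phrase mapping, the slot/constraint template, and the filled slots), and the sentence is recovered by walking the template and replacing each tag. Below is a minimal sketch of this step; the actual restore_template.py also reads the *.t2s.infer.$tgt file and has to cope with malformed model output.

import re

# Minimal sketch of restoring a sentence from a template-based hypothesis.
# Not the actual restore_template.py (see the note above).

def tag_map(part, prefix):
    """'[C0] 意大利 [C1] 方式' -> {'[C0]': '意大利', '[C1]': '方式'}"""
    pieces = re.split(rf"(\[{prefix}\d+\])", part)
    return {pieces[i]: pieces[i + 1].strip() for i in range(1, len(pieces) - 1, 2)}

def restore(hypothesis):
    const_part, template_part, filled_part = hypothesis.split(" [SEP] ")
    consts = tag_map(const_part, "C")    # [C*] -> constraint phrase
    slots = tag_map(filled_part, "T")    # [T*] -> slot content
    out = [consts[t] if t in consts else slots.get(t, "")
           for t in template_part.split()]
    return " ".join(piece for piece in out if piece)

hyp = ("[C0] 意大利 [C1] 方式 [C2] 增加 [SEP] "
       "[T0] [C1] [T1] [C0] [T2] [C2] [T3] [SEP] "
       "[T0] 这种 [T1] 也 被 [T2] 所 采用 。 它 有助于 降低成本 , "
       "[T3] 受益人 收到 的 汇款 总额 。")
print(restore(hyp))   # -> the natural-language zh sentence shown above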

Step 4: Evaluate

Calculate BLEU:

# English and German
cat $genout.detok | sacrebleu test.detok.$tgt | tee $genout.bleu
# Chinese
cat $genout.detok | sacrebleu -tok zh test.detok.$tgt | tee $genout.bleu

Calculate Exact Match, Window Overlap, and 1-TERm using evaluate_term_plain.py, which is adapted from this repository:

python terminology_evaluation/evaluate_term_plain.py \
    --language $tgt \
    --source test.$src \
    --target_reference test.$tgt \
    --const test.const.${src}-${tgt} \
    --hypothesis $genout.tok
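For reference, Exact Match is essentially the fraction of target-side constraint terms that appear in the hypothesis. A minimal sketch is given below, assuming the " ||| " constraint format shown earlier; the exact logic of evaluate_term_plain.py (and of Window Overlap and 1-TERm) is more involved.

# Minimal sketch of the Exact Match metric over terminology constraints.
# Assumes the "src ||| tgt ||| ..." format shown above; not the exact logic
# of evaluate_term_plain.py.

def exact_match(const_lines, hyp_lines):
    matched = total = 0
    for const_line, hyp in zip(const_lines, hyp_lines):
        fields = [f.strip() for f in const_line.split("|||") if f.strip()]
        tgt_terms = fields[1::2]          # every second field is a target term
        total += len(tgt_terms)
        matched += sum(term in hyp for term in tgt_terms)
    return matched / total if total else 0.0

consts = ["italy ||| 意大利 ||| approach ||| 方式 ||| increasing ||| 增加 |||"]
hyps = ["这种 方式 也 被 意大利 所 采用 。 它 有助于 降低成本 , 增加 受益人 收到 的 汇款 总额 。"]
print(exact_match(consts, hyps))   # 1.0 – all three target terms appear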

Structurally Constrained Translation

Data

We conduct structurally constrained translation experiments on the dataset provided by Hashimoto et al. (2019).

Build the templates

We use the script prepare_template.py to convert natural language sentences into template-based ones.

Prepare the template-based training data:

python prepare_template.py en zh train.tag.spm t2s

Prepare the template-based inference data:

python prepare_template.py en zh dev.tag.spm t2s.infer

Here is an example:

Filename Example
en Let’s look at how faceting works with the number and chart widgets that we added to our <ph> classic designer </ph> <ph> dashboard </ph> .
zh 让 我们 查看 多 面 化 如何 使用 添加到 <ph> 经 典 设计 器 </ph> <ph> 仪表板 </ph> 的 数字 和 图表 小部件 。
t2s.en [S0] <ph> [S1] </ph> [S2] <ph> [S3] </ph> [S4] [SEP] [S0] Let’s look at how faceting works with the number and chart widgets that we added to our [S1] classic designer [S2] [S3] dashboard [S4] .
t2s.zh [T0] <ph> [T1] </ph> [T2] <ph> [T3] </ph> [T4] [SEP] [T0] 让 我们 查看 多 面 化 如何 使用 添加到 [T1] 经 典 设计 器 [T2] [T3] 仪表板 [T4] 的 数字 和 图表 小部件 。
t2s.infer.en [S0] <ph> [S1] </ph> [S2] <ph> [S3] </ph> [S4] [SEP] [S0] Let ’ s look at how faceting works with the number and chart widgets that we added to our [S1] classic designer [S2] [S3] dashboard [S4] .
t2s.infer.zh [T0]
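In the structural setting the <ph> tag skeleton itself plays the role of the template: the tags are kept in the template part, and the text between them fills numbered slots. The minimal sketch below illustrates this conversion on the source side; the actual prepare_template.py may differ in tokenization and edge-case handling.

# Minimal sketch of building a t2s line for the structural setting: the tag
# skeleton (<ph> ... </ph>) becomes the template and the text between tags
# fills the numbered slots. Illustration only.

def build_structural_t2s(tokens, tags=("<ph>", "</ph>"), slot="S"):
    template, filled, buf, idx = [], [], [], 0
    for tok in tokens + [None]:            # None flushes the last segment
        if tok in tags or tok is None:
            template.append(f"[{slot}{idx}]")
            filled.append(f"[{slot}{idx}]")
            filled.extend(buf)
            buf, idx = [], idx + 1
            if tok is not None:
                template.append(tok)
        else:
            buf.append(tok)
    return " ".join(template) + " [SEP] " + " ".join(filled)

sent = ("Let’s look at how faceting works with the number and chart widgets "
        "that we added to our <ph> classic designer </ph> <ph> dashboard </ph> .")
print(build_structural_t2s(sent.split()))
# -> reproduces the t2s.en line shown above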

Training

Step 1: Binarize the training corpus

fairseq-preprocess \
    -s $src -t $tgt \
    --joined-dictionary --srcdict dict.${src}-${tgt}.txt \
    --trainpref train.tag.spm.t2s \
    --destdir train-bin \
    --workers 128

Step 2: Model training

CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train train-bin \
    --fp16 --seed 32 --ddp-backend no_c10d \
    -s $src -t $tgt \
    --lr-scheduler cosine \
    --lr 1e-07 --max-lr 7e-4 \
    --warmup-init-lr 1e-07 --warmup-updates 8000 \
    --lr-shrink 1 --lr-period-updates 32000 \
    --max-update 40000 \
    --weight-decay 0.001 --clip-norm 0.0 \
    --dropout 0.2 --attention-dropout 0.2 --activation-dropout 0.2 \
    --max-tokens 8192 --update-freq 1 \
    --arch transformer --share-all-embeddings \
    --encoder-embed-dim 256 --decoder-embed-dim 256 \
    --encoder-ffn-embed-dim 1024 --decoder-ffn-embed-dim 1024 \
    --encoder-attention-heads 4 --decoder-attention-heads 4 \
    --encoder-layers 6 --decoder-layers 6 \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --save-dir ckpts \
    --tensorboard-logdir logs \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.2 \
    --no-progress-bar --log-format simple --log-interval 10 \
    --no-epoch-checkpoints \
    --save-interval-updates 500 --keep-interval-updates 5 \
    |& tee -a logs/train.log

Inference

Step 1: Binarize the inference data

fairseq-preprocess \
    -s $src -t $tgt \
    --joined-dictionary --srcdict dict.${src}-${tgt}.txt \
    --testpref dev.tag.spm.t2s.infer \
    --destdir infer-bin

Step 2: Model inference

genout=output
CUDA_VISIBLE_DEVICES=0 fairseq-generate infer-bin \
    -s $src -t $tgt \
    --gen-subset test \
    --path ${src}-${tgt}.pt \
    --lenpen 1.0 --beam 4 \
    --prefix-size 1024 \
    --batch-size 128 > $genout

Step 3: Post-process

We use the script restore_template.py to convert the template-based model output into natural language sentences.

grep ^H $genout |\
    sed 's/H-//g' | sort -k 1 -n -t ' ' | awk -F'\t' '{print $3}' > $genout.sort
python restore_template.py $genout.sort $genout.spm
cat $genout.spm | spm_decode --model tag.spm.model > $genout.txt

Step 4: Evaluate

Prepare the json file using convert2json.py, which is released by this repository:

python convert2json.py \
    --input $genout.txt --output $genout.json \
    --lang $tgt --type translation --split dev 

Perform the evaluation using evaluate.py, which is released by this repository:

python evaluate.py --target ${src}${tgt}_${tgt}_dev.json \
    --translation $genout.json | tee $genout.res

Contact

Please email [email protected] if you have any questions, suggestions, or bug reports :)

Citation

Please cite as:

@inproceedings{Wang:2022:TemplateNMT,
  title = {A Template-based Method for Constrained Neural Machine Translation},
  author = {Wang, Shuo and Li, Peng and Tan, Zhixing and Tu, Zhaopeng and Sun, Maosong and Liu, Yang},
  booktitle = {Proceedings of EMNLP 2022},
  year = {2022},
}
