This repository contains the source code of the paper titled ZmBART: An Unsupervised Cross-lingual Transfer Framework for Language Generation which has been accepted in the Findings of the Association of Computational Linguistics (ACL 2021) conference. For the implementation, we have modified the fairseq library. If you have any questions, please feel free to create a Github issue or reach out to the first author at [email protected].
All necessary downloads are available at this link.
To install all the dependencies, please run the following conda command:
conda env create --file environment.yml
conda activate py37_ZmBART
We have tested the code with Python=3.7
andPyTorch==1.8
Please install SentencePiece (SPM) from here. Make sure the binary is located at /usr/local/bin/spm_encode
.
- The ZmBART checkpoint can be downloaded from here and should be extracted to the
ZmBART/checkpoints/
directory. - The mBART.CC25 checkpoint can be downloaded and extracted to the
ZmBART/
directory using the following commands:
wget https://dl.fbaipublicfiles.com/fairseq/models/mbart/mbart.CC25.tar.gz
tar -xzvf mbart.CC25.tar.gz
- The raw training, validation, and evaluation datasets for English, Hindi, and Japanese can be downloaded from here. The model also requires an SPM tokenized as input. Instructions for converting the raw dataset to an SPM tokenized dataset can be found here and here, or you can directly download the preprocessed datasets that we used in the next point.
-The SPM tokenized training, validation, and evaluation datasets for English, Hindi, and Japanese can be downloaded from here. These should be extracted to the
ZmBART/dataset/preprocess/
directory. - Note that we added a training and validation dataset for Hindi and Japanese for few-shot training. The validation data is optional.
- For ATS task, we did joint multilingual training (see the paper for more details), so 500 monolingual datasets are augmented.
To create the training dataset for the auxiliary task, use the denoising_Script.ipynb
script. Fine-tune the mBART model with the auxiliary training dataset to obtain the ZmBART checkpoint as follows:
PRETRAIN=../mbart.cc25/model.pt
langs=ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN
SRC=en_XX
TGT=hi_IN
NAME=en-hi
DATADIR=../dataset/postprocess/auxi/en-hi
SAVEDIR=../checkpoint/checkpoint_ZmBART.pt
python -u train.py ${DATADIR} \
--arch mbart_large \
--task translation_from_pretrained_bart \
--source-lang ${SRC} \
--target-lang ${TGT} \
--criterion label_smoothed_cross_entropy \
--label-smoothing 0.2 \
--optimizer adam \
--adam-eps 1e-06 \
--lr-scheduler polynomial_decay \
--lr 3e-05 \
--warmup-updates 2500 \
--max-update 100000 \
--dropout 0.3 \
--max-tokens 2048 \
--update-freq 2 \
--save-interval 1 \
--save-interval-updates 50000 \
--keep-interval-updates 10 \
--seed 222 \
--log-interval 100 \
--restore-file $PRETRAIN \
--langs $langs \
--save-dir ${SAVEDIR} \
--fp16 \
--skip-invalid-size-inputs-valid-test \
Here, we use dummy source and target languages as en
and hi
. However, in our case, the source and target are the same language. All the shell scripts are available at ZmBART/fairseq_cli/
.
In the rest of this section, we will show the modeling pipeline to fine-tune ZmBART for the New Headline Generation (NHG) task and generate headlines in low-resource languages.
DATA=../dataset/preprocess/NHG
TRAIN=nhg_en_train
VALID=nhg_en_valid
TEST=nhg_en_test
SRC=en_XX
TGT=hi_IN
NAME=en-hi
DEST=../dataset/postprocess/NHG
DICT=../mbart.cc25/dict.txt
python -u preprocess.py \
--source-lang ${SRC} \
--target-lang ${TGT} \
--trainpref ${DATA}/${TRAIN}.spm \
--validpref ${DATA}/${VALID}.spm \
--testpref ${DATA}/${TEST}.spm \
--destdir ${DEST}/${NAME} \
--thresholdtgt 0 \
--thresholdsrc 0 \
--srcdict ${DICT} \
--tgtdict ${DICT} \
--workers 70 \
--fp16
PRETRAIN=../checkpoint/checkpoint_ZmBART.pt
langs=ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN
SRC=en_XX
TGT=hi_IN
NAME=en-hi
DATADIR=../dataset/postprocess/NHG/en-hi
SAVEDIR=../checkpoint
python -u train.py ${DATADIR} \
--arch mbart_large \
--task translation_from_pretrained_bart \
--source-lang ${SRC} \
--target-lang ${TGT} \
--criterion label_smoothed_cross_entropy \
--label-smoothing 0.2 \
--optimizer adam \
--adam-eps 1e-06 \
--lr-scheduler polynomial_decay \
--lr 3e-05 \
--warmup-updates 2500 \
--max-update 100000 \
--dropout 0.3 \
--max-tokens 2048 \
--update-freq 2 \
--save-interval 1 \
--save-interval-updates 50000 \
--keep-interval-updates 10 \
--seed 222 \
--log-interval 100 \
--restore-file $PRETRAIN \
--langs $langs \
--save-dir ${SAVEDIR} \
--fp16 \
--skip-invalid-size-inputs-valid-test \
After fine-tuning the ZmBART model with the English NHG dataset, we conducted zero-shot evaluation for the Hindi language. The details are as follows:
DATA=../dataset/preprocess/NHG
TRAIN=nhg_en_train
VALID=nhg_en_valid
TEST=nhg_hi_test
SRC=en_XX
TGT=hi_IN
NAME=en-hi
DEST=../dataset/postprocess/NHG
DICT=../mbart.cc25/dict.txt
python -u preprocess.py \
--source-lang ${SRC} \
--target-lang ${TGT} \
--trainpref ${DATA}/${TRAIN}.spm \
--validpref ${DATA}/${VALID}.spm \
--testpref ${DATA}/${TEST}.spm \
--destdir ${DEST}/${NAME} \
--thresholdtgt 0 \
--thresholdsrc 0 \
--srcdict ${DICT} \
--tgtdict ${DICT} \
--workers 70 \
--fp16
export CUDA_VISIBLE_DEVICES=7
langs=ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN
SRC=en_XX
TGT=hi_IN
NAME=en-hi
DATADIR=../dataset/postprocess/NHG/en-hi
model=../checkpoint/checkpoint_best.pt
BPEDIR=../mbart.cc25/sentence.bpe.model
PREDICTIONS_DIR=./outputs
python -u generate.py ${DATADIR} \
--path $model \
--task translation_from_pretrained_bart \
--gen-subset test \
-s en_XX \
-t hi_IN \
--bpe 'sentencepiece' \
--sentencepiece-model $BPEDIR \
--sacrebleu \
--remove-bpe 'sentencepiece'\
--max-sentences 64 \
--fp16 \
--max-len-b 100 \
--skip-invalid-size-inputs-valid-test \
--langs $langs > $PREDICTIONS_DIR/out.txt \
### Creating hypothesis and Reference files
grep ^H $PREDICTIONS_DIR/out.txt | cut -f3- > $PREDICTIONS_DIR/en.hyp.txt
grep ^T $PREDICTIONS_DIR/out.txt | cut -f2- > $PREDICTIONS_DIR/en.ref.txt
### Hindi Language Tokenization and Evaluation
python hindi_tok.py $PREDICTIONS_DIR/en.ref.txt $PREDICTIONS_DIR/en.hyp.txt
wc -l $PREDICTIONS_DIR/hindi_hyp.txt
wc -l $PREDICTIONS_DIR/hindi_ref.txt
sacrebleu -tok 'none' -s 'none' $PREDICTIONS_DIR/hindi_ref.txt < $PREDICTIONS_DIR/hindi_hyp.txt
python rouge_test.py $PREDICTIONS_DIR/hindi_hyp.txt $PREDICTIONS_DIR/hindi_ref.txt
bert-score -r $PREDICTIONS_DIR/hindi_ref.txt -c $PREDICTIONS_DIR/hindi_hyp.txt --lang hi
Step-01 to Step-03 are similar to the Zero-shot setting.
DATA=../dataset/preprocess/NHG
TRAIN=nhg_hi_train
VALID=nhg_hi_valid
TEST=nhg_hi_test
SRC=en_XX
TGT=hi_IN
NAME=en-hi
DEST=../dataset/postprocess/NHG
DICT=../mbart.cc25/dict.txt
python -u preprocess.py \
--source-lang ${SRC} \
--target-lang ${TGT} \
--trainpref ${DATA}/${TRAIN}.spm \
--validpref ${DATA}/${VALID}.spm \
--testpref ${DATA}/${TEST}.spm \
--destdir ${DEST}/${NAME} \
--thresholdtgt 0 \
--thresholdsrc 0 \
--srcdict ${DICT} \
--tgtdict ${DICT} \
--workers 70 \
--fp16
export CUDA_VISIBLE_DEVICES=4,5,6,7
PRETRAIN=../checkpoint/checkpoint_best.pt
langs=ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN
SRC=en_XX
TGT=hi_IN
NAME=en-hi
DATADIR=../dataset/postprocess/NHG/en-hi
SAVEDIR=../checkpoint
python -u train.py ${DATADIR} \
--arch mbart_large \
--task translation_from_pretrained_bart \
--source-lang ${SRC} \
--target-lang ${TGT} \
--criterion label_smoothed_cross_entropy \
--label-smoothing 0.2 \
--optimizer adam \
--adam-eps 1e-06 \
--lr-scheduler polynomial_decay \
--lr 3e-05 \
--warmup-updates 2500 \
--max-update 100000 \
--dropout 0.3 \
--max-tokens 2048 \
--update-freq 2 \
--save-interval 1 \
--save-interval-updates 50000 \
--keep-interval-updates 10 \
--seed 222 \
--log-interval 100 \
--restore-file $PRETRAIN \
--langs $langs \
--save-dir ${SAVEDIR} \
--fp16 \
--skip-invalid-size-inputs-valid-test \
langs=ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN
SRC=en_XX
TGT=hi_IN
NAME=en-hi
DATADIR=../dataset/postprocess/NHG/en-hi
model=../checkpoint/checkpoint_best.pt
BPEDIR=../mbart.cc25/sentence.bpe.model
PREDICTIONS_DIR=./outputs
python -u generate.py ${DATADIR} \
--path $model \
--task translation_from_pretrained_bart \
--gen-subset test \
-s en_XX \
-t hi_IN \
--bpe 'sentencepiece' \
--sentencepiece-model $BPEDIR \
--sacrebleu \
--remove-bpe 'sentencepiece'\
--max-sentences 64 \
--fp16 \
--max-len-b 100 \
--skip-invalid-size-inputs-valid-test \
--langs $langs > $PREDICTIONS_DIR/out.txt \
### Creating hypothesis and Reference files
grep ^H $PREDICTIONS_DIR/out.txt | cut -f3- > $PREDICTIONS_DIR/en.hyp.txt
grep ^T $PREDICTIONS_DIR/out.txt | cut -f2- > $PREDICTIONS_DIR/en.ref.txt
### Hindi Language Tokenization and Evaluation
python hindi_tok.py $PREDICTIONS_DIR/en.ref.txt $PREDICTIONS_DIR/en.hyp.txt
wc -l $PREDICTIONS_DIR/hindi_hyp.txt
wc -l $PREDICTIONS_DIR/hindi_ref.txt
sacrebleu -tok 'none' -s 'none' $PREDICTIONS_DIR/hindi_ref.txt < $PREDICTIONS_DIR/hindi_hyp.txt
python rouge_test.py $PREDICTIONS_DIR/hindi_hyp.txt $PREDICTIONS_DIR/hindi_ref.txt
bert-score -r $PREDICTIONS_DIR/hindi_ref.txt -c $PREDICTIONS_DIR/hindi_hyp.txt --lang hi
This project is licensed under the MIT License. Please see fairseq's LICENSE file for more details.
@inproceedings{maurya-etal-2021-zmbart,
title = "{Z}m{BART}: An Unsupervised Cross-lingual Transfer Framework for Language Generation",
author = "Maurya, Kaushal Kumar and
Desarkar, Maunendra Sankar and
Kano, Yoshinobu and
Deepshikha, Kumari",
booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.findings-acl.248",
doi = "10.18653/v1/2021.findings-acl.248",
pages = "2804--2818",
}