ZmBART: An Unsupervised Cross-lingual Transfer Framework for Language Generation

Zero-shot Sample Generation from ZmBART for NHG Task in Hindi

About

This repository contains the source code of the paper titled ZmBART: An Unsupervised Cross-lingual Transfer Framework for Language Generation which has been accepted in the Findings of the Association of Computational Linguistics (ACL 2021) conference. For the implementation, we have modified the fairseq library. If you have any questions, please feel free to create a Github issue or reach out to the first author at [email protected].

All necessary downloads are available at this link.

Installation Instructions

To install all the dependencies, please run the following conda command:

conda env create --file environment.yml
conda activate py37_ZmBART

We have tested the code with Python=3.7 andPyTorch==1.8

Please install SentencePiece (SPM) from here. Make sure the binary is located at /usr/local/bin/spm_encode.

Other Setups

The ZmBART checkpoint can be downloaded from here and should be extracted to the ZmBART/checkpoints/ directory.
The mBART.CC25 checkpoint can be downloaded and extracted to the ZmBART/ directory using the following commands:

wget https://dl.fbaipublicfiles.com/fairseq/models/mbart/mbart.CC25.tar.gz
tar -xzvf mbart.CC25.tar.gz

The raw training, validation, and evaluation datasets for English, Hindi, and Japanese can be downloaded from here. The model also requires an SPM tokenized as input. Instructions for converting the raw dataset to an SPM tokenized dataset can be found here and here, or you can directly download the preprocessed datasets that we used in the next point. -The SPM tokenized training, validation, and evaluation datasets for English, Hindi, and Japanese can be downloaded from here. These should be extracted to the ZmBART/dataset/preprocess/ directory.
Note that we added a training and validation dataset for Hindi and Japanese for few-shot training. The validation data is optional.
For ATS task, we did joint multilingual training (see the paper for more details), so 500 monolingual datasets are augmented.

Training and Generation

Step-1: ZmBART Checkpoint

To create the training dataset for the auxiliary task, use the denoising_Script.ipynb script. Fine-tune the mBART model with the auxiliary training dataset to obtain the ZmBART checkpoint as follows:

PRETRAIN=../mbart.cc25/model.pt
langs=ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN
SRC=en_XX
TGT=hi_IN
NAME=en-hi
DATADIR=../dataset/postprocess/auxi/en-hi
SAVEDIR=../checkpoint/checkpoint_ZmBART.pt

python -u train.py ${DATADIR} \
--arch mbart_large \
--task translation_from_pretrained_bart \
--source-lang ${SRC} \
--target-lang ${TGT} \
--criterion label_smoothed_cross_entropy \
--label-smoothing 0.2  \
--optimizer adam \
--adam-eps 1e-06 \
--lr-scheduler polynomial_decay \
--lr 3e-05 \
--warmup-updates 2500 \
--max-update 100000 \
--dropout 0.3 \
--max-tokens 2048 \
--update-freq 2 \
--save-interval 1 \
--save-interval-updates 50000 \
--keep-interval-updates 10 \
--seed 222 \
--log-interval 100 \
--restore-file $PRETRAIN \
--langs $langs \
--save-dir ${SAVEDIR} \
--fp16 \
--skip-invalid-size-inputs-valid-test \

Here, we use dummy source and target languages as en and hi. However, in our case, the source and target are the same language. All the shell scripts are available at ZmBART/fairseq_cli/.

In the rest of this section, we will show the modeling pipeline to fine-tune ZmBART for the New Headline Generation (NHG) task and generate headlines in low-resource languages.

Step-2: Binarization of NHG English training data

DATA=../dataset/preprocess/NHG
TRAIN=nhg_en_train
VALID=nhg_en_valid
TEST=nhg_en_test
SRC=en_XX
TGT=hi_IN
NAME=en-hi
DEST=../dataset/postprocess/NHG
DICT=../mbart.cc25/dict.txt

python -u preprocess.py \
--source-lang ${SRC} \
--target-lang ${TGT} \
--trainpref ${DATA}/${TRAIN}.spm \
--validpref ${DATA}/${VALID}.spm \
--testpref ${DATA}/${TEST}.spm  \
--destdir ${DEST}/${NAME} \
--thresholdtgt 0 \
--thresholdsrc 0 \
--srcdict ${DICT} \
--tgtdict ${DICT} \
--workers 70 \
--fp16

Step-3: Fine-tuning of ZmBART

PRETRAIN=../checkpoint/checkpoint_ZmBART.pt
langs=ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN
SRC=en_XX
TGT=hi_IN
NAME=en-hi
DATADIR=../dataset/postprocess/NHG/en-hi
SAVEDIR=../checkpoint

python -u train.py ${DATADIR} \
--arch mbart_large \
--task translation_from_pretrained_bart \
--source-lang ${SRC} \
--target-lang ${TGT} \
--criterion label_smoothed_cross_entropy \
--label-smoothing 0.2  \
--optimizer adam \
--adam-eps 1e-06 \
--lr-scheduler polynomial_decay \
--lr 3e-05 \
--warmup-updates 2500 \
--max-update 100000 \
--dropout 0.3 \
--max-tokens 2048 \
--update-freq 2 \
--save-interval 1 \
--save-interval-updates 50000 \
--keep-interval-updates 10 \
--seed 222 \
--log-interval 100 \
--restore-file $PRETRAIN \
--langs $langs \
--save-dir ${SAVEDIR} \
--fp16 \
--skip-invalid-size-inputs-valid-test \

After fine-tuning the ZmBART model with the English NHG dataset, we conducted zero-shot evaluation for the Hindi language. The details are as follows:

Step-4: Binarization of HindI NHG test data

DATA=../dataset/preprocess/NHG
TRAIN=nhg_en_train
VALID=nhg_en_valid
TEST=nhg_hi_test
SRC=en_XX
TGT=hi_IN
NAME=en-hi
DEST=../dataset/postprocess/NHG
DICT=../mbart.cc25/dict.txt

python -u preprocess.py \
--source-lang ${SRC} \
--target-lang ${TGT} \
--trainpref ${DATA}/${TRAIN}.spm \
--validpref ${DATA}/${VALID}.spm \
--testpref ${DATA}/${TEST}.spm  \
--destdir ${DEST}/${NAME} \
--thresholdtgt 0 \
--thresholdsrc 0 \
--srcdict ${DICT} \
--tgtdict ${DICT} \
--workers 70 \
--fp16

Step-5: Zero-shot Generation in Hindi and Evaluation

export CUDA_VISIBLE_DEVICES=7
langs=ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN
SRC=en_XX
TGT=hi_IN
NAME=en-hi
DATADIR=../dataset/postprocess/NHG/en-hi
model=../checkpoint/checkpoint_best.pt
BPEDIR=../mbart.cc25/sentence.bpe.model
PREDICTIONS_DIR=./outputs

python -u generate.py ${DATADIR} \
  --path $model \
  --task translation_from_pretrained_bart \
  --gen-subset test \
  -s en_XX \
  -t hi_IN \
  --bpe 'sentencepiece' \
  --sentencepiece-model $BPEDIR \
  --sacrebleu \
  --remove-bpe 'sentencepiece'\
  --max-sentences 64 \
  --fp16 \
  --max-len-b 100 \
  --skip-invalid-size-inputs-valid-test \
  --langs $langs > $PREDICTIONS_DIR/out.txt \

### Creating hypothesis and Reference files
grep ^H $PREDICTIONS_DIR/out.txt | cut -f3- > $PREDICTIONS_DIR/en.hyp.txt
grep ^T $PREDICTIONS_DIR/out.txt | cut -f2- > $PREDICTIONS_DIR/en.ref.txt

### Hindi Language Tokenization and Evaluation
python hindi_tok.py $PREDICTIONS_DIR/en.ref.txt $PREDICTIONS_DIR/en.hyp.txt
wc -l $PREDICTIONS_DIR/hindi_hyp.txt
wc -l $PREDICTIONS_DIR/hindi_ref.txt
sacrebleu -tok 'none' -s 'none' $PREDICTIONS_DIR/hindi_ref.txt < $PREDICTIONS_DIR/hindi_hyp.txt
python rouge_test.py $PREDICTIONS_DIR/hindi_hyp.txt $PREDICTIONS_DIR/hindi_ref.txt
bert-score -r $PREDICTIONS_DIR/hindi_ref.txt -c $PREDICTIONS_DIR/hindi_hyp.txt --lang hi

Few-shot training for News Headline Generation task in Hindi Language

Step-01 to Step-03 are similar to the Zero-shot setting.

Step-4: Binarization of Hindi training, validation, and test dataset

DATA=../dataset/preprocess/NHG
TRAIN=nhg_hi_train
VALID=nhg_hi_valid
TEST=nhg_hi_test
SRC=en_XX
TGT=hi_IN
NAME=en-hi
DEST=../dataset/postprocess/NHG
DICT=../mbart.cc25/dict.txt

python -u preprocess.py \
--source-lang ${SRC} \
--target-lang ${TGT} \
--trainpref ${DATA}/${TRAIN}.spm \
--validpref ${DATA}/${VALID}.spm \
--testpref ${DATA}/${TEST}.spm  \
--destdir ${DEST}/${NAME} \
--thresholdtgt 0 \
--thresholdsrc 0 \
--srcdict ${DICT} \
--tgtdict ${DICT} \
--workers 70 \
--fp16

Step-5: Few-shot fine-tuning for Hindi Language

export CUDA_VISIBLE_DEVICES=4,5,6,7
PRETRAIN=../checkpoint/checkpoint_best.pt
langs=ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN
SRC=en_XX
TGT=hi_IN
NAME=en-hi
DATADIR=../dataset/postprocess/NHG/en-hi
SAVEDIR=../checkpoint

python -u train.py ${DATADIR} \
--arch mbart_large \
--task translation_from_pretrained_bart \
--source-lang ${SRC} \
--target-lang ${TGT} \
--criterion label_smoothed_cross_entropy \
--label-smoothing 0.2  \
--optimizer adam \
--adam-eps 1e-06 \
--lr-scheduler polynomial_decay \
--lr 3e-05 \
--warmup-updates 2500 \
--max-update 100000 \
--dropout 0.3 \
--max-tokens 2048 \
--update-freq 2 \
--save-interval 1 \
--save-interval-updates 50000 \
--keep-interval-updates 10 \
--seed 222 \
--log-interval 100 \
--restore-file $PRETRAIN \
--langs $langs \
--save-dir ${SAVEDIR} \
--fp16 \
--skip-invalid-size-inputs-valid-test \

Step-6: Few-shot Generation in Hindi and Evaluation

langs=ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN
SRC=en_XX
TGT=hi_IN
NAME=en-hi
DATADIR=../dataset/postprocess/NHG/en-hi
model=../checkpoint/checkpoint_best.pt
BPEDIR=../mbart.cc25/sentence.bpe.model
PREDICTIONS_DIR=./outputs

python -u generate.py ${DATADIR} \
  --path $model \
  --task translation_from_pretrained_bart \
  --gen-subset test \
  -s en_XX \
  -t hi_IN \
  --bpe 'sentencepiece' \
  --sentencepiece-model $BPEDIR \
  --sacrebleu \
  --remove-bpe 'sentencepiece'\
  --max-sentences 64 \
  --fp16 \
  --max-len-b 100 \
  --skip-invalid-size-inputs-valid-test \
  --langs $langs > $PREDICTIONS_DIR/out.txt \

### Creating hypothesis and Reference files
grep ^H $PREDICTIONS_DIR/out.txt | cut -f3- > $PREDICTIONS_DIR/en.hyp.txt
grep ^T $PREDICTIONS_DIR/out.txt | cut -f2- > $PREDICTIONS_DIR/en.ref.txt

### Hindi Language Tokenization and Evaluation
python hindi_tok.py $PREDICTIONS_DIR/en.ref.txt $PREDICTIONS_DIR/en.hyp.txt
wc -l $PREDICTIONS_DIR/hindi_hyp.txt
wc -l $PREDICTIONS_DIR/hindi_ref.txt
sacrebleu -tok 'none' -s 'none' $PREDICTIONS_DIR/hindi_ref.txt < $PREDICTIONS_DIR/hindi_hyp.txt
python rouge_test.py $PREDICTIONS_DIR/hindi_hyp.txt $PREDICTIONS_DIR/hindi_ref.txt
bert-score -r $PREDICTIONS_DIR/hindi_ref.txt -c $PREDICTIONS_DIR/hindi_hyp.txt --lang hi

License

This project is licensed under the MIT License. Please see fairseq's LICENSE file for more details.

Citation

@inproceedings{maurya-etal-2021-zmbart,
    title = "{Z}m{BART}: An Unsupervised Cross-lingual Transfer Framework for Language Generation",
    author = "Maurya, Kaushal Kumar  and
      Desarkar, Maunendra Sankar  and
      Kano, Yoshinobu  and
      Deepshikha, Kumari",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.248",
    doi = "10.18653/v1/2021.findings-acl.248",
    pages = "2804--2818",
}

Name		Name	Last commit message	Last commit date
Latest commit History 57 Commits
build		build
docs		docs
examples		examples
fairseq.egg-info		fairseq.egg-info
fairseq		fairseq
fairseq_cli		fairseq_cli
scripts		scripts
tests		tests
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
Code_readMe		Code_readMe
LICENSE		LICENSE
README.md		README.md
ZmBART.png		ZmBART.png
denoising_Script.ipynb		denoising_Script.ipynb
environment.yml		environment.yml
hubconf.py		hubconf.py
pyproject.toml		pyproject.toml
sample1.png		sample1.png
sample_zmbart.png		sample_zmbart.png
setup.py		setup.py
train.py		train.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ZmBART: An Unsupervised Cross-lingual Transfer Framework for Language Generation

Zero-shot Sample Generation from ZmBART for NHG Task in Hindi

About

Installation Instructions

Other Setups

Training and Generation

Step-1: ZmBART Checkpoint

Step-2: Binarization of NHG English training data

Step-3: Fine-tuning of ZmBART

Step-4: Binarization of HindI NHG test data

Step-5: Zero-shot Generation in Hindi and Evaluation

Few-shot training for News Headline Generation task in Hindi Language

Step-4: Binarization of Hindi training, validation, and test dataset

Step-5: Few-shot fine-tuning for Hindi Language

Step-6: Few-shot Generation in Hindi and Evaluation

License

Citation

About

Releases

Packages

Languages

License

kaushal0494/ZmBART

Folders and files

Latest commit

History

Repository files navigation

ZmBART: An Unsupervised Cross-lingual Transfer Framework for Language Generation

Zero-shot Sample Generation from ZmBART for NHG Task in Hindi

About

Installation Instructions

Other Setups

Training and Generation

Step-1: ZmBART Checkpoint

Step-2: Binarization of NHG English training data

Step-3: Fine-tuning of ZmBART

Step-4: Binarization of HindI NHG test data

Step-5: Zero-shot Generation in Hindi and Evaluation

Few-shot training for News Headline Generation task in Hindi Language

Step-4: Binarization of Hindi training, validation, and test dataset

Step-5: Few-shot fine-tuning for Hindi Language

Step-6: Few-shot Generation in Hindi and Evaluation

License

Citation

About

Resources

License

Code of conduct

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages