This repository contains the data and code for our EMNLP 2021 paper Models and Datasets for Cross-Lingual Summarisation. Please contact me at [email protected] with any questions.
Please cite this paper if you use our code or data:
@InProceedings{clads-emnlp,
  author    = "Laura Perez-Beltrachini and Mirella Lapata",
  title     = "Models and Datasets for Cross-Lingual Summarisation",
  booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
  year      = "2021",
  address   = "Punta Cana, Dominican Republic",
}
Our XWikis corpus is now available on HuggingFace datasets. Follow this link to find all language subsets available for download. Thanks to Ronald Cardenas for helping to upload the corpus to HF, and to Huajian Zhang and Guangyu Li for adding the Chinese subsets.
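As a quick way to get started, the subsets can be loaded with the HuggingFace `datasets` library. This is a minimal sketch: the repository id (`GEM/xwikis`) and the subset name (`en-de`) used below are assumptions, so please check the dataset page linked above for the exact identifiers of the language pairs you need.

```python
# Minimal sketch for loading one XWikis subset with the `datasets` library.
# NOTE: the repository id "GEM/xwikis" and subset name "en-de" are assumptions;
# see the HuggingFace dataset page linked above for the exact identifiers.
from datasets import load_dataset

xwikis = load_dataset("GEM/xwikis", "en-de")   # one cross-lingual subset
print(xwikis)                                  # available splits and their sizes
print(xwikis["train"][0].keys())               # inspect the fields of one example
```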
The original XWikis corpus is available at XWikis Corpus.
Instructions to re-create our corpus and extract different languages are available here.
Our code is based on Fairseq and mBART/mBART50. You'll find our clone of Fairseq with the code extensions implementing our models here, as well as instructions to pre-process the data and to train and evaluate our models here.