This directory contains code necessary to replicate the training and evaluation for the EMNLP 2020 paper "Factual Error Correction for Abstractive Summarization Models" by Meng Cao, Yue Dong, Jiapeng Wu and Jackie Chi Kit Cheung.
Our code is organized into four subdirectories:
build_dataset
: code for building the artificial training and test datasets.
cnn-dailymail
: directory for the cnn-dailymail summarization dataset.
K2019
: directory for the dataset manually annotated by Kryscinski et al. (2019).
model
: wrapper around the fairseq BART model for training.
To build the training dataset, first download the processed cnn-dailymail dataset from this link. Unzip the downloaded files and save them in cnn-dailymail.
Then, run the data creation script to build the training data:
cd build_dataset
sh create_data.sh
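The two steps above can be wrapped in a small pre-flight check, so the data creation script only runs once the downloaded files are in place. This is a minimal sketch, not part of the repository: the helper name data_dir_ready is hypothetical, and it assumes the commands are run from the repository root with the dataset unzipped directly into cnn-dailymail.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): succeeds only if the
# given directory exists and contains at least one entry.
data_dir_ready() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Assumption: this is run from the repository root and the dataset was
# unzipped into cnn-dailymail.
if data_dir_ready cnn-dailymail; then
    # Run the data creation script in a subshell so the caller's
    # working directory is left unchanged.
    (cd build_dataset && sh create_data.sh)
else
    echo "cnn-dailymail is missing or empty; download and unzip the dataset first" >&2
fi
```

The check only confirms that the directory is non-empty. It does not verify which files create_data.sh expects to find there.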
We use BART as our base model. To download and use the BART model, follow the instructions here.
The test set is the annotated cnn-dailymail test set from the Kryscinski et al. (2019) ACL paper.