Paper accepted at NAACL-HLT 2021:
AdaptSum: Towards Low-Resource Domain Adaptation for Abstractive Summarization, by Tiezheng Yu*, Zihan Liu*, Pascale Fung.
State-of-the-art abstractive summarization models generally rely on extensive labeled data, which lowers their generalization ability on domains where such data are not available. In this paper, we present a study of domain adaptation for the abstractive summarization task across six diverse target domains in a low-resource setting. Specifically, we investigate the second phase of pre-training on large-scale generative models under three different settings: 1) source domain pre-training; 2) domain-adaptive pre-training; and 3) task-adaptive pre-training. Experiments show that the effectiveness of pre-training is correlated with the similarity between the pre-training data and the target domain task. Moreover, we find that continuing pre-training could lead to the pre-trained model's catastrophic forgetting, and a learning method with less forgetting can alleviate this issue. Furthermore, results illustrate that a huge gap still exists between the low-resource and high-resource settings, which highlights the need for more advanced domain adaptation methods for the abstractive summarization task.
We release the AdaptSum dataset, which contains the summarization datasets across six target domains as well as the corpora for SDPT, DAPT and TAPT. You can download AdaptSum from Google Drive.
- Create a new folder named `dataset` at the root of this project
- Download the data from Google Drive and put it in the `dataset` folder
- Create the conda environment
conda create -n adaptsum python=3.6
- Activate the conda environment
conda activate adaptsum
- Install PyTorch. Check your CUDA version before installing and adjust the `cudatoolkit` version accordingly, or refer to the PyTorch website
conda install pytorch cudatoolkit=11.0 -c pytorch
- Install requirements
pip install -r requirements.txt
- Create a new folder named `logs` at the root of this project
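The folder-creation steps in this README can also be scripted. The following is a minimal sketch, assuming the folder names mentioned across the README (`logs/debate` applies only when the debate domain is the target); the names `PROJECT_FOLDERS` and `create_project_folders` are illustrative and not part of the repository.

```python
from pathlib import Path
import tempfile

# Folder names are taken from this README; the constant itself is illustrative.
PROJECT_FOLDERS = [
    "dataset",         # downloaded AdaptSum data
    "logs",            # training logs
    "logs/debate",     # per-domain logs (debate is the example domain)
    "logs/inference",  # inference outputs
    "SDPT_save",       # SDPT checkpoints
    "DAPT_save",       # DAPT checkpoints
    "TAPT_save",       # TAPT checkpoints
]


def create_project_folders(root):
    """Create every folder under `root`, leaving existing ones untouched."""
    root = Path(root)
    created = []
    for name in PROJECT_FOLDERS:
        path = root / name
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


if __name__ == "__main__":
    # Demo in a throwaway directory; pass the repository root for real use.
    with tempfile.TemporaryDirectory() as tmp:
        for path in create_project_folders(tmp):
            print(path.relative_to(tmp))
```

Calling `create_project_folders(".")` from the repository root would create the same layout in place; running it twice is harmless because existing folders are kept.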
- We take `cnn_dm` as an example for SDPT
- Create a new folder named `SDPT_save` at the root of this project
- Prepare dataloader:
python ./src/preprocessing.py -data_path=dataset/ -data_name=SDPT-cnn_dm -mode=train -batch_size=4
- Run `./scripts/sdpt_pretraining.sh`. You can add `-recadam` and `-logging_Euclid_dist` to use RecAdam.
- We take the debate domain as an example for DAPT
- Create a new folder named `DAPT_save` at the root of this project
- Run `./scripts/dapt_pretraining.sh`. You can add `-recadam` and `-logging_Euclid_dist` to use RecAdam.
- We take the debate domain as an example for TAPT
- Create a new folder named `TAPT_save` at the root of this project
- Run `./scripts/tapt_pretraining.sh`. You can add `-recadam` and `-logging_Euclid_dist` to use RecAdam.
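The `-recadam` option in the pre-training steps above relates to the "learning method with less forgetting" mentioned in the abstract. The sketch below illustrates the idea only, assuming the commonly described RecAdam objective: a sigmoid-annealed mix of the task loss and a quadratic penalty that keeps parameters close to the pre-trained ones. All function names and hyperparameter values here are illustrative assumptions, and the optimizer used by this repository may differ in detail. It is also an assumption that `-logging_Euclid_dist` reports a distance of this kind.

```python
import math


def annealing_coefficient(step, k=0.1, t0=1000):
    """Sigmoid schedule lambda(t) = 1 / (1 + exp(-k * (t - t0))), computed stably."""
    x = k * (step - t0)
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def squared_euclidean_distance(params, pretrained_params):
    """Sum of squared differences between current and pre-trained parameters."""
    return sum((p - q) ** 2 for p, q in zip(params, pretrained_params))


def recall_and_learn_loss(task_loss, params, pretrained_params, step,
                          gamma=5000.0, k=0.1, t0=1000):
    """Mix the task loss with a penalty that discourages forgetting.

    Early in training (step << t0) the penalty dominates; later the task loss does.
    gamma, k and t0 are illustrative values, not the repository's settings.
    """
    lam = annealing_coefficient(step, k, t0)
    penalty = 0.5 * gamma * squared_euclidean_distance(params, pretrained_params)
    return lam * task_loss + (1.0 - lam) * penalty


if __name__ == "__main__":
    pretrained = [0.10, -0.20, 0.30]
    current = [0.12, -0.18, 0.33]
    for step in (0, 1000, 2000):
        print(step, recall_and_learn_loss(1.5, current, pretrained, step))
```

In this formulation the weight on the task loss grows smoothly from near 0 to near 1 around step `t0`, so the model first stays near its pre-trained weights and only gradually commits to the new objective.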
- We take the debate domain as an example for fine-tuning
- Create a new folder named `debate` in the `logs` folder
- Prepare dataloader:
python ./src/preprocessing.py -data_path=dataset/ -data_name=debate -mode=train -batch_size=4
python ./src/preprocessing.py -data_path=dataset/ -data_name=debate -mode=valid -batch_size=4
python ./src/preprocessing.py -data_path=dataset/ -data_name=debate -mode=test -batch_size=4
- Install the `pyrouge` package (you can skip this if you have already installed `pyrouge`)
- Step 1: Install pyrouge from source (not from pip)
git clone https://github.com/bheinzerling/pyrouge
cd pyrouge
pip install -e .
- Step 2: Install the official ROUGE script
git clone https://github.com/andersjo/pyrouge.git rouge
- Step 3: Point pyrouge to the official ROUGE script (the path given to pyrouge must be an absolute path!)
pyrouge_set_rouge_path ~/pyrouge/rouge/tools/ROUGE-1.5.5/
- Step 4: Install the libxml parser, as mentioned in this issue
sudo apt-get install libxml-parser-perl
- Step 5: Regenerate the Exceptions DB, as mentioned in this issue
cd rouge/tools/ROUGE-1.5.5/data
rm WordNet-2.0.exc.db
./WordNet-2.0-Exceptions/buildExeptionDB.pl ./WordNet-2.0-Exceptions ./smart_common_words.txt ./WordNet-2.0.exc.db
- Step 6: Run the tests
python -m pyrouge.test
- Run fine-tuning
- If you don't want to use any second phase of pre-training, run:
python ./src/run.py -visible_gpu=0 -data_name=debate -save_interval=100 -start_to_save_iter=3000
- If you want to use pretrained checkpoints from SDPT, run:
python ./src/run.py -visible_gpu=0 -data_name=debate -save_interval=100 -start_to_save_iter=3000 -pre_trained_src -train_from=YOUR_SAVED_CHECKPOINTS
- If you want to use pretrained checkpoints from DAPT or TAPT, run:
python ./src/run.py -visible_gpu=0 -data_name=debate -save_interval=100 -start_to_save_iter=3000 -pre_trained_lm=YOUR_SAVED_CHECKPOINTS
- Evaluate the performance
- Make a folder named `inference` in the `logs` folder
- You can do inference by
python ./src/inference.py -visible_gpu=0 -train_from=YOUR_SAVED_CHECKPOINT
- You can calculate ROUGE scores by
python ./src/cal_roug.py -c=CANDIDATE_FILE -r=REFERENCE_FILE -p=NUMBER_OF_PROCESS
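The preprocessing and fine-tuning commands above differ only in a few flags, so they can be assembled programmatically. The following is a minimal sketch that builds the argument lists using only the flags shown in this README. The helper names are hypothetical, and treating the SDPT checkpoint and the DAPT/TAPT checkpoint as mutually exclusive is an assumption based on the README presenting them as alternatives.

```python
def preprocessing_commands(data_name, batch_size=4, data_path="dataset/"):
    """Return one preprocessing command per split (train, valid, test)."""
    return [
        [
            "python", "./src/preprocessing.py",
            f"-data_path={data_path}",
            f"-data_name={data_name}",
            f"-mode={mode}",
            f"-batch_size={batch_size}",
        ]
        for mode in ("train", "valid", "test")
    ]


def finetune_command(data_name, sdpt_checkpoint=None, lm_checkpoint=None, gpu=0):
    """Return the fine-tuning command for one of the three README variants.

    sdpt_checkpoint: checkpoint saved by SDPT (adds -pre_trained_src and -train_from)
    lm_checkpoint:   checkpoint saved by DAPT or TAPT (adds -pre_trained_lm)
    """
    if sdpt_checkpoint and lm_checkpoint:
        # Assumption: the README lists these as alternatives, not a combination.
        raise ValueError("pass either sdpt_checkpoint or lm_checkpoint, not both")
    cmd = [
        "python", "./src/run.py",
        f"-visible_gpu={gpu}",
        f"-data_name={data_name}",
        "-save_interval=100",
        "-start_to_save_iter=3000",
    ]
    if sdpt_checkpoint:
        cmd += ["-pre_trained_src", f"-train_from={sdpt_checkpoint}"]
    elif lm_checkpoint:
        cmd.append(f"-pre_trained_lm={lm_checkpoint}")
    return cmd


if __name__ == "__main__":
    for cmd in preprocessing_commands("debate"):
        print(" ".join(cmd))
    print(" ".join(finetune_command("debate", lm_checkpoint="YOUR_SAVED_CHECKPOINTS")))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root; keeping the arguments as a list avoids shell quoting issues with checkpoint paths.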
If you use our benchmark or the code in this repo, please cite our paper.
@inproceedings{Yu2021AdaptSum,
  title={AdaptSum: Towards Low-Resource Domain Adaptation for Abstractive Summarization},
  author={Tiezheng Yu and Zihan Liu and Pascale Fung},
  journal={arXiv preprint arXiv:2103.11332},
  year={2021}
}
Also, please consider citing all the individual datasets in your paper.
Dialog domain:
@inproceedings{gliwa2019samsum,
  title={SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization},
  author={Gliwa, Bogdan and Mochol, Iwona and Biesek, Maciej and Wawer, Aleksander},
  booktitle={Proceedings of the 2nd Workshop on New Frontiers in Summarization},
  pages={70--79},
  year={2019}
}
Email domain:
@inproceedings{zhang2019email,
  title={This Email Could Save Your Life: Introducing the Task of Email Subject Line Generation},
  author={Zhang, Rui and Tetreault, Joel},
  booktitle={Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
  pages={446--456},
  year={2019}
}
Movie and debate domains:
@inproceedings{wang2016neural,
  title={Neural Network-Based Abstract Generation for Opinions and Arguments},
  author={Wang, Lu and Ling, Wang},
  booktitle={Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
  pages={47--57},
  year={2016}
}
Social media domain:
@inproceedings{kim2019abstractive,
  title={Abstractive Summarization of Reddit Posts with Multi-level Memory Networks},
  author={Kim, Byeongchang and Kim, Hyunwoo and Kim, Gunhee},
  booktitle={Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)},
  pages={2519--2531},
  year={2019}
}
Science domain:
@inproceedings{yasunaga2019scisummnet,
  title={Scisummnet: A large annotated corpus and content-impact models for scientific paper summarization with citation networks},
  author={Yasunaga, Michihiro and Kasai, Jungo and Zhang, Rui and Fabbri, Alexander R and Li, Irene and Friedman, Dan and Radev, Dragomir R},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={33},
  pages={7386--7393},
  year={2019}
}