AACL'2022: Unsupervised Single Document Abstractive Summarization using Semantic Units

Unsupervised Single Document Abstractive Summarization using Semantic Units

This repository contains the source code for our AACL'2022 paper, Unsupervised Single Document Abstractive Summarization using Semantic Units (paper link).

Environment

The code requires the following environment:

Operating system: Ubuntu 18.04+
Python version: 3.6.9+
CUDA version: 11.2
Packages: sum_dist/requirements.txt
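
Before installing anything, it can help to confirm that the interpreter meets the version listed above. The following is a minimal sketch, not part of the repository: `python_ok` and `cuda_summary` are hypothetical helper names, and the CUDA report only works once PyTorch is installed.

```python
import sys

# Minimum Python version taken from the requirements listed above.
MIN_PYTHON = (3, 6, 9)


def python_ok(version=None):
    """Return True if the given (or running) interpreter meets MIN_PYTHON."""
    version = tuple(sys.version_info[:3]) if version is None else tuple(version)
    return version >= MIN_PYTHON


def cuda_summary():
    """Describe the CUDA build PyTorch reports, or note that torch is missing."""
    try:
        import torch  # optional here; installed during the Installation step
    except ImportError:
        return "torch not installed"
    return "torch {} (CUDA build: {}, GPU available: {})".format(
        torch.__version__, torch.version.cuda, torch.cuda.is_available()
    )


if __name__ == "__main__":
    print("Python OK:", python_ok())
    print(cuda_summary())
```

Running the file prints whether the interpreter is new enough and, if PyTorch is present, which CUDA build it was compiled against.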

Installation

  1. Download this repo
git clone git@github.com:IKMLab/UASSU.git
# or
git clone https://github.com/IKMLab/UASSU.git
  2. Install packages
cd UASSU
pip install -r requirements.txt
pip install git+https://github.com/huggingface/datasets
# If using CUDA11:
pip install torch==1.7.1+cu110 -f https://download.pytorch.org/whl/torch_stable.html

Evaluation also requires pyrouge.

# The setup steps are bundled into a script.
bash pyrouge_setup.sh

A successful pyrouge installation prints output like:

------------------------------------------------
Ran 10 tests in 4.482s

OK

Reference of potential problems when installing pyrouge:

spaCy must also be set up.

pip install -U pip setuptools wheel
pip install -U spacy

# Install models for corresponding languages
# en (for CNN/DM, XSum, Wiki_en, ArXiv)
python -m spacy download en_core_web_sm
# de (for MLSUM_de)
python -m spacy download de_core_news_sm
# es (for MLSUM_es)
python -m spacy download es_core_news_sm
# ru (for MLSUM_ru)
python -m spacy download ru_core_news_sm
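
The language comments above imply a mapping from dataset to spaCy pipeline. A minimal sketch of that lookup follows, assuming the dataset keys match the `-dataset` values used in the pre-processing commands and using the Spanish pipeline `es_core_news_sm` for MLSUM_es. `SPACY_MODELS`, `spacy_model_for`, and `load_nlp` are hypothetical names; how the repository itself selects a pipeline is not shown here.

```python
# Dataset name -> spaCy pipeline, following the language comments above.
SPACY_MODELS = {
    "cnndm": "en_core_web_sm",
    "xsum": "en_core_web_sm",
    "wiki_en": "en_core_web_sm",
    "arxiv": "en_core_web_sm",
    "mlsum_de": "de_core_news_sm",
    "mlsum_es": "es_core_news_sm",
    "mlsum_ru": "ru_core_news_sm",
}


def spacy_model_for(dataset):
    """Return the spaCy pipeline name for a dataset (case-insensitive)."""
    try:
        return SPACY_MODELS[dataset.lower()]
    except KeyError:
        raise ValueError("unknown dataset: {!r}".format(dataset))


def load_nlp(dataset):
    """Load the pipeline; fails if the model was not downloaded beforehand."""
    import spacy  # imported lazily so the lookup works without spaCy

    return spacy.load(spacy_model_for(dataset))
```

For example, `load_nlp("mlsum_de")` loads the German pipeline, provided the corresponding download command has been run.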

Data pre-processing

Pre-processed data (.pkl) are available at this link. Place the downloaded .pkl files in sum_dist/data/preprocess/
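
To check that the downloaded files landed in the right place, something like the sketch below may be useful. It assumes the files are ordinary Python pickles; the file names and the structure of the unpickled objects are not documented here, so the code only lists the files and reports each top-level type. `list_preprocessed` and `load_preprocessed` are hypothetical helpers. If the pickles reference classes from this project, run the script from the repository root so those classes can be imported.

```python
import pickle
from pathlib import Path

# Directory named in the instructions above.
PREPROCESS_DIR = Path("sum_dist/data/preprocess")


def list_preprocessed(directory=PREPROCESS_DIR):
    """Return the .pkl files found in the directory, sorted by name."""
    return sorted(Path(directory).glob("*.pkl"))


def load_preprocessed(path):
    """Unpickle one file. Only load pickles from a source you trust."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


if __name__ == "__main__":
    for pkl in list_preprocessed():
        data = load_preprocessed(pkl)
        print(pkl.name, type(data).__name__)
```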

Alternatively, pre-process the data yourself with the following commands:

CNN/DM

python -m sum_dist.preprocess.preprocess \
-dataset cnndm \
-read_config sum_dist/exp_configs/config_preliminary_cnndm.json \
-tokenizer bert \
-save_dir ./sum_dist/data/preprocess \
-num_worker 10

XSum

python -m sum_dist.preprocess.preprocess \
-dataset xsum \
-read_config sum_dist/exp_configs/config_preliminary_xsum.json \
-tokenizer bert \
-save_dir ./sum_dist/data/preprocess \
-num_worker 10

MLSUM_de

python -m sum_dist.preprocess.preprocess \
-dataset mlsum_de \
-read_config sum_dist/exp_configs/config_preliminary_mlsum_de.json \
-tokenizer bert \
-save_dir ./sum_dist/data/preprocess \
-num_worker 10

MLSUM_es

python -m sum_dist.preprocess.preprocess \
-dataset mlsum_es \
-read_config sum_dist/exp_configs/config_preliminary_mlsum_es.json \
-tokenizer bert \
-save_dir ./sum_dist/data/preprocess \
-num_worker 10

MLSUM_ru

python -m sum_dist.preprocess.preprocess \
-dataset mlsum_ru \
-read_config sum_dist/exp_configs/config_preliminary_mlsum_ru.json \
-tokenizer bert \
-save_dir ./sum_dist/data/preprocess \
-num_worker 10

Wiki_en

python -m sum_dist.preprocess.preprocess \
-dataset wiki_en \
-read_config sum_dist/exp_configs/config_preliminary_wiki_en.json \
-tokenizer bert \
-save_dir ./sum_dist/data/preprocess \
-num_worker 10

ArXiv

python -m sum_dist.preprocess.preprocess \
-dataset arxiv \
-read_config ./sum_dist/exp_configs/config_preliminary_arxiv.json \
-tokenizer bert \
-save_dir ./sum_dist/data/preprocess \
-num_worker 5
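
The seven commands above differ only in the dataset name, the config file, and the worker count. A minimal sketch of a driver that generates them, assuming it is run from the repository root: `build_preprocess_cmd` and `preprocess_all` are hypothetical helpers, the flags are copied from the commands above, and `sys.executable` stands in for `python`.

```python
import subprocess
import sys

# Dataset names as passed to -dataset in the commands above.
DATASETS = ["cnndm", "xsum", "mlsum_de", "mlsum_es", "mlsum_ru", "wiki_en", "arxiv"]

# The ArXiv command above uses 5 workers; every other dataset uses 10.
NUM_WORKERS = {"arxiv": 5}


def build_preprocess_cmd(dataset, save_dir="./sum_dist/data/preprocess"):
    """Build the argument list for one pre-processing run."""
    config = "sum_dist/exp_configs/config_preliminary_{}.json".format(dataset)
    return [
        sys.executable, "-m", "sum_dist.preprocess.preprocess",
        "-dataset", dataset,
        "-read_config", config,
        "-tokenizer", "bert",
        "-save_dir", save_dir,
        "-num_worker", str(NUM_WORKERS.get(dataset, 10)),
    ]


def preprocess_all(datasets=DATASETS, dry_run=True):
    """Print each command, or run it when dry_run is False."""
    for name in datasets:
        cmd = build_preprocess_cmd(name)
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # stop at the first failure


if __name__ == "__main__":
    preprocess_all(dry_run=True)
```

The dry run is the default so that the generated commands can be compared against the ones listed above before anything is executed.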

Model training

Trained checkpoints are available via the download link. To train a model yourself, run:

bash scripts/train_cnndm_w5.sh

Inference

bash scripts/infer_cnndm_w5.sh

Evaluation

bash scripts/evaluate_cnndm_w5.sh
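
The three stages above run in a fixed order: training, inference, evaluation. A minimal sketch of a driver that chains them, assuming the scripts are invoked from the repository root exactly as shown. The `skip_training` option is hypothetical and only makes sense if the inference script can locate checkpoints obtained separately, which is not confirmed here.

```python
import subprocess

# Script paths copied from the sections above.
STAGES = [
    ("train", "scripts/train_cnndm_w5.sh"),
    ("infer", "scripts/infer_cnndm_w5.sh"),
    ("evaluate", "scripts/evaluate_cnndm_w5.sh"),
]


def stage_commands(skip_training=False):
    """Return the bash commands to run, in order."""
    return [
        ["bash", script]
        for name, script in STAGES
        if not (skip_training and name == "train")
    ]


def run_pipeline(skip_training=False):
    """Run each stage in turn and stop at the first failure."""
    for cmd in stage_commands(skip_training):
        subprocess.run(cmd, check=True)
```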

Datasets & Required Summary Length

Use the values below to set truncate_len during evaluation.

Dataset     Summary Length
CNN/DM      50
XSum        50
MLSUM_de    30
MLSUM_es    20
MLSUM_ru    15
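
To illustrate what a length limit of this kind does, here is a minimal sketch that cuts a summary to the listed length. It assumes the limit counts whitespace-separated tokens, which is an assumption: the unit used by `truncate_len` in the evaluation scripts is not stated here. `TRUNCATE_LEN` and `truncate_summary` are hypothetical names.

```python
# Summary lengths from the table above, keyed by the -dataset names.
TRUNCATE_LEN = {
    "cnndm": 50,
    "xsum": 50,
    "mlsum_de": 30,
    "mlsum_es": 20,
    "mlsum_ru": 15,
}


def truncate_summary(summary, dataset):
    """Keep at most TRUNCATE_LEN[dataset] whitespace-separated tokens."""
    limit = TRUNCATE_LEN[dataset]
    tokens = summary.split()
    return " ".join(tokens[:limit])
```

Summaries shorter than the limit are returned unchanged apart from whitespace normalization.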