This repo contains the code for the paper How "Multi" is Multi-Document Summarization? (EMNLP 2022).
conda create --name multi_mds python=3.8
conda activate multi_mds
pip install -r requirements.txt
If the setup fails on jsonnet, see this issue.
You should preprocess your dataset into jsonl format, where each line includes the following fields:
- documents: a list of source documents
- summary: a reference (or system) summary
- topic_id: the instance id
You can find an example input file in the repo: wcep10_test.jsonl.
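As an illustration of the input format, here is a minimal sketch that writes and reads such a jsonl file. The helper names (`validate_instance`, `write_jsonl`, `read_jsonl`) are hypothetical and not part of this repo; only the three field names come from the description above.

```python
import json
from typing import Dict, List

REQUIRED_FIELDS = ("documents", "summary", "topic_id")


def validate_instance(instance: Dict) -> None:
    """Raise ValueError if an instance lacks a required field or has a malformed one."""
    for field in REQUIRED_FIELDS:
        if field not in instance:
            raise ValueError(f"missing field: {field}")
    if not isinstance(instance["documents"], list):
        raise ValueError("'documents' must be a list of source documents")


def write_jsonl(instances: List[Dict], path: str) -> None:
    """Write one JSON object per line (the jsonl convention)."""
    with open(path, "w", encoding="utf-8") as f:
        for instance in instances:
            validate_instance(instance)
            f.write(json.dumps(instance, ensure_ascii=False) + "\n")


def read_jsonl(path: str) -> List[Dict]:
    """Read a jsonl file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

For example, `write_jsonl([{"topic_id": "t0", "documents": ["doc a", "doc b"], "summary": "a summary"}], "my_data.jsonl")` would produce a one-line file, where `my_data.jsonl` is a placeholder path.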
There are several steps for computing the AAC score:
- extract the OpenIE propositions from all source documents and the summary
- prepare pairs of OpenIE propositions
- compute alignment scores between source and summary propositions for each topic
- greedily build the maximally covering subsets of source documents
- compute the Area Above the Curve and save the coverage plot.
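The last two steps can be illustrated with a minimal sketch. It assumes each document has already been mapped to the set of summary propositions it aligns with, and it assumes the AAC is the mean gap between full coverage and the greedy coverage curve; the normalization used in get_aac_scores.py may differ. The function names below are hypothetical and not part of this repo.

```python
from typing import Dict, List, Set


def greedy_coverage_curve(doc_to_covered: Dict[str, Set[int]]) -> List[float]:
    """Return the coverage fraction after greedily adding 1, 2, ..., n documents.

    Coverage is normalized by the number of summary propositions covered
    when all documents are used (an assumption of this sketch).
    """
    total = set().union(*doc_to_covered.values())
    if not total:
        return [0.0] * len(doc_to_covered)
    remaining = dict(doc_to_covered)
    covered: Set[int] = set()
    curve: List[float] = []
    while remaining:
        # Pick the document adding the most not-yet-covered propositions;
        # iterating over sorted ids makes tie-breaking deterministic.
        best = max(sorted(remaining), key=lambda d: len(remaining[d] - covered))
        covered |= remaining.pop(best)
        curve.append(len(covered) / len(total))
    return curve


def area_above_curve(curve: List[float]) -> float:
    """Mean gap between full coverage (1.0) and the curve (assumed discrete AAC)."""
    if not curve:
        return 0.0
    return sum(1.0 - c for c in curve) / len(curve)
```

With `{"d1": {0, 1, 2}, "d2": {2, 3}, "d3": {0}}`, the greedy order is d1, d2, d3 and the curve is `[0.75, 1.0, 1.0]`. Under this definition, a low AAC means that a few documents already cover most of the summary content.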
You can run a single command that computes all steps together and skips steps that are already completed (edit the paths of raw_data_dir and process_dir):
bash run.sh [preprocessed_data] [dir_path]
Alternatively, you can run each step separately, as follows:
- Extract all Open IE tuples from the summary and the source documents.
export raw_data= # path to jsonl file
export data_dir= # output dir
python extract_open_ie.py --raw_data $raw_data \
--data_dir $data_dir \
--gpu 0
This script will create a directory $data_dir/oie with the propositions from the summary and the documents.
- Prepare pairs:
python prepare_oie_pairs.py --data_dir $data_dir
This script will create a file $data_dir/pairs.pickle with all possible pairs of Open IE propositions.
- Compute alignment scores between source and summary propositions for each topic:
python get_superpal_scores.py --data_dir $data_dir \
--model biu-nlp/superpal \
--device_ids 0,1,2,3 \
--batch_size 64
This script will run the alignment model on $data_dir/pairs.pickle and save the results in the directory $data_dir/result_npy.
- Build greedy subsets of documents that maximize coverage:
python build_greedy_subsets.py --data_dir $data_dir
- Compute the AAC score and save the plot in $data_dir/plot.png:
python get_aac_scores.py --data_dir $data_dir
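To make the link between the alignment scores and document coverage more concrete, here is a rough sketch of how scored proposition pairs could be turned into per-document coverage sets. Everything in it is an assumption made for illustration: the function names, the 0.5 threshold, and the toy token-overlap scorer, which merely stands in for the alignment model and is not SuperPAL. The actual file formats and thresholds used by the scripts above are not shown here.

```python
from itertools import product
from typing import Callable, Dict, List, Set


def covered_by_document(
    doc_props: Dict[str, List[str]],
    summary_props: List[str],
    score_fn: Callable[[str, str], float],
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the repo
) -> Dict[str, Set[int]]:
    """Map each document id to the indices of summary propositions it aligns with."""
    covered: Dict[str, Set[int]] = {doc_id: set() for doc_id in doc_props}
    for doc_id, props in doc_props.items():
        # Score every (source proposition, summary proposition) pair.
        for src, (j, tgt) in product(props, enumerate(summary_props)):
            if score_fn(src, tgt) >= threshold:
                covered[doc_id].add(j)
    return covered


def toy_overlap_score(src: str, tgt: str) -> float:
    """Stand-in scorer (token Jaccard overlap); NOT the SuperPAL alignment model."""
    a, b = set(src.lower().split()), set(tgt.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

The resulting mapping from document ids to sets of covered summary propositions is the kind of input a greedy maximum-coverage procedure would consume.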
@inproceedings{Wolhandler2022HowI,
title={How "Multi" is Multi-Document Summarization?},
author={Ruben Wolhandler and Arie Cattan and Ori Ernst and Ido Dagan},
booktitle={EMNLP},
year={2022}
}