Improving the Similarity Measure of Determinantal Point Processes for Extractive Multi-Document Summarization
We provide the source code for the paper "Improving the Similarity Measure of Determinantal Point Processes for Extractive Multi-Document Summarization", accepted at ACL'19. If you find the code useful, please cite the following paper.
@inproceedings{cho-lebanoff-foroosh-liu:2019,
Author = {Sangwoo Cho and Logan Lebanoff and Hassan Foroosh and Fei Liu},
Title = {Improving the Similarity Measure of Determinantal Point Processes for Extractive Multi-Document Summarization},
Booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL)},
Year = {2019}}
This repository contains the code for a similarity measure network built on a Capsule network.
This code was developed with the following environment:
- Python 2.7
- Keras 2.2.4
- Tensorflow 1.12.0 backend
pip install -r requirements.txt
$ git clone https://github.com/sangwoo3/summarization-dpp-capsnet.git && cd summarization-dpp-capsnet
$ mkdir data && cd data
- Download the CNN/DM summary pair dataset from HERE and extract it under the /data directory.
- This summary dataset is pre-processed with a vocabulary of the 50k most frequent words in the CNN/DM summary pair dataset. The label is 1 for a positive sentence pair and 0 for a negative pair. A positive pair consists of a summary sentence and its most similar sentence in the source document, that is, the sentence that yields the highest Rouge scores. A negative pair consists of the same summary sentence and a random sentence from the same document.
- Download the Glove word vectors for the 50k vocabulary from HERE and place them under the /data directory.
- The 300d Glove word vectors trained on 6B tokens are used (LINK).
- If you want the raw CNN/DM summary dataset, download it from HERE.
- This data contains candidate summary sentences for each document. It is pre-processed with the preprocess.py file to generate the above CNN/DM summary pair dataset.
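The pair labeling described above (a positive pair uses the source sentence with the highest Rouge score against a summary sentence, a negative pair uses a random sentence from the same document) can be sketched as follows. This is a minimal illustration and not the repository's preprocess.py: the names rouge1_f1 and build_pairs are hypothetical, the unigram-overlap F1 is a simplified stand-in for a full Rouge scorer, and excluding the positive sentence when sampling the negative is an assumption.

```python
import random
from collections import Counter


def rouge1_f1(candidate, reference):
    """Unigram-overlap F1; a simplified stand-in for a full Rouge scorer."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / float(sum(cand.values()))
    recall = overlap / float(sum(ref.values()))
    return 2 * precision * recall / (precision + recall)


def build_pairs(summary_sentence, source_sentences, rng):
    """Return (summary, source, label) tuples: one positive, at most one negative."""
    scores = [rouge1_f1(s, summary_sentence) for s in source_sentences]
    # Positive pair: the source sentence with the highest score (label 1).
    best = max(range(len(source_sentences)), key=scores.__getitem__)
    pairs = [(summary_sentence, source_sentences[best], 1)]
    # Assumption: sample the negative from the remaining sentences only,
    # so it can never coincide with the positive sentence (label 0).
    others = [s for i, s in enumerate(source_sentences) if i != best]
    if others:
        pairs.append((summary_sentence, rng.choice(others), 0))
    return pairs


source = [
    "The storm hit the coast on Monday morning.",
    "Officials said thousands of homes lost power.",
    "Local schools will remain closed this week.",
]
summary = "Thousands of homes lost power after the storm."
pairs = build_pairs(summary, source, random.Random(0))
for _, sentence, label in pairs:
    print("{} | {}".format(label, sentence))
```

Passing a seeded random.Random instance keeps the negative sampling reproducible across runs.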
$ python main_Capsnet.py
$ python main_Capsnet.py --testing
$ python main_Capsnet.py --testing --test_mode STS
- Download the pre-trained model from HERE and place it under the /result/capnet_sim directory.
- /result/capnet_sim is the default directory for training results.
- Download the model fine-tuned on the STS dataset from HERE.
- This model is trained on the CNN/DM summary pair dataset and then fine-tuned on STS.
- It can be used to evaluate STS prediction accuracy.
We provide our best system summaries for DUC04 and TAC11. They were generated with DPP and are located in the system_summary directory.
We do not provide the DPP code or the multi-document datasets due to licensing restrictions. Please refer to the DPP code, and obtain the DUC 03/04 and TAC 08/09/10/11 datasets through your own request and approval.
This project is licensed under the BSD License - see the LICENSE.md file for details.