Resources for the paper 'Abstractive Document Summarization Without Parallel Data'.

Datasets

We provide the following datasets:

  • The CNN/DailyMail parallel summarization dataset we use in the paper. Our processed version is available here.
  • Our dataset of the plain text of 350k press releases, scraped from EurekAlert, can be downloaded from here. Thanks to EurekAlert for allowing us to share it.
  • Our testing dataset for the scientific summarization task ...(to be added, waiting for permission)...

Dependencies

This code base comes with a modified old version of Fairseq, as well as with snapshots of the subword-nmt, METEOR and ROUGE repositories.

To install all project dependencies, run pip install -r requirements.txt. You'll also need to install our Fairseq version, which might require downgrading PyTorch, depending on future compatibility.

Running the system

The pipeline for training the sentence paraphrasing model consists of two components: a sentence extractor, which selects salient sentences from the article, and a sentence abstractor, which paraphrases each of the extracted sentences.

We have implemented two sentence extractors: Lead, which picks the first sentences from the article, and LexRank.
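To make the Lead strategy concrete, here is a minimal sketch. It is not the repository's implementation: the function name `lead_extract` is hypothetical, and it assumes the article has already been split into one sentence per line, which may differ from the input format the repository's scripts expect.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of this repository.
# lead_extract N FILE: print the first N lines of FILE.
# Assumption: FILE holds a single article with one sentence per line,
# so the first N lines are the first N sentences.
lead_extract() {
  local n="$1" file="$2"
  head -n "$n" "$file"
}

# Example call (the file name is a placeholder):
#   lead_extract 3 article.txt
```

LexRank, by contrast, ranks sentences by graph centrality and is not covered by this sketch.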

Extracting pseudo-parallel data

Follow the instructions from this repository to extract pseudo-parallel data from your raw datasets.

Training a backtranslation model on the sentence level

Given a pseudo-parallel dataset whose files follow the naming convention train.article.clean, train.summary.clean, valid.article.clean, valid.summary.clean, test.article.clean, test.summary.clean (one sentence per line in each file), you can train a backtranslation model on your pseudo-parallel sentences using the following commands:

# Define the path to the BPE and Fairseq dictionaries you will use. 
# If you don't provide them, they will be learned from the data automatically. 
export BPE_DICT=~/nikola/joint/joint50k.bpe
export FAIRSEQ_DICT=~/nikola/joint/data/dict.clean.txt
# Start the pipeline
bash train_pipeline/base_pipeline.sh 1 summary article true true lstm_tiny $BPE_DICT $BPE_DICT $FAIRSEQ_DICT $FAIRSEQ_DICT

Here, the model uses all train/validation/test files whose names contain summary as the source datasets, and all files whose names contain article as the target datasets. The command takes care of dataset preparation, conversion to BPE, and training. The script uses the Fairseq library for training; read the scripts for more information on the specific commands.
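Before starting the pipeline, it can help to confirm that a dataset directory follows the naming convention described above. The sketch below is an illustration only: `check_dataset` is a hypothetical helper, not a script shipped with this repository, and the requirement that each article/summary pair has the same number of lines is an assumption that follows from the files being sentence-aligned.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of this repository.
# check_dataset DIR: verify that DIR contains the six expected files and
# that the article and summary files of each split have equal line counts.
check_dataset() {
  local dir="$1" split side n_article n_summary
  for split in train valid test; do
    for side in article summary; do
      if [ ! -f "$dir/$split.$side.clean" ]; then
        echo "missing: $dir/$split.$side.clean" >&2
        return 1
      fi
    done
    # tr strips the padding that some wc implementations add.
    n_article=$(wc -l < "$dir/$split.article.clean" | tr -d '[:space:]')
    n_summary=$(wc -l < "$dir/$split.summary.clean" | tr -d '[:space:]')
    if [ "$n_article" -ne "$n_summary" ]; then
      echo "line count mismatch in $split: $n_article vs $n_summary" >&2
      return 1
    fi
  done
  echo "ok"
}

# Example call (the directory is a placeholder):
#   check_dataset /path/to/dataset
```

The function prints ok and returns 0 when all checks pass; otherwise it reports the first problem on stderr and returns 1.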

Generating synthetic sentences using the backtranslation model

Once the backtranslation model is trained, you can use it to synthesize additional source sentences. This can be done using the following script:

bash main/backtranslate_summary_sentences.sh target_sentences 2

The script expects the input file target_sentences to contain one sentence per line. By default, one synthetic sentence is generated for each input sentence.

Training the final sentence paraphrasing model

Once all your data is prepared and combined into a single dataset (again following the naming convention train.article.clean, train.summary.clean, valid.article.clean, valid.summary.clean, test.article.clean, test.summary.clean), you can train your final sentence paraphrasing model with the same training script, this time setting the source and target datasets to your article and summary sentences, respectively:

bash train_pipeline/base_pipeline.sh 1 article summary true true lstm_tiny $BPE_DICT $BPE_DICT $FAIRSEQ_DICT $FAIRSEQ_DICT
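How the data is "combined into a single dataset" is not spelled out here. One possible approach, sketched below, is to concatenate the training pairs from several directories in the same order on both sides, so that line i of the article file still matches line i of the summary file. The helper name `combine_train` and the directory layout are assumptions for illustration, not part of this repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of this repository.
# combine_train OUT_DIR SRC_DIR...: concatenate train.article.clean and
# train.summary.clean from each SRC_DIR, in the given order, into OUT_DIR.
# Assumption: every SRC_DIR holds an aligned pair of training files.
combine_train() {
  local out="$1" side dir
  shift
  mkdir -p "$out"
  for side in article summary; do
    : > "$out/train.$side.clean"   # start from an empty output file
    for dir in "$@"; do
      # "awk 1" copies every line and adds a final newline if one is
      # missing, so the last line of one file never merges with the
      # first line of the next.
      awk 1 "$dir/train.$side.clean" >> "$out/train.$side.clean"
    done
  done
}

# Example call (directory names are placeholders):
#   combine_train combined pseudo_parallel synthetic
```

The validation and test files are left untouched in this sketch; copy or merge them separately as your setup requires.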

Running the whole system

Once the final paraphrasing model is trained, you can run the whole extractive-abstractive pipeline using the following command:

bash main/inference_pipeline.sh valid.paper.clean 2 10 lead

The above will first apply the Lead extractor to your article, and will then paraphrase each of the extracted sentences using your previously trained paraphrasing model. You'll need to modify some of the variables in the inference_pipeline.sh script to make it point to the correct model/vocabulary files of your paraphrasing model.

Citation

@InProceedings{nikolov2020abstractive,
  author = {Nikola I. Nikolov and Richard Hahnloser},
  title = {Abstractive Document Summarization without Parallel Data},
  booktitle = {Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC 2020)},
  year = {2020},
  month = {may},
  date = {11-16},
  location = {Marseille, France},
  editor = {},
  publisher = {European Language Resources Association (ELRA)},
  address = {Paris, France},
  language = {english}
  }