This is our GitHub project repository for the UCLA CS 269 - Natural Language Generation course taught by Prof. Nanyun Peng.
Authors & Repository Contributors: Aakash Srinivasan, Arvind Krishna Sridhar, Anubhav Mittal
Abstractive text summarization is a well-studied task in NLG. Recently, transformer-based pretrained models have been quite successful in generating quality summaries, achieving state-of-the-art performance in terms of both human evaluation and ROUGE scores. However, little work has been done on designing fine-tuning approaches for these models that allow generation of tailor-made summaries based on user preferences such as length, style, and the entities of interest. In this work, we propose TailorBART - a BART-based model capable of incorporating user preferences to generate tailored summaries. Through a variety of control-specific experiments and standard automated metrics, we show empirically that the summaries generated by our model successfully capture user preferences while also significantly improving ROUGE scores.
We developed and released four TailorBART model checkpoints for different user controls (please use a @ucla.edu account to access them):
- Length Control: TailorBART (length) checkpoint
- Entity Control: TailorBART (entity) checkpoint
- Length + Entity Control: TailorBART (length+entity) checkpoint
- Style Control: TailorBART (style) checkpoint
We developed TailorBART primarily using the Hugging Face Transformers library.
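As a quick illustration, a released checkpoint can be loaded and queried with standard Transformers calls. The sketch below is not the repo's exact interface: the checkpoint path is a placeholder, and the `<len_2>` length-control token is our illustrative assumption about how a user preference is prepended to the input.

```python
# Minimal sketch of loading a TailorBART checkpoint and generating a
# summary. The checkpoint path and the "<len_2>" control token are
# illustrative assumptions, not the repo's exact conventions.
from transformers import BartForConditionalGeneration, BartTokenizer

checkpoint = "/path/to/tailorbart-length-checkpoint"  # placeholder path
tokenizer = BartTokenizer.from_pretrained(checkpoint)
model = BartForConditionalGeneration.from_pretrained(checkpoint)

article = "(CNN) Some news article text ..."
# Prepend a (hypothetical) length-bin token to express the user preference.
inputs = tokenizer("<len_2> " + article, return_tensors="pt",
                   truncation=True, max_length=1024)
summary_ids = model.generate(inputs["input_ids"], num_beams=4,
                             no_repeat_ngram_size=3, length_penalty=2.0)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```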
Following are the environment requirements for running our training pipeline:
python == 3.7.6
numpy >= 1.19.4
pytorch >= 1.4.0
datasets >= 1.1.3
transformers >= 3.4.0
rouge-score == 0.0.4
pytorch_lightning >= 1.0.5
matplotlib >= 3.1.1
spacy >= 2.3.4
nltk >= 3.4.5
To run our demo, you will additionally need to install the Streamlit library.
We use the CNN-DailyMail dataset for our experiments. We extracted the dataset and saved it in convenient line-separated .txt files for articles and highlights. We share our preprocessed datasets below:
We used Google Colab to preprocess the data and determine the length bins. We chose the length bins such that each bin contains roughly the same number of training examples - Preprocessing Colab Notebook
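For concreteness, equal-frequency bins can be obtained from percentiles of the summary-length distribution. A sketch of the idea (the bin count of four and whitespace tokenization are our assumptions for illustration):

```python
# Sketch of equal-frequency length binning: put bin edges at percentiles
# so each bin holds roughly the same number of training summaries.
# num_bins=4 and whitespace tokenization are illustrative assumptions.
import numpy as np

highlights = ["short summary .",
              "a somewhat longer reference summary .",
              "an even longer reference summary with many more tokens ."]
lengths = np.array([len(h.split()) for h in highlights])

num_bins = 4
edges = np.percentile(lengths, np.linspace(0, 100, num_bins + 1))
bin_ids = np.digitize(lengths, edges[1:-1])  # bin id in [0, num_bins - 1]
```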
For entity control, we found the entities belonging to important classes such as PERSON, ORG, and GPE using spaCy's named entity recognition and filtered them down to the ten most frequent entities for each article. We anonymized these entities in both the articles and their corresponding ground-truth summaries with special tokens using regex. Check here for implementation details.
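A sketch of this anonymization step (the `<ent_i>` token format is our illustrative assumption; the repo's exact tokens may differ):

```python
# Sketch of entity anonymization: tag PERSON/ORG/GPE entities with spaCy,
# keep the ten most frequent per article, and mask them in both the
# article and its summary. The "<ent_i>" token format is an assumption.
import re
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")

def anonymize(article, summary, labels=("PERSON", "ORG", "GPE"), top_k=10):
    counts = Counter(ent.text for ent in nlp(article).ents
                     if ent.label_ in labels)
    for i, (entity, _) in enumerate(counts.most_common(top_k)):
        pattern = re.compile(re.escape(entity))
        article = pattern.sub(f"<ent_{i}>", article)
        summary = pattern.sub(f"<ent_{i}>", summary)
    return article, summary
```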
For source-style control, we needed to determine the ground-truth source of each article - whether the article and its corresponding highlight come from CNN or DailyMail. To find this, we perform a string search for the (CNN) token in the input article. Articles containing this token are classified as belonging to the CNN source, and the rest belong to DailyMail. We observed that the dataset contains 90k CNN articles, with the remaining 180k belonging to DailyMail. Check here for implementation details.
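The labeling rule itself is a one-liner; here is a sketch of the heuristic described above:

```python
# The source-labeling heuristic described above: an article is CNN if it
# contains the "(CNN)" marker, otherwise DailyMail.
def label_source(article: str) -> str:
    return "cnn" if "(CNN)" in article else "dailymail"

assert label_source("(CNN) Breaking news ...") == "cnn"
assert label_source("A story without the marker ...") == "dailymail"
```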
We implemented a unified script for training TailorBART with the different controls. The training script is implemented in `finetune_bart_model.py` here. We primarily use pytorch-lightning as a high-level training wrapper for our implementation.
Our implementation supports the different control methods; the following values in the code can be modified for different training and fine-tuning strategies:
FINETUNE_MODE = 'length' # Which control to perform - one of ['length', 'entity', 'length_entity', 'style']
DATA_PATH = '/path/to/data' # Path to the preprocessed data. For entity control, use the anonymized data downloadable above.
MODEL_NAME = 'facebook/bart-base' # Which pretrained BART model to use - several BART variants are available from Hugging Face
BATCH_SIZE = 8 # Batch Size for training the model
MAX_EPOCHS = 10 # Max number of epochs for training
Note: We perform gradient accumulation during training with `accumulate_grad_batches=4` and set `limit_train_batches=0.1`. We save all checkpoints at the end of each epoch and choose the best model using the validation split. Empirically, we observed that fixing `learning_rate=1e-4`, `eval_beams=4`, `no_repeat_ngram_size=3`, and `length_penalty=2.0` gave the best performance. We trained on a single GPU due to GCP resource constraints, but the script is scalable to multi-GPU training.
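Put together, these settings correspond roughly to the following pytorch-lightning configuration (a sketch only; `SummarizationModule` is a hypothetical stand-in for the LightningModule defined in `finetune_bart_model.py`):

```python
# Sketch of the Trainer setup described above (pytorch-lightning 1.0.x
# style). SummarizationModule is a hypothetical stand-in for the
# LightningModule defined in finetune_bart_model.py.
import pytorch_lightning as pl

model = SummarizationModule(learning_rate=1e-4)  # hypothetical module/signature

trainer = pl.Trainer(
    gpus=1,                      # single GPU; the script scales to multi-GPU
    max_epochs=10,
    accumulate_grad_batches=4,   # gradient accumulation
    limit_train_batches=0.1,     # train on 10% of the batches per epoch
    checkpoint_callback=pl.callbacks.ModelCheckpoint(save_top_k=-1),  # keep every epoch
)
trainer.fit(model)
```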
We trained our models on Google Cloud Platform (GCP) Deep Learning VMs, which come with several predefined libraries and CUDA support. We used GCP instances with 30 GB RAM, 8 CPUs, and one Tesla K80 GPU for training the TailorBART models.
We primarily used Google Colab for evaluation. We developed the following Colab notebook for multi-purpose automatic evaluation of our models: Colab Notebook Link. Like the training code, the evaluation script is generic and can be used for all control types. Set the runtime type in Colab to GPU for faster execution. You will also need to upload the datasets to Google Drive for access. `compute_rouge.py` is a helper utility for accurate evaluation of ROUGE scores.
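For reference, scoring with the rouge-score package pinned in the requirements looks like this (a minimal example; `compute_rouge.py` presumably builds on similar calls, and its exact logic may differ):

```python
# Minimal ROUGE example with the rouge-score package; compute_rouge.py
# presumably builds on similar calls (its exact logic may differ).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=True)
scores = scorer.score("police arrested the suspect on tuesday",     # reference
                      "the suspect was arrested by police tuesday")  # prediction
for name, score in scores.items():
    print(name, round(score.fmeasure, 4))
```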
We used Streamlit in Python to create an interactive web interface, with our trained models in the backend, that generates an output summary based on the user's preferences. Navigation between the different summary controls is presented as a single-choice radio group. For selecting entities, we display a dropdown menu with the ten most frequent entities in the article. The output summary section highlights the requested entities and also reports the summary length. To try our demo, simply run `streamlit run web-interface/web_interface.py` with the appropriate model checkpoints stored in the configurable checkpoint folder. See this for more details on how to configure the checkpoint path. Once you run this command, you get both a local IP address and an external IP address to try out the demo.
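A stripped-down sketch of the interface logic (widget labels and the summarize hook are illustrative; see `web-interface/web_interface.py` for the real implementation):

```python
# Stripped-down Streamlit sketch of the demo's controls; labels and the
# placeholder output are illustrative, not the repo's exact code.
import streamlit as st

control = st.radio("Summary control",
                   ["Length", "Entity", "Length + Entity", "Style"])
article = st.text_area("Article to summarize")

if control in ("Entity", "Length + Entity"):
    # In the real demo this lists the article's ten most frequent entities.
    entities = st.multiselect("Entities of interest",
                              ["<entity 1>", "<entity 2>", "<entity 3>"])

if st.button("Summarize"):
    st.write("summary would appear here")  # the TailorBART checkpoint generates it
```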