pdfsummary

Python script that creates a text summary of a PDF file.

Created for HackPrinceton2019.

Installation

To install pdfsummary from GitHub:

git clone https://github.com/archen2019/pdfsummary

pdfsummary also needs the dependencies listed in requirements.txt. To install them, run the following from the cloned folder:

pip install -r requirements.txt

tesseract is also a required dependency. To install tesseract, follow the instructions at https://github.com/tesseract-ocr/tesseract/wiki.

Usage

To run pdfsummary, enter the cloned folder and run run.py. When prompted, enter the file name of the PDF, the number of sentences in the summary, and the number of key phrases.

$ cd pdfsummary
$ python run.py
File Name: [FILE-NAME]
Number of sentences in summary: [NUM-SENTENCES]
Number of key phrases: [NUM-PHRASES]

This creates four files in the directory of the original PDF:

keyphrases.txt: a text file containing the key phrases.
summary.txt: a text file containing the summary.
Summary.pdf: a PDF file containing the key phrases and the summary.
highlighted.pdf: a copy of the original PDF with the key phrases highlighted.
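
If you call pdfsummary from another script, it can help to know where the outputs will land. The following is a minimal sketch, not part of pdfsummary itself: the helper name expected_outputs is hypothetical, and it only assumes what is stated above, namely that the four files are written next to the input PDF.

```python
from pathlib import Path
from typing import List

# File names as listed in this README.
OUTPUT_NAMES = ("keyphrases.txt", "summary.txt", "Summary.pdf", "highlighted.pdf")


def expected_outputs(pdf_path: str) -> List[Path]:
    """Return the paths where the four output files should appear.

    Hypothetical helper: assumes the outputs are written to the
    directory that contains the input PDF.
    """
    folder = Path(pdf_path).parent
    return [folder / name for name in OUTPUT_NAMES]


if __name__ == "__main__":
    for path in expected_outputs("papers/report.pdf"):
        # Report which outputs already exist on disk.
        status = "found" if path.exists() else "missing"
        print(f"{path}: {status}")
```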

Methodology

Use pdf2image to convert the PDF into PNG images.
Use tesseract to extract text from images and create a text-searchable copy of the original PDF.
Process text to remove extra newlines and reconnect hyphenated words.
Use sumy to generate a summary of the processed text.
Use pke to generate key phrases from the processed text.
Create a PDF containing the key phrases and the summary.
Highlight the key phrases in the text-searchable PDF.
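
The text-processing step (removing extra newlines and reconnecting hyphenated words) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in run.py: the function name clean_ocr_text is hypothetical, blank lines are assumed to mark paragraph breaks, and every hyphen at the end of a line is assumed to be a line-break hyphen.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Tidy OCR output before summarization (hypothetical helper).

    Assumptions: blank lines separate paragraphs, and a hyphen at the
    end of a line always marks a word split across two lines.
    """
    text = raw.replace("\r\n", "\n")
    # Rejoin words split across lines: "sum-\nmary" becomes "summary".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Split on blank lines so paragraph breaks survive the cleanup.
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside each paragraph, collapse newlines and repeated spaces.
    cleaned = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)


if __name__ == "__main__":
    sample = "This is a sum-\nmary of the\ndocument.\n\n\nSecond para-\ngraph here.\n"
    print(clean_ocr_text(sample))
```

One limitation of this approach: a genuine compound that happens to break at its hyphen (for example "text-" at the end of a line followed by "searchable") is joined into a single word. A dictionary lookup could reduce such errors, at the cost of extra complexity.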

Citations

Boudin, Florian. “Pke: An Open Source Python-Based Keyphrase Extraction Toolkit.” Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, The COLING 2016 Organizing Committee, 2016, pp. 69–73. ACLWeb, https://www.aclweb.org/anthology/C16-2015.
