SummarizerScratch
https://github.com/silverstar194/SummerizerScratch

Overview:
This summarization tool takes a PDF, extracts the text, and provides a list of the most important sentences. It bases importance on word frequency and on the position of each sentence within the PDF. Loosely based on "Summarization of Document using Java" by Priyanka Sarraf (Banasthali University) and Yogesh Kumar Meena (Assistant Professor, MNIT). It also aims to complete "How to Build an Automated Text Summarizer: An extraction-based summarizer", CS784 Spring 2013, David R. Cheriton School of Computer Science (although I am not enrolled in the course). See the documents directory for more information on the research paper and course.

Program Flow:
1. PDF filepath and compression ratio are provided by the user
2. Using PDFBox, text is extracted and broken into sentences
3. Each word is stemmed using the Porter stemmer algorithm
4. Occurrences of each stemmed word are counted
5. Sentence keyword score is calculated based on the number of keywords and the frequency weight of each keyword
6. Sentence position score is calculated, with the middle having the highest weight and the beginning and end lower weight
7. Position score and keyword score are each weighted with constants from Config.java
8. Highest rated sentences (up to the ratio limit) are selected and ordered based on position
9. Sentences are written to the summary PDF using PDFBox

Future Improvements:
- Incorporate cue words
- Move away from linear weight of keywords
- Highlight original text within PDF
- Allow scores from adjacent sentences to compound
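
Scoring sketch:
The scoring and selection steps of the program flow (steps 4 through 8) can be illustrated roughly as below. This is a minimal sketch, not the project's actual code. The class name ScoringSketch, the method names, the two weight values, the triangular position formula, and the rule of always keeping at least one sentence are all assumptions made for illustration; the real constants live in Config.java and are not listed in this README. Sentences are assumed to be already extracted and stemmed (steps 1 through 3), so PDFBox and the Porter stemmer are left out.

```java
import java.util.*;

class ScoringSketch {
    // Hypothetical weights (assumption); the real values are in Config.java.
    static final double KEYWORD_WEIGHT = 0.7;
    static final double POSITION_WEIGHT = 0.3;

    // Step 4: count occurrences of each (already stemmed) word.
    static Map<String, Integer> countWords(List<List<String>> sentences) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> sentence : sentences) {
            for (String word : sentence) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Step 5: linear keyword score, the sum of each word's relative frequency (assumed form).
    static double keywordScore(List<String> sentence, Map<String, Integer> counts, int totalWords) {
        double score = 0.0;
        for (String word : sentence) {
            score += counts.getOrDefault(word, 0) / (double) totalWords;
        }
        return score;
    }

    // Step 6: position score, 0 at the first and last sentence and 1 in the middle (assumed shape).
    static double positionScore(int index, int sentenceCount) {
        if (sentenceCount <= 1) {
            return 1.0;
        }
        return 1.0 - Math.abs(2.0 * index / (sentenceCount - 1) - 1.0);
    }

    // Steps 7 and 8: combine the weighted scores, keep the top fraction, restore document order.
    static List<Integer> selectSentences(List<List<String>> sentences, double ratio) {
        Map<String, Integer> counts = countWords(sentences);
        int totalWords = 0;
        for (List<String> sentence : sentences) {
            totalWords += sentence.size();
        }
        int n = sentences.size();
        double[] scores = new double[n];
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            scores[i] = KEYWORD_WEIGHT * keywordScore(sentences.get(i), counts, totalWords)
                    + POSITION_WEIGHT * positionScore(i, n);
            order.add(i);
        }
        // Highest score first.
        order.sort((a, b) -> Double.compare(scores[b], scores[a]));
        // Keep at least one sentence when the document is not empty (assumption).
        int keep = Math.min(n, Math.max(1, (int) Math.floor(n * ratio)));
        List<Integer> chosen = new ArrayList<>(order.subList(0, keep));
        Collections.sort(chosen); // back to original position order
        return chosen;
    }

    public static void main(String[] args) {
        List<List<String>> sentences = List.of(
                List.of("cat"),
                List.of("dog", "dog", "cat"),
                List.of("bird"),
                List.of("dog"));
        // Indices of the sentences kept at a ratio of 0.5.
        System.out.println(selectSentences(sentences, 0.5)); // prints [1, 3]
    }
}
```

In the example, sentence 1 wins on both keyword frequency and position, and sentence 3 is kept over sentence 2 because its keyword score outweighs the position penalty under these assumed weights. Changing the two weight constants shifts that balance, which is the role the README assigns to the constants in Config.java.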