Team Setup
proc_docset.sh input_xml_file1 input_xml_file2 input_xml_file3 training_files_path training_files_output_dir dev_files_path dev_files_output_dir eval_files_path eval_files_output_dir
Location of all source code
extract.py
- Extract topic ids and the corresponding doc ids from DocSetA (into a format such as `topic.docSetA.id`); see the sketch after this list
- Transform each doc id into a file path
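To make this step concrete, below is a minimal sketch of how extract.py could build the doc-id-to-docsetA-id mapping with `xml.etree.ElementTree`. The element and attribute names (`topic`, `docsetA`, `doc`, `id`) are assumptions about the topic-file layout, not a confirmed schema.

```python
# Hedged sketch of the extract.py mapping step. The tag names <topic>,
# <docsetA>, and <doc>, and the "id" attributes, are assumptions about
# the guided summarization topic file, not a confirmed schema.
import xml.etree.ElementTree as ET

def extract_docset_pairs(topic_xml_path):
    """Map each doc id to its docsetA id, e.g. XIN_ENG_20041113.0001 -> D0901A-A."""
    docID_docsetID_pairs = {}
    root = ET.parse(topic_xml_path).getroot()
    for topic in root.iter("topic"):
        for docset in topic.iter("docsetA"):
            docset_id = docset.get("id")  # e.g. D0901A-A
            for doc in docset.iter("doc"):
                docID_docsetID_pairs[doc.get("id")] = docset_id
    return docID_docsetID_pairs
```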
{training|dev|eval}_process.py
- Use the `xml.etree.ElementTree` library to parse XML (a sketch of this parsing logic follows the list)
- If the file is NOT standard XML (e.g. /corpora/LDC/LDC02T31/apw/1999/19990914_APW_ENG), modify the file by adding a `<DOCSTREAM>` root tag
- Find the doc id by indexing the DOCNO keyword
- If the file is standard XML, find the doc by looking for the id keyword
- Extract the headline and dateline from the HEADLINE and DATELINE keywords
- Get the individual sentences via `doc.find("TEXT").findall("P")`
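The bullets above translate into roughly the following routine. This is a sketch under the assumption that documents sit in `<DOC>` elements with `<DOCNO>`, `<HEADLINE>`, `<DATELINE>`, and `<TEXT>`/`<P>` children; real LDC files may need extra cleanup (e.g. escaping stray entities) before ElementTree will accept them.

```python
# Sketch of the {training|dev|eval}_process.py parsing logic. Tag names
# follow the bullets above; real LDC files may need additional cleanup
# (e.g. escaping stray '&') before ElementTree accepts them.
import xml.etree.ElementTree as ET

def parse_doc_file(path):
    with open(path, encoding="utf-8", errors="ignore") as f:
        raw = f.read()
    # Non-standard files have no single root, so wrap them in <DOCSTREAM>.
    if not raw.lstrip().startswith("<DOCSTREAM"):
        raw = "<DOCSTREAM>\n" + raw + "\n</DOCSTREAM>"
    docs = {}
    for doc in ET.fromstring(raw).iter("DOC"):
        # Standard XML carries the id as an attribute; older files use <DOCNO>.
        doc_id = doc.get("id") or (doc.findtext("DOCNO") or "").strip()
        headline = (doc.findtext("HEADLINE") or "").strip()
        dateline = (doc.findtext("DATELINE") or "").strip()
        text = doc.find("TEXT")
        sentences = [(p.text or "").strip() for p in text.findall("P")] if text is not None else []
        docs[doc_id] = {"headline": headline, "dateline": dateline, "sentences": sentences}
    return docs
```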
- Including the 3 guided summary xml files for the training, dev, and eval sets
- The data come from the TAC 2010 Guided Summarization task
- devtest and evaltest each have an accompanying categories.txt file that captures the 5 types of topics occurring in the docset
- Including the results from each of the 3 sets, respectively, under {training|devtest|evaltest}_output/
- Create sub-directories named by `docsetA_id` (e.g., D0901A-A) by joining `output_directory` with `docID_docsetID_pairs[docID]`
- Store the outputs in each of the above sub-directories as files named by `doc_id` (e.g., XIN_ENG_20041113.0001); see the sketch after this list
- In total, 71 docsetA sub-directories under training_output/, 88 under devtest_output/, and 44 under evaltest_output/
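A minimal sketch of this output layout, reusing the variable names from the bullets above; exactly what is written per file is an assumption.

```python
# Sketch of the output layout: one sub-directory per docsetA id, one file
# per doc id. Variable names mirror the bullets above; the per-file
# content written here is an assumption.
import os

def write_outputs(output_directory, docID_docsetID_pairs, processed_docs):
    for doc_id, payload in processed_docs.items():
        subdir = os.path.join(output_directory, docID_docsetID_pairs[doc_id])  # e.g. .../D0901A-A
        os.makedirs(subdir, exist_ok=True)
        out_path = os.path.join(subdir, doc_id)  # e.g. .../D0901A-A/XIN_ENG_20041113.0001
        with open(out_path, "w", encoding="utf-8") as f:
            f.write("\n".join(payload["sentences"]))
```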
- XIN docs only exist after 2000; for these files, we use Fei's convention to find the path.
- Before 2000, XIN docs are actually XIE docs. For XIN doc ids in the period 1996-2000, we use Fei's convention to find the path.
- NYT docs before the year 2000 correspond to files without `_ENG` in the `doc_id` (see the sketch below).
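The id rules above can be illustrated as follows. The function below is hypothetical: it does not reproduce Fei's actual path convention, only the source/year renaming the notes describe.

```python
# Hypothetical illustration of the doc_id rules above. Only two facts come
# from the notes: XIN docs dated before 2000 are stored as XIE, and pre-2000
# NYT doc_ids carry no _ENG. This is NOT Fei's actual path convention.
def source_and_year(doc_id):
    if "_" in doc_id:                    # e.g. XIN_ENG_20041113.0001
        source, _lang, date_part = doc_id.split("_")
    else:                                # e.g. NYT19990914.0001 (pre-2000, no _ENG)
        source, date_part = doc_id[:3], doc_id[3:]
    year = int(date_part[:4])
    if source == "XIN" and year < 2000:  # XIN docs are actually XIE docs before 2000
        source = "XIE"
    return source, year
```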
- Use the `xml.etree.ElementTree` library to parse XML in extract.py and {training|dev|eval}_process.py
- Use `nltk.word_tokenize` to tokenize the sentences in {training|dev|eval}_process.py (see the sketch below)
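For example (note that `nltk.word_tokenize` needs the punkt resource downloaded once):

```python
# Tokenization as used in {training|dev|eval}_process.py. word_tokenize
# requires the "punkt" resource to have been downloaded once.
import nltk
nltk.download("punkt", quiet=True)

tokens = nltk.word_tokenize("The vice president arrived in Beijing.")
# ['The', 'vice', 'president', 'arrived', 'in', 'Beijing', '.']
```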
- Clone the pretrained model (with git-lfs, so make sure you have it installed: https://git-lfs.com):

git lfs install
git clone https://huggingface.co/google/bert_uncased_L-12_H-768_A-12
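Once cloned, the checkpoint can be loaded from the local directory. The sketch below assumes the `transformers` package (plus PyTorch) is installed and that you run it from the directory containing the clone.

```python
# Sketch of loading the cloned checkpoint locally. Assumes the
# `transformers` package and PyTorch are installed, and the clone
# sits in the current working directory.
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("./bert_uncased_L-12_H-768_A-12")
model = BertModel.from_pretrained("./bert_uncased_L-12_H-768_A-12")

inputs = tokenizer("A sample sentence.", return_tensors="pt")
outputs = model(**inputs)  # outputs.last_hidden_state has shape [1, seq_len, 768]
```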
Final Summarization Systems
Final Report