Team Setup
proc_docset.sh input_xml_file1 input_xml_file2 input_xml_file3 training_files_path training_files_output_dir dev_files_path dev_files_output_dir eval_files_path eval_files_output_dir
Location of all source code
extract.py
- Extract topic ids and the corresponding doc ids from DocSetA (into a format such as `topic.docSetA.id`); see the sketch after this list
- Transform each doc id into a file path
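To make this step concrete, below is a minimal sketch of how extract.py could build the doc-id-to-docsetA-id mapping with `xml.etree.ElementTree`. The element and attribute names (`topic`, `docsetA`, `doc`, `id`) are assumptions about the topic-file layout, not a confirmed schema.

```python
# Hedged sketch of the extract.py mapping step. The tag names <topic>,
# <docsetA>, and <doc>, and the "id" attributes, are assumptions about
# the guided summarization topic file, not a confirmed schema.
import xml.etree.ElementTree as ET

def extract_docset_pairs(topic_xml_path):
    """Map each doc id to its docsetA id, e.g. XIN_ENG_20041113.0001 -> D0901A-A."""
    docID_docsetID_pairs = {}
    root = ET.parse(topic_xml_path).getroot()
    for topic in root.iter("topic"):
        for docset in topic.iter("docsetA"):
            docset_id = docset.get("id")  # e.g. D0901A-A
            for doc in docset.iter("doc"):
                docID_docsetID_pairs[doc.get("id")] = docset_id
    return docID_docsetID_pairs
```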
{training|dev|eval}_process.py
- Use the `xml.etree.ElementTree` library to parse XML (a sketch of this parsing logic follows the list)
- If the file is NOT standard XML (e.g. /corpora/LDC/LDC02T31/apw/1999/19990914_APW_ENG), modify the file by adding a `<DOCSTREAM>` root tag
- Find the doc id by indexing the DOCNO keyword
- If the file is standard XML, find the doc by looking for the id keyword
- Extract the headline and dateline from the HEADLINE and DATELINE keywords
- Get the individual sentences via `doc.find("TEXT").findall("P")`
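The bullets above translate into roughly the following routine. This is a sketch under the assumption that documents sit in `<DOC>` elements with `<DOCNO>`, `<HEADLINE>`, `<DATELINE>`, and `<TEXT>`/`<P>` children; real LDC files may need extra cleanup (e.g. escaping stray entities) before ElementTree will accept them.

```python
# Sketch of the {training|dev|eval}_process.py parsing logic. Tag names
# follow the bullets above; real LDC files may need additional cleanup
# (e.g. escaping stray '&') before ElementTree accepts them.
import xml.etree.ElementTree as ET

def parse_doc_file(path):
    with open(path, encoding="utf-8", errors="ignore") as f:
        raw = f.read()
    # Non-standard files have no single root, so wrap them in <DOCSTREAM>.
    if not raw.lstrip().startswith("<DOCSTREAM"):
        raw = "<DOCSTREAM>\n" + raw + "\n</DOCSTREAM>"
    docs = {}
    for doc in ET.fromstring(raw).iter("DOC"):
        # Standard XML carries the id as an attribute; older files use <DOCNO>.
        doc_id = doc.get("id") or (doc.findtext("DOCNO") or "").strip()
        headline = (doc.findtext("HEADLINE") or "").strip()
        dateline = (doc.findtext("DATELINE") or "").strip()
        text = doc.find("TEXT")
        sentences = [(p.text or "").strip() for p in text.findall("P")] if text is not None else []
        docs[doc_id] = {"headline": headline, "dateline": dateline, "sentences": sentences}
    return docs
```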
- Including the 3 guided summary xml files for the training, dev, and eval sets
- The data come from the TAC 2010 Guided Summarization task
- devtest and evaltest each have an accompanying categories.txt file that captures the 5 types of topics occurring in the docset
- Including the results from each of the 3 sets, respectively, under {training|devtest|evaltest}_output/
- Create sub-directories named by `docsetA_id` (e.g., D0901A-A) by joining `output_directory` with `docID_docsetID_pairs[docID]`
- Store the outputs in each of the above sub-directories as files named by `doc_id` (e.g., XIN_ENG_20041113.0001); see the sketch after this list
- In total, 71 docsetA sub-directories under training_output/, 88 under devtest_output/, and 44 under evaltest_output/
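A minimal sketch of this output layout, reusing the variable names from the bullets above; exactly what is written per file is an assumption.

```python
# Sketch of the output layout: one sub-directory per docsetA id, one file
# per doc id. Variable names mirror the bullets above; the per-file
# content written here is an assumption.
import os

def write_outputs(output_directory, docID_docsetID_pairs, processed_docs):
    for doc_id, payload in processed_docs.items():
        subdir = os.path.join(output_directory, docID_docsetID_pairs[doc_id])  # e.g. .../D0901A-A
        os.makedirs(subdir, exist_ok=True)
        out_path = os.path.join(subdir, doc_id)  # e.g. .../D0901A-A/XIN_ENG_20041113.0001
        with open(out_path, "w", encoding="utf-8") as f:
            f.write("\n".join(payload["sentences"]))
```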
- XIN docs only exist after 2000; for these files, we use Fei's convention to find the path.
- Before 2000, XIN docs are actually XIE docs. For XIN doc ids in the period 1996-2000, we use Fei's convention to find the path.
- NYT docs before the year 2000 correspond to files without `_ENG` in the `doc_id` (see the sketch below).
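The id rules above can be illustrated as follows. The function below is hypothetical: it does not reproduce Fei's actual path convention, only the source/year renaming the notes describe.

```python
# Hypothetical illustration of the doc_id rules above. Only two facts come
# from the notes: XIN docs dated before 2000 are stored as XIE, and pre-2000
# NYT doc_ids carry no _ENG. This is NOT Fei's actual path convention.
def source_and_year(doc_id):
    if "_" in doc_id:                    # e.g. XIN_ENG_20041113.0001
        source, _lang, date_part = doc_id.split("_")
    else:                                # e.g. NYT19990914.0001 (pre-2000, no _ENG)
        source, date_part = doc_id[:3], doc_id[3:]
    year = int(date_part[:4])
    if source == "XIN" and year < 2000:  # XIN docs are actually XIE docs before 2000
        source = "XIE"
    return source, year
```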
- Use the `xml.etree.ElementTree` library to parse XML in extract.py and {training|dev|eval}_process.py
- Use `nltk.word_tokenize` to tokenize the sentences in {training|dev|eval}_process.py (see the sketch below)
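For example (note that `nltk.word_tokenize` needs the punkt resource downloaded once):

```python
# Tokenization as used in {training|dev|eval}_process.py. word_tokenize
# requires the "punkt" resource to have been downloaded once.
import nltk
nltk.download("punkt", quiet=True)

tokens = nltk.word_tokenize("The vice president arrived in Beijing.")
# ['The', 'vice', 'president', 'arrived', 'in', 'Beijing', '.']
```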
- Clone the pretrained model (with git-lfs, so make sure you have it installed: https://git-lfs.com):

git lfs install
git clone https://huggingface.co/google/bert_uncased_L-12_H-768_A-12
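Once cloned, the checkpoint can be loaded from the local directory. The sketch below assumes the `transformers` package (plus PyTorch) is installed and that you run it from the directory containing the clone.

```python
# Sketch of loading the cloned checkpoint locally. Assumes the
# `transformers` package and PyTorch are installed, and the clone
# sits in the current working directory.
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("./bert_uncased_L-12_H-768_A-12")
model = BertModel.from_pretrained("./bert_uncased_L-12_H-768_A-12")

inputs = tokenizer("A sample sentence.", return_tensors="pt")
outputs = model(**inputs)  # outputs.last_hidden_state has shape [1, seq_len, 768]
```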
Final Summarization Systems
Final Report