Preprocessing

These are instructions for preprocessing a new dataset in the same way as the datasets we used. Tilse needs to be downloaded first; it contains the HeidelTime tool in this folder.

To start, we assume you have a directory for the new dataset with the following file structure:

dataset/
└── topic
    ├── articles.jsonl
    ├── timelines.jsonl
    └── keywords.json

Each line in articles.jsonl is a document or article in JSON format with the fields id, time, text, and optionally title. Document IDs must be unique. time is the publication time of the article; it should be at day granularity or finer and can be written in any format recognised by arrow.get in the arrow library. timelines.jsonl is only needed if you have ground-truth timelines.
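
For illustration, here is a minimal Python sketch that writes one article line in this format and checks its time field with arrow.get. The concrete field values are made up, and the assumption that keywords.json holds a plain JSON list of keyword strings is ours, not from the original instructions:

import json

import arrow

article = {
    "id": "article-001",   # must be unique across the dataset
    "time": "2020-03-14",  # day-level; any format arrow.get recognises works
    "text": "Full text of the article goes here.",
    "title": "An optional headline",
}

# arrow.get raises an exception if the time format is not recognised
arrow.get(article["time"])

with open("dataset/topic/articles.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(article) + "\n")

# assumption: keywords.json is a JSON list of keyword strings
with open("dataset/topic/keywords.json", "w", encoding="utf-8") as f:
    json.dump(["example", "keyword"], f)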

You can download a "raw" mock dataset from here.

We need to define two paths:

DATASET=<your dataset folder>
HEIDELTIME=<heideltime folder from above>

We then run these preprocessing steps (following Tilse):

python preprocess_tokenize.py --dataset $DATASET
python preprocess_heideltime.py --dataset $DATASET --heideltime $HEIDELTIME
python preprocess_spacy.py --dataset $DATASET

Note that the second step (running HeidelTime) is the slowest. It is also the reason some articles may be removed in the last step: articles whose HeidelTime output cannot be parsed are dropped.

Loading the preprocessed dataset:

from news_tls.data import Dataset

dataset = Dataset('<path to dataset>')
for col in dataset.collections:
    print(col.name) # topic name
    print(col.keywords) # topic keywords
    for a in col.articles(): # iterate over this topic's articles
        pass # do something with the articles
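
As a hedged usage sketch, the loop body above could print the publication time and a snippet of each article. The attribute names a.time and a.text mirror the JSON fields described earlier and are assumptions about the article objects:

for col in dataset.collections:
    for a in col.articles():
        # a.time / a.text are assumed to mirror the time/text fields above
        print(a.time, a.text[:100])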