Corpus Format

This guide explains how you can convert your own annotated data to the corpus format used in this project. You can then use that corpus to train your own sequence labeling models.

Data Format

We use the brat standoff annotation format. You will need two files for each document in your dataset: document_[id].txt and document_[id].ann.

Example document_1.txt:

Dit is stukje tekst met daarin de naam Ilker Koopal. De patient I. Koopal (e: [email protected], t: 06-16769063) is 64 jaar oud
en woonachtig in Orvelte. Hij werd op 22 november door arts Omid Esajas ontslagen van de kliniek van het UMCU.

Example document_1.ann:

T1 Name 39 51 Ilker Koopal
T2 Name 64 73 I. Koopal
T3 Email 78 95 [email protected]
T4 Phone_fax 100 111 06-16769063
T5 Age 116 123 64 jaar
T6 Address 145 152 Orvelte
T7 Date 166 177 22 november
T8 Name 188 199 Omid Esajas
T9 Hospital 233 237 UMCU
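
In brat standoff files, each text-bound annotation is a tab-separated line holding an ID, then the entity type with start and end character offsets, then the covered text. The end offset is exclusive. The sketch below is not part of this project's code: it is a minimal parser, under the assumption that all annotations are contiguous `T` lines, that checks whether each span matches the document text. The `Annotation` type, the `parse_ann` helper, and the shortened sample sentence are hypothetical and only serve as illustration.

```python
from typing import List, NamedTuple


class Annotation(NamedTuple):
    ann_id: str
    tag: str
    start: int  # index of the first character of the span
    end: int    # index one past the last character of the span
    text: str


def parse_ann(ann_content: str) -> List[Annotation]:
    """Parse text-bound ("T") lines of a brat .ann file.

    Assumption: spans are contiguous (no ';'-separated fragments).
    """
    annotations = []
    for line in ann_content.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, attributes, notes, blank lines
        ann_id, type_and_span, text = line.split("\t", 2)
        tag, start, end = type_and_span.split(" ")
        annotations.append(Annotation(ann_id, tag, int(start), int(end), text))
    return annotations


# Shortened, hypothetical variant of the example document above.
doc_text = (
    "Dit is stukje tekst met daarin de naam Ilker Koopal. "
    "De patient I. Koopal is 64 jaar oud."
)
ann_content = (
    "T1\tName 39 51\tIlker Koopal\n"
    "T2\tName 64 73\tI. Koopal\n"
    "T3\tAge 77 84\t64 jaar\n"
)

annotations = parse_ann(ann_content)
for ann in annotations:
    # A mismatch here usually means the offsets were computed on different text.
    status = "ok" if doc_text[ann.start:ann.end] == ann.text else "MISMATCH"
    print(ann.ann_id, ann.tag, status)
```

Running a check like this before training can catch offset drift, for example when line endings or encodings changed between annotation and export.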

Corpus Location

After you have converted your documents to the standoff format, copy them to the data/corpus/<corpus_name>/ directory. Below is an example for the dummy corpus. In all experiment code, we follow the convention that the name of the corpus directory identifies the dataset.

data/
└── corpus
    └── dummy
        ├── dev
        │   ├── example-3.ann
        │   └── example-3.txt
        ├── test
        │   ├── example-2.ann
        │   └── example-2.txt
        └── train
            ├── example-1.ann
            └── example-1.txt
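
Before loading a corpus, it can help to confirm that every split directory contains matching .txt/.ann pairs. The following is a rough sketch of such a check, not a utility shipped with this project. The `find_unpaired_documents` helper is hypothetical, and the demo builds the dummy layout shown above in a temporary directory instead of touching real data.

```python
import tempfile
from pathlib import Path
from typing import List

SPLITS = ("train", "dev", "test")


def find_unpaired_documents(corpus_dir: Path) -> List[str]:
    """Return 'split/stem' entries that lack either the .txt or the .ann file."""
    unpaired = []
    for split in SPLITS:
        split_dir = corpus_dir / split
        txt_stems = {p.stem for p in split_dir.glob("*.txt")}
        ann_stems = {p.stem for p in split_dir.glob("*.ann")}
        # The symmetric difference holds stems present in only one of the sets.
        for stem in sorted(txt_stems ^ ann_stems):
            unpaired.append(f"{split}/{stem}")
    return unpaired


with tempfile.TemporaryDirectory() as tmp:
    corpus_dir = Path(tmp) / "data" / "corpus" / "dummy"
    # Recreate the dummy layout from the tree above with empty files.
    for split, stem in zip(SPLITS, ("example-1", "example-3", "example-2")):
        (corpus_dir / split).mkdir(parents=True)
        (corpus_dir / split / f"{stem}.txt").write_text("", encoding="utf-8")
        (corpus_dir / split / f"{stem}.ann").write_text("", encoding="utf-8")

    unpaired = find_unpaired_documents(corpus_dir)
    print(unpaired)

    # Add a text file without annotations to show what a failure looks like.
    (corpus_dir / "train" / "orphan.txt").write_text("", encoding="utf-8")
    unpaired_after = find_unpaired_documents(corpus_dir)
    print(unpaired_after)
```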

Create a train/dev/test Split

If you don't have a predefined train/dev/test split, you can also use the following utility to create one:

python deidentify/dataset/brat2corpus.py <corpus_name> <data_path>

This script takes all *.ann/*.txt files in <data_path> and creates a new corpus at data/corpus/<corpus_name> with a 60/20/20 train/dev/test split.
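
To make the 60/20/20 ratio concrete, here is a minimal sketch of how document IDs could be divided into the three sets. It does not reproduce the actual logic of brat2corpus.py, which may shuffle or round differently. The `split_documents` function, its parameters, and the fixed seed are assumptions made for illustration.

```python
import random
from typing import Dict, Iterable, List


def split_documents(
    doc_ids: Iterable[str],
    train_ratio: float = 0.6,
    dev_ratio: float = 0.2,
    seed: int = 42,
) -> Dict[str, List[str]]:
    """Shuffle document IDs reproducibly and cut them into train/dev/test."""
    shuffled = sorted(doc_ids)  # sort first so the result depends only on the seed
    random.Random(seed).shuffle(shuffled)

    n_train = int(len(shuffled) * train_ratio)
    n_dev = int(len(shuffled) * dev_ratio)
    return {
        "train": shuffled[:n_train],
        "dev": shuffled[n_train:n_train + n_dev],
        "test": shuffled[n_train + n_dev:],  # remainder goes to the test set
    }


doc_ids = [f"document_{i}" for i in range(1, 11)]
splits = split_documents(doc_ids)
for name, ids in splits.items():
    print(name, len(ids))
```

Splitting at the document level, rather than by sentence, keeps all annotations of one document in the same set.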

Load Your Corpus

After you have created your corpus at data/corpus/<corpus_name>, you should be able to load it by executing:

from deidentify.dataset.corpus_loader import CorpusLoader, CORPUS_PATH

# Pick the name of your corpus here:
corpus = CorpusLoader().load_corpus(path=CORPUS_PATH['dummy'])
print(corpus)

# This should print:
# Corpus(name=dummy). Number of Documents (train/dev/test): 1/1/1