OpenNLP

André Pires edited this page Jun 16, 2017 · 48 revisions

Steps to run OpenNLP

  1. Download OpenNLP from the Apache OpenNLP website.
  2. Run the command to train the model (script): `opennlp TokenNameFinderTrainer -model <model.bin> -lang <pt> -data <training_data.txt> -encoding <UTF-8>`
    1. model.bin - Name of the output model
    2. pt - Language of the model
    3. training_data.txt - Input dataset, in the right format, for training the NER model
    4. UTF-8 - Encoding of the training data
  3. Run the command to perform NER (script): `opennlp TokenNameFinder <model.bin> < <corpus_test.txt> > <output file>`
    1. model.bin - Name of the input model
    2. corpus_test.txt - Input dataset, in the right format, for evaluating the NER model (note: it has to be in UTF-8)
    3. output file - Output file for the tagged text
  4. Run the command to evaluate NER (script): `opennlp TokenNameFinderEvaluator -encoding <UTF-8> -model <model.bin> -data <corpus_test.txt>`
    1. model.bin - Name of the input model
    2. corpus_test.txt - Input dataset, in the right format, for evaluating the NER model
    3. UTF-8 - Encoding of the test data
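When running repeated experiments, it can help to assemble these invocations programmatically. A minimal Python sketch for the training call from step 2 (the helper name and default file names are illustrative, not part of OpenNLP):

```python
# Hypothetical helper that assembles the TokenNameFinderTrainer invocation
# from step 2; the default file names are placeholders.
def training_command(model="model.bin", lang="pt", data="training_data.txt"):
    """Build the argv list for the training call."""
    return ["opennlp", "TokenNameFinderTrainer",
            "-model", model, "-lang", lang,
            "-data", data, "-encoding", "UTF-8"]

print(" ".join(training_command()))
# opennlp TokenNameFinderTrainer -model model.bin -lang pt -data training_data.txt -encoding UTF-8
```

The argv-list form can be passed directly to `subprocess.run` without shell quoting concerns.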

Check this folder for more information.

OpenNLP dataset format

A file with one sentence per line, where entities are delimited by `<START:tag-name>` and `<END>` tags. Example:

<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .

Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .

<START:person> Rudolph Agnew <END> , 55 years old and former chairman of Consolidated Gold Fields PLC , was named a director of this British industrial conglomerate .
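The format above can be parsed with a simple regular expression. A minimal sketch (the helper name is illustrative):

```python
import re

# Matches one <START:tag> ... <END> span; the spaces around the tags
# follow the format shown above.
ENTITY = re.compile(r"<START:(\w+)> (.+?) <END>")

def extract_entities(sentence):
    """Return (tag, entity text) pairs found in one OpenNLP-format sentence."""
    return ENTITY.findall(sentence)

sentence = ("<START:person> Pierre Vinken <END> , 61 years old , "
            "will join the board as a nonexecutive director Nov. 29 .")
print(extract_entities(sentence))  # [('person', 'Pierre Vinken')]
```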

Convert HAREM dataset to OpenNLP format

Step 1 - Change tags

Using Python's `re` library, replace the `<EM>` opening tags with the appropriate `<START>` tags, and the `</EM>` closing tags with `<END>` tags:

```python
import re

newdata = re.sub(r"<EM CATEG=\"(\w+)\">", r"<START:\1>", newdata)
newdata = newdata.replace("</EM>", "<END>")
```
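Put together, the two substitutions behave like this (the sample HAREM-style sentence is made up for illustration):

```python
import re

# End-to-end sketch of the tag conversion; the sample sentence is invented.
harem = ('O <EM CATEG="PESSOA">Eça de Queirós</EM> nasceu na '
         '<EM CATEG="LOCAL">Póvoa de Varzim</EM> .')

newdata = re.sub(r"<EM CATEG=\"(\w+)\">", r"<START:\1>", harem)
newdata = newdata.replace("</EM>", "<END>")

print(newdata)
# O <START:PESSOA>Eça de Queirós<END> nasceu na <START:LOCAL>Póvoa de Varzim<END> .
```

Note that the converted tags are not yet surrounded by spaces; that is handled in step 2.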

Check this folder for the scripts used in this process.

Step 2 - Sentence segmentation

After replacing the tags, sentence segmentation has to be performed. NLTK was used for this purpose, in this script.

Since the segmentation was not perfect, I had to rejoin faulty segmentations using this script.
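The linked script is the authoritative version; as a hypothetical sketch of the idea, a line whose `<START:...>` tags outnumber its `<END>` tags was cut inside an entity and can be merged with the following line:

```python
import re

def join_split_entities(sentences):
    """Rejoin sentences that were split inside a <START:...> ... <END> span.

    Hypothetical heuristic: while a buffered line has more <START:...> tags
    than <END> tags, the segmenter cut through an entity, so keep merging
    the next line into it.
    """
    fixed = []
    buffer = ""
    for sentence in sentences:
        buffer = (buffer + " " + sentence).strip() if buffer else sentence
        starts = len(re.findall(r"<START:\w+>", buffer))
        ends = buffer.count("<END>")
        if starts == ends:
            fixed.append(buffer)
            buffer = ""
    if buffer:  # trailing unbalanced line, keep it rather than drop data
        fixed.append(buffer)
    return fixed

broken = ["<START:person> Rudolph", "Agnew <END> was named a director ."]
print(join_split_entities(broken))
# ['<START:person> Rudolph Agnew <END> was named a director .']
```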

Then, since each start and end tag must be preceded and followed by a space (otherwise it will not work), I added the missing spaces using this script.
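A hedged sketch of such a padding step (the actual script is linked above; the regexes here are illustrative):

```python
import re

def pad_tags(text):
    """Ensure a single space before and after every <START:...> and <END> tag."""
    text = re.sub(r"\s*(<START:\w+>)\s*", r" \1 ", text)
    text = re.sub(r"\s*(<END>)\s*", r" \1 ", text)
    # Collapse any double spaces introduced by the padding.
    return re.sub(r" {2,}", " ", text).strip()

print(pad_tags("O <START:PESSOA>Eça de Queirós<END> nasceu ."))
# O <START:PESSOA> Eça de Queirós <END> nasceu .
```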

Finally, in order to provide the input for OpenNLP's TokenNameFinder, I removed the tags from the test set; check the script here.
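Stripping the tags can be sketched as follows (the helper name is illustrative; the linked script is the authoritative version):

```python
import re

def strip_tags(text):
    """Remove <START:...> and <END> tags to produce plain test-set input."""
    text = re.sub(r"<START:\w+> ?", "", text)
    text = re.sub(r"<END> ?", "", text)
    return re.sub(r" {2,}", " ", text).strip()

print(strip_tags("<START:person> Pierre Vinken <END> , 61 years old ."))
# Pierre Vinken , 61 years old .
```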

OpenNLP to CoNLL

In order to evaluate the results using the conlleval script, I had to convert OpenNLP's output to the CoNLL format. Check here for the script.
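The linked script is the authoritative conversion; a hypothetical sketch that maps each token to a BIO label looks like this:

```python
import re

def opennlp_to_conll(sentence):
    """Convert one OpenNLP-tagged sentence to (token, BIO label) pairs."""
    rows, label = [], "O"
    for token in sentence.split():
        m = re.match(r"<START:(\w+)>", token)
        if m:
            label = "B-" + m.group(1)       # next token starts an entity
        elif token == "<END>":
            label = "O"                     # entity span is over
        else:
            rows.append((token, label))
            if label.startswith("B-"):
                label = "I-" + label[2:]    # subsequent entity tokens are I-
    return rows

print(opennlp_to_conll("<START:person> Pierre Vinken <END> , 61 years old ."))
# [('Pierre', 'B-person'), ('Vinken', 'I-person'), (',', 'O'), ('61', 'O'), ('years', 'O'), ('old', 'O'), ('.', 'O')]
```

Each pair would then be written as one `token label` line, with a blank line between sentences, as conlleval expects.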

Average results

Check this folder for all results.

Results after 4 repeats:

| Level      | Precision | Recall | F-measure |
|------------|-----------|--------|-----------|
| Categories | 55.43%    | 51.94% | 53.63%    |
| Types      | 52.13%    | 45.40% | 48.53%    |
| Subtypes   | 72.60%    | 39.00% | 50.74%    |
| Filtered   | 69.55%    | 48.93% | 57.44%    |

Note: to ensure correct evaluation results, I used a script to report any tokenization differences between the output and the gold-standard data. Where differences existed, I fixed the files manually.
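Such a tokenization check can be sketched as follows (the helper name and the heuristic of comparing per-line token counts are assumptions, not the linked script):

```python
def tokenization_diffs(output_lines, gold_lines):
    """Report 1-based line numbers where whitespace token counts differ
    between the system output and the gold-standard data."""
    diffs = []
    for i, (out, gold) in enumerate(zip(output_lines, gold_lines), start=1):
        if len(out.split()) != len(gold.split()):
            diffs.append(i)
    return diffs

out = ["Pierre Vinken ,", "Mr . Vinken"]
gold = ["Pierre Vinken ,", "Mr. Vinken"]
print(tokenization_diffs(out, gold))  # [2]
```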

Hyperparameter study

For this tool, I decided to check the influence of two hyperparameters: cutoff and number of iterations. The results are as follows:

Cutoff (default: 5)

| Value | Categories | Types  | Subtypes | Filtered |
|-------|------------|--------|----------|----------|
| 0     | 49.93%     | 46.73% | 47.00%   | 49.27%   |
| 3     | 52.05%     | 48.90% | 52.52%   | 56.35%   |
| 4     | 52.38%     | 48.12% | 52.35%   | 56.72%   |
| 5     | 50.90%     | 47.59% | 50.76%   | 56.87%   |
| 6     | 50.85%     | 46.41% | 50.64%   | 55.91%   |
| 7     | 50.21%     | 46.34% | 50.73%   | 55.25%   |
| 10    | 49.09%     | 44.78% | 50.65%   | 54.56%   |

*Figure: OpenNLP cutoff graph*

Iterations (default: 100)

| Value | Categories | Types  | Subtypes | Filtered |
|-------|------------|--------|----------|----------|
| 70    | 50.75%     | 47.39% | 50.04%   | 55.83%   |
| 80    | 50.85%     | 47.51% | 50.52%   | 56.27%   |
| 90    | 50.91%     | 47.54% | 50.75%   | 56.52%   |
| 100   | 50.90%     | 47.59% | 50.76%   | 56.87%   |
| 110   | 50.94%     | 47.67% | 51.22%   | 57.16%   |
| 120   | 51.19%     | 47.81% | 51.81%   | 57.27%   |
| 125   | 51.31%     | 47.77% | 51.81%   | 57.38%   |
| 130   | 51.33%     | 47.68% | 51.91%   | 57.30%   |
| 135   | 51.22%     | 47.68% | 51.94%   | 57.26%   |
| 150   | 51.31%     | 47.56% | 52.02%   | 57.47%   |
| 160   | 51.46%     | 47.46% | 52.20%   | 57.50%   |
| 170   | 51.52%     | 47.43% | 52.61%   | 57.49%   |
| 180   | 51.45%     | 47.50% | 52.58%   | 57.44%   |
| 200   | 51.38%     | 47.58% | 52.59%   | 57.73%   |

*Figure: OpenNLP iterations graph*

Results for SIGARRA News Corpus

Repeated holdout

| Iterations | Cutoff | Precision | Recall | F-measure |
|------------|--------|-----------|--------|-----------|
| 100        | 5      | 87.99%    | 77.96% | 82.67%    |
| 170        | 4      | 87.87%    | 78.98% | 83.19%    |

Repeated 10-fold cross validation

| Iterations | Cutoff | Precision | Recall | F-measure |
|------------|--------|-----------|--------|-----------|
| 100        | 5      | 88.43%    | 79.12% | 83.52%    |
| 170        | 4      | 88.03%    | 79.85% | 83.74%    |

Resources

Get the generated models on the Resources page.