OpenNLP

André Pires edited this page Jun 16, 2017 · 48 revisions

Steps to run OpenNLP

  1. Download OpenNLP from the Apache OpenNLP website.
  2. Run the command to train the model (script): `opennlp TokenNameFinderTrainer -model <model.bin> -lang <pt> -data <training_data.txt> -encoding <UTF-8>`
    1. model.bin - Name of the output model
    2. pt - Language of the model
    3. training_data.txt - Input dataset, in the right format, for training the NER model
    4. UTF-8 - Encoding of the training data
  3. Run the command to perform NER (script): `opennlp TokenNameFinder <model.bin> < <corpus_test.txt> > <output file>`
    1. model.bin - Name of the input model
    2. corpus_test.txt - Input dataset, in the right format, for evaluating the NER model (note: it has to be in UTF-8)
    3. output file - Output file for the tagged text
  4. Run the command to evaluate NER (script): `opennlp TokenNameFinderEvaluator -encoding <UTF-8> -model <model.bin> -data <corpus_test.txt>`
    1. model.bin - Name of the input model
    2. corpus_test.txt - Input dataset, in the right format, for evaluating the NER model
    3. UTF-8 - Encoding of the test data
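When running repeated experiments, it can help to assemble these invocations programmatically. A minimal Python sketch for the training call from step 2 (the helper name and default file names are illustrative, not part of OpenNLP):

```python
# Hypothetical helper that assembles the TokenNameFinderTrainer invocation
# from step 2; the default file names are placeholders.
def training_command(model="model.bin", lang="pt", data="training_data.txt"):
    """Build the argv list for the training call."""
    return ["opennlp", "TokenNameFinderTrainer",
            "-model", model, "-lang", lang,
            "-data", data, "-encoding", "UTF-8"]

print(" ".join(training_command()))
# opennlp TokenNameFinderTrainer -model model.bin -lang pt -data training_data.txt -encoding UTF-8
```

The argv-list form can be passed directly to `subprocess.run` without shell quoting concerns.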

Check this folder for more information.

OpenNLP dataset format

A file with one sentence per line, where entities are delimited by `<START:tag-name>` and `<END>` tags. Example:

<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .

Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .

<START:person> Rudolph Agnew <END> , 55 years old and former chairman of Consolidated Gold Fields PLC , was named a director of this British industrial conglomerate .
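The format above can be parsed with a simple regular expression. A minimal sketch (the helper name is illustrative):

```python
import re

# Matches one <START:tag> ... <END> span; the spaces around the tags
# follow the format shown above.
ENTITY = re.compile(r"<START:(\w+)> (.+?) <END>")

def extract_entities(sentence):
    """Return (tag, entity text) pairs found in one OpenNLP-format sentence."""
    return ENTITY.findall(sentence)

sentence = ("<START:person> Pierre Vinken <END> , 61 years old , "
            "will join the board as a nonexecutive director Nov. 29 .")
print(extract_entities(sentence))  # [('person', 'Pierre Vinken')]
```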

Convert HAREM dataset to OpenNLP format

Step 1 - Change tags

Using Python's `re` library, replace the `<EM>` opening tags with the appropriate `<START>` tags, and the `</EM>` closing tags with `<END>` tags:

```python
import re

newdata = re.sub(r"<EM CATEG=\"(\w+)\">", r"<START:\1>", newdata)
newdata = newdata.replace("</EM>", "<END>")
```
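Put together, the two substitutions behave like this (the sample HAREM-style sentence is made up for illustration):

```python
import re

# End-to-end sketch of the tag conversion; the sample sentence is invented.
harem = ('O <EM CATEG="PESSOA">Eça de Queirós</EM> nasceu na '
         '<EM CATEG="LOCAL">Póvoa de Varzim</EM> .')

newdata = re.sub(r"<EM CATEG=\"(\w+)\">", r"<START:\1>", harem)
newdata = newdata.replace("</EM>", "<END>")

print(newdata)
# O <START:PESSOA>Eça de Queirós<END> nasceu na <START:LOCAL>Póvoa de Varzim<END> .
```

Note that the converted tags are not yet surrounded by spaces; that is handled in step 2.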

Check this folder for the scripts used in this process.

Step 2 - Sentence segmentation

After replacing the tags, sentence segmentation has to be performed. NLTK was used for this purpose, in this script.

Since the segmentation was not perfect, I had to rejoin faulty segmentations using this script.
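The linked script is the authoritative version; as a hypothetical sketch of the idea, a line whose `<START:...>` tags outnumber its `<END>` tags was cut inside an entity and can be merged with the following line:

```python
import re

def join_split_entities(sentences):
    """Rejoin sentences that were split inside a <START:...> ... <END> span.

    Hypothetical heuristic: while a buffered line has more <START:...> tags
    than <END> tags, the segmenter cut through an entity, so keep merging
    the next line into it.
    """
    fixed = []
    buffer = ""
    for sentence in sentences:
        buffer = (buffer + " " + sentence).strip() if buffer else sentence
        starts = len(re.findall(r"<START:\w+>", buffer))
        ends = buffer.count("<END>")
        if starts == ends:
            fixed.append(buffer)
            buffer = ""
    if buffer:  # trailing unbalanced line, keep it rather than drop data
        fixed.append(buffer)
    return fixed

broken = ["<START:person> Rudolph", "Agnew <END> was named a director ."]
print(join_split_entities(broken))
# ['<START:person> Rudolph Agnew <END> was named a director .']
```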

Then, since each start and end tag must be preceded and followed by a space (otherwise it will not work), I added the missing spaces using this script.
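A hedged sketch of such a padding step (the actual script is linked above; the regexes here are illustrative):

```python
import re

def pad_tags(text):
    """Ensure a single space before and after every <START:...> and <END> tag."""
    text = re.sub(r"\s*(<START:\w+>)\s*", r" \1 ", text)
    text = re.sub(r"\s*(<END>)\s*", r" \1 ", text)
    # Collapse any double spaces introduced by the padding.
    return re.sub(r" {2,}", " ", text).strip()

print(pad_tags("O <START:PESSOA>Eça de Queirós<END> nasceu ."))
# O <START:PESSOA> Eça de Queirós <END> nasceu .
```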

Finally, in order to provide the input for OpenNLP's TokenNameFinder, I removed the tags from the test set; check the script here.
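Stripping the tags can be sketched as follows (the helper name is illustrative; the linked script is the authoritative version):

```python
import re

def strip_tags(text):
    """Remove <START:...> and <END> tags to produce plain test-set input."""
    text = re.sub(r"<START:\w+> ?", "", text)
    text = re.sub(r"<END> ?", "", text)
    return re.sub(r" {2,}", " ", text).strip()

print(strip_tags("<START:person> Pierre Vinken <END> , 61 years old ."))
# Pierre Vinken , 61 years old .
```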

OpenNLP to CoNLL

In order to evaluate the results using the conlleval script, I had to convert OpenNLP's output to the CoNLL format. Check here for the script.
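The linked script is the authoritative conversion; a hypothetical sketch that maps each token to a BIO label looks like this:

```python
import re

def opennlp_to_conll(sentence):
    """Convert one OpenNLP-tagged sentence to (token, BIO label) pairs."""
    rows, label = [], "O"
    for token in sentence.split():
        m = re.match(r"<START:(\w+)>", token)
        if m:
            label = "B-" + m.group(1)       # next token starts an entity
        elif token == "<END>":
            label = "O"                     # entity span is over
        else:
            rows.append((token, label))
            if label.startswith("B-"):
                label = "I-" + label[2:]    # subsequent entity tokens are I-
    return rows

print(opennlp_to_conll("<START:person> Pierre Vinken <END> , 61 years old ."))
# [('Pierre', 'B-person'), ('Vinken', 'I-person'), (',', 'O'), ('61', 'O'), ('years', 'O'), ('old', 'O'), ('.', 'O')]
```

Each pair would then be written as one `token label` line, with a blank line between sentences, as conlleval expects.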

Average results

Check this folder for all results.

Results after 4 repeats:

| Level      | Precision | Recall | F-measure |
|------------|-----------|--------|-----------|
| Categories | 55.43%    | 51.94% | 53.63%    |
| Types      | 52.13%    | 45.40% | 48.53%    |
| Subtypes   | 72.60%    | 39.00% | 50.74%    |
| Filtered   | 69.55%    | 48.93% | 57.44%    |

Note: to ensure correct evaluation results, I used a script to report any tokenization differences between the output and the gold-standard data. Where differences existed, I fixed the files manually.
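Such a tokenization check can be sketched as follows (the helper name and the heuristic of comparing per-line token counts are assumptions, not the linked script):

```python
def tokenization_diffs(output_lines, gold_lines):
    """Report 1-based line numbers where whitespace token counts differ
    between the system output and the gold-standard data."""
    diffs = []
    for i, (out, gold) in enumerate(zip(output_lines, gold_lines), start=1):
        if len(out.split()) != len(gold.split()):
            diffs.append(i)
    return diffs

out = ["Pierre Vinken ,", "Mr . Vinken"]
gold = ["Pierre Vinken ,", "Mr. Vinken"]
print(tokenization_diffs(out, gold))  # [2]
```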

Hyperparameter study

For this tool, I decided to check the influence of two hyperparameters: cutoff and number of iterations. The results are as follows:

Cutoff (default: 5)

| Value | Categories | Types  | Subtypes | Filtered |
|-------|------------|--------|----------|----------|
| 0     | 49.93%     | 46.73% | 47.00%   | 49.27%   |
| 3     | 52.05%     | 48.90% | 52.52%   | 56.35%   |
| 4     | 52.38%     | 48.12% | 52.35%   | 56.72%   |
| 5     | 50.90%     | 47.59% | 50.76%   | 56.87%   |
| 6     | 50.85%     | 46.41% | 50.64%   | 55.91%   |
| 7     | 50.21%     | 46.34% | 50.73%   | 55.25%   |
| 10    | 49.09%     | 44.78% | 50.65%   | 54.56%   |

*Figure: OpenNLP cutoff graph*

Iterations (default: 100)

| Value | Categories | Types  | Subtypes | Filtered |
|-------|------------|--------|----------|----------|
| 70    | 50.75%     | 47.39% | 50.04%   | 55.83%   |
| 80    | 50.85%     | 47.51% | 50.52%   | 56.27%   |
| 90    | 50.91%     | 47.54% | 50.75%   | 56.52%   |
| 100   | 50.90%     | 47.59% | 50.76%   | 56.87%   |
| 110   | 50.94%     | 47.67% | 51.22%   | 57.16%   |
| 120   | 51.19%     | 47.81% | 51.81%   | 57.27%   |
| 125   | 51.31%     | 47.77% | 51.81%   | 57.38%   |
| 130   | 51.33%     | 47.68% | 51.91%   | 57.30%   |
| 135   | 51.22%     | 47.68% | 51.94%   | 57.26%   |
| 150   | 51.31%     | 47.56% | 52.02%   | 57.47%   |
| 160   | 51.46%     | 47.46% | 52.20%   | 57.50%   |
| 170   | 51.52%     | 47.43% | 52.61%   | 57.49%   |
| 180   | 51.45%     | 47.50% | 52.58%   | 57.44%   |
| 200   | 51.38%     | 47.58% | 52.59%   | 57.73%   |

*Figure: OpenNLP iterations graph*

Results for SIGARRA News Corpus

Repeated holdout

| Iterations | Cutoff | Precision | Recall | F-measure |
|------------|--------|-----------|--------|-----------|
| 100        | 5      | 87.99%    | 77.96% | 82.67%    |
| 170        | 4      | 87.87%    | 78.98% | 83.19%    |

Repeated 10-fold cross validation

| Iterations | Cutoff | Precision | Recall | F-measure |
|------------|--------|-----------|--------|-----------|
| 100        | 5      | 88.43%    | 79.12% | 83.52%    |
| 170        | 4      | 88.03%    | 79.85% | 83.74%    |

Resources

Get the generated models on the Resources page.