OpenNLP
- Download OpenNLP here.
- Run the command to train a model (script):

  ```
  opennlp TokenNameFinderTrainer -model <model.bin> -lang <pt> -data <training_data.txt> -encoding <UTF-8>
  ```

  - model.bin - Output model name
  - pt - Language of the model
  - training_data.txt - Input dataset, in the format described below, for training the NER model
  - UTF-8 - Encoding of the training data
- Run the command to perform NER (script):

  ```
  opennlp TokenNameFinder <model.bin> < <corpus_test.txt> > <output_file>
  ```

  - model.bin - Input model name
  - corpus_test.txt - Input dataset, in the right format (untagged text), for evaluating the NER model (Note: it has to be in UTF-8)
  - output_file - Output file for the tagged text
- Run the command to evaluate the NER model (script):

  ```
  opennlp TokenNameFinderEvaluator -encoding <UTF-8> -model <model.bin> -data <corpus_test.txt>
  ```

  - model.bin - Input model name
  - corpus_test.txt - Input dataset, in the right format (tagged like the training data), for evaluating the NER model
  - UTF-8 - Encoding of the test data
Check this folder for more information.
The training data is a file with one sentence per line. Entities are delimited with `<START:tag-name>` and `<END>` tags. Example:
```
<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .
<START:person> Rudolph Agnew <END> , 55 years old and former chairman of Consolidated Gold Fields PLC , was named a director of this British industrial conglomerate .
```
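For reference, here is a minimal Python sketch (not one of the project's scripts) showing how the entity spans in this format can be pulled out of a line:

```python
import re

# Matches one "<START:tag> entity tokens <END>" span; the tag name may contain hyphens.
SPAN_RE = re.compile(r"<START:([\w-]+)>\s*(.*?)\s*<END>")

def extract_entities(line):
    """Return (tag, entity text) pairs found in one training line."""
    return SPAN_RE.findall(line)

line = "<START:person> Pierre Vinken <END> , 61 years old , will join the board ."
print(extract_entities(line))  # [('person', 'Pierre Vinken')]
```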
Using the `re` Python library, replace the `<EM>` opening tags with the appropriate `<START:...>` tags, and the `</EM>` closing tags with `<END>` tags:
```python
import re

# Map the corpus's <EM CATEG="..."> / </EM> entity tags to OpenNLP's <START:...> / <END> tags
newdata = re.sub(r"<EM CATEG=\"(\w+)\">", r"<START:\1>", newdata)
newdata = newdata.replace("</EM>", "<END>")
```
Check this folder for this conversion process.
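As a quick illustration of the substitution above (the input line here is a made-up example in the corpus's tag style, not taken from the actual dataset):

```python
import re

sample = 'No dia seguinte, <EM CATEG="PESSOA">Pedro</EM> viajou para <EM CATEG="LOCAL">Lisboa</EM>.'

converted = re.sub(r"<EM CATEG=\"(\w+)\">", r"<START:\1>", sample)
converted = converted.replace("</EM>", "<END>")
print(converted)
# No dia seguinte, <START:PESSOA>Pedro<END> viajou para <START:LOCAL>Lisboa<END>.
```

Note that the converted tags are not yet surrounded by spaces; that is handled in a later step below.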
After replacing the tags, sentence segmentation has to be performed. NLTK was used for this purpose, in this script.
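The linked script has the details; a minimal sketch of the same idea with NLTK's Punkt tokenizer (the Portuguese model and the file names are assumptions, not taken from the script):

```python
import nltk

nltk.download("punkt")  # Punkt sentence-segmentation models

# Placeholder file name for the tag-converted corpus
with open("converted_corpus.txt", encoding="utf-8") as f:
    text = f.read()

# The Portuguese Punkt model is assumed, given the language of the corpus
sentences = nltk.sent_tokenize(text, language="portuguese")

with open("segmented_corpus.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sentences))
```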
Since the segmentation was not perfect, I had to join faulty segmentations using this script.
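The exact repair logic is in the linked script; one possible heuristic, shown here purely as an assumed illustration, is to re-join a line with the next one whenever a `<START:...>` tag is left without its matching `<END>`:

```python
def join_faulty_segments(lines):
    """Re-join lines where segmentation split an entity (assumed heuristic)."""
    fixed, buffer = [], ""
    for line in lines:
        buffer = (buffer + " " + line).strip() if buffer else line.strip()
        # An open <START:...> without its <END> means the sentence split broke an entity
        if buffer.count("<START:") > buffer.count("<END>"):
            continue
        fixed.append(buffer)
        buffer = ""
    if buffer:
        fixed.append(buffer)
    return fixed
```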
Then, since the start and end tags must have a space before and after them (or else it won't work), I added the missing spaces using this script.
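Again, the linked script is authoritative; a sketch of the spacing fix using `re` (these regexes are my own reconstruction, not the script's):

```python
import re

def pad_tags(line):
    """Ensure every <START:...> and <END> tag is surrounded by single spaces."""
    line = re.sub(r"\s*(<START:[\w-]+>)\s*", r" \1 ", line)
    line = re.sub(r"\s*(<END>)\s*", r" \1 ", line)
    return re.sub(r"\s{2,}", " ", line).strip()  # collapse doubled spaces

print(pad_tags("<START:person>Pierre Vinken<END>, 61 years old"))
# <START:person> Pierre Vinken <END> , 61 years old
```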
Finally, in order to provide the input for OpenNLP's TokenNameFinder, I removed the tags from the test set; check the script here.
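A sketch of that tag-stripping step (the real script is the one linked above):

```python
import re

def strip_tags(line):
    """Remove the <START:...> and <END> tags, leaving plain tokenized text."""
    line = re.sub(r"<START:[\w-]+>\s*", "", line)
    line = re.sub(r"\s*<END>\s*", " ", line)
    return re.sub(r"\s{2,}", " ", line).strip()

print(strip_tags("<START:person> Rudolph Agnew <END> , 55 years old"))
# Rudolph Agnew , 55 years old
```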
In order to evaluate the results with the conlleval script, I had to convert OpenNLP's output to the CoNLL format. Check here for the script.
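conlleval expects one token per line with its gold and predicted labels, so the tagged gold file and OpenNLP's tagged output are converted token by token and paired up. The conversion used is in the linked script; a minimal sketch of the core idea (the BIO labelling scheme here is my assumption) looks like this:

```python
import re

def opennlp_to_conll_labels(tagged_line):
    """Turn a '<START:tag> ... <END>' sentence into (token, BIO label) pairs."""
    pairs, label = [], "O"
    for token in tagged_line.split():
        start = re.match(r"<START:([\w-]+)>$", token)
        if start:
            label = "B-" + start.group(1)
            continue
        if token == "<END>":
            label = "O"
            continue
        pairs.append((token, label))
        if label.startswith("B-"):
            label = "I-" + label[2:]  # following tokens of the same entity
    return pairs

print(opennlp_to_conll_labels("<START:person> Pierre Vinken <END> , 61 years old"))
# [('Pierre', 'B-person'), ('Vinken', 'I-person'), (',', 'O'), ('61', 'O'), ('years', 'O'), ('old', 'O')]
```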
Check this folder for all results.
Results after 4 repeats:
Level | Precision | Recall | F-measure |
---|---|---|---|
Categories | 55.43% | 51.94% | 53.63% |
Types | 52.13% | 45.40% | 48.53% |
Subtypes | 72.60% | 39.00% | 50.74% |
Filtered | 69.55% | 48.93% | 57.44% |
Note: to ensure correct evaluation results, I used a script to report any tokenization differences between the output and the golden data. When there were differences, I fixed the files manually.
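A sketch of such a check, comparing the two files line by line on their tokens (file names are placeholders, and the tag filter is an assumption):

```python
def report_tokenization_diffs(gold_path, output_path):
    """Print the lines where the golden data and the tagged output disagree on tokenization."""
    with open(gold_path, encoding="utf-8") as gold, open(output_path, encoding="utf-8") as out:
        for i, (g_line, o_line) in enumerate(zip(gold, out), start=1):
            # Drop the <START:...>/<END> tags and compare the remaining tokens
            g_tokens = [t for t in g_line.split() if not t.startswith("<")]
            o_tokens = [t for t in o_line.split() if not t.startswith("<")]
            if g_tokens != o_tokens:
                print(f"Line {i}: gold has {len(g_tokens)} tokens, output has {len(o_tokens)}")

report_tokenization_diffs("golden_test.txt", "opennlp_output.txt")  # placeholder file names
```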
For this tool, I decided to check the influence of two hyperparameters: cutoff and iterations (see the sketch after the tables below for one way to set them). The results are the following:
Cutoff (default: 5)
Value | Categories | Types | Subtypes | Filtered |
---|---|---|---|---|
0 | 49.93% | 46.73% | 47.00% | 49.27% |
3 | 52.05% | 48.90% | 52.52% | 56.35% |
4 | 52.38% | 48.12% | 52.35% | 56.72% |
5 | 50.90% | 47.59% | 50.76% | 56.87% |
6 | 50.85% | 46.41% | 50.64% | 55.91% |
7 | 50.21% | 46.34% | 50.73% | 55.25% |
10 | 49.09% | 44.78% | 50.65% | 54.56% |
Iterations (default: 100)
Value | Categories | Types | Subtypes | Filtered |
---|---|---|---|---|
70 | 50.75% | 47.39% | 50.04% | 55.83% |
80 | 50.85% | 47.51% | 50.52% | 56.27% |
90 | 50.91% | 47.54% | 50.75% | 56.52% |
100 | 50.90% | 47.59% | 50.76% | 56.87% |
110 | 50.94% | 47.67% | 51.22% | 57.16% |
120 | 51.19% | 47.81% | 51.81% | 57.27% |
125 | 51.31% | 47.77% | 51.81% | 57.38% |
130 | 51.33% | 47.68% | 51.91% | 57.30% |
135 | 51.22% | 47.68% | 51.94% | 57.26% |
150 | 51.31% | 47.56% | 52.02% | 57.47% |
160 | 51.46% | 47.46% | 52.20% | 57.50% |
170 | 51.52% | 47.43% | 52.61% | 57.49% |
180 | 51.45% | 47.50% | 52.58% | 57.44% |
200 | 51.38% | 47.58% | 52.59% | 57.73% |
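One way to set these two values in OpenNLP is a training-parameters file passed to the trainer with `-params`. The snippet below is an illustrative sketch (using the 170/4 values from the tables), not the exact configuration used here:

```
# params.txt (illustrative)
Algorithm=MAXENT
Iterations=170
Cutoff=4
```

```
opennlp TokenNameFinderTrainer -params <params.txt> -model <model.bin> -lang <pt> -data <training_data.txt> -encoding <UTF-8>
```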
Repeated holdout
Iterations | Cutoff | Precision | Recall | F-measure |
---|---|---|---|---|
100 | 5 | 87.99% | 77.96% | 82.67% |
170 | 4 | 87.87% | 78.98% | 83.19% |
Repeated 10-fold cross validation
Iterations | Cutoff | Precision | Recall | F-measure |
---|---|---|---|---|
100 | 5 | 88.43% | 79.12% | 83.52% |
170 | 4 | 88.03% | 79.85% | 83.74% |
Get the generated models on the Resources page.