This wiki documents the development process for my master's thesis, entitled "Entity and relation extraction from web content".

First, the HAREM dataset was used to perform named-entity recognition (NER) with the available tools, namely Stanford CoreNLP (Stanford NER), NLTK, OpenNLP, and spaCy.
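
As a rough illustration of how these tools are applied (this is not the thesis code), the sketch below tags one Portuguese sentence with a trained spaCy model; the model path is a placeholder assumption:

```python
# Minimal sketch: running NER over one sentence with a trained spaCy model.
# "./harem_model" is a hypothetical path to a model trained on HAREM.
import spacy

nlp = spacy.load("./harem_model")
doc = nlp("José Saramago nasceu em Azinhaga, em Portugal.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. "José Saramago PESSOA"
```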

All programs were intended to be run on the HAREM dataset at four different levels (a sketch of how labels map to levels follows the list):

  • Categories: use only categories
  • Types: use only types
  • Subtypes: use only subtypes
  • Filtered: use filtered categories
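
A minimal sketch of what label each level could use, assuming a HAREM annotation carries a category, type, and subtype, and that "filtered" keeps a subset of the categories; the field names and the FILTERED set are illustrative assumptions, not the thesis code:

```python
# Illustrative only: pick the evaluation label for a HAREM annotation
# at a given level. FILTERED is an assumed subset of HAREM categories.
FILTERED = {"PESSOA", "LOCAL", "ORGANIZACAO", "TEMPO", "VALOR"}

def label_for_level(category, etype=None, subtype=None, level="categories"):
    if level == "categories":
        return category
    if level == "types":
        return etype
    if level == "subtypes":
        return subtype
    if level == "filtered":
        # Keep the category only if it belongs to the filtered subset.
        return category if category in FILTERED else None
    raise ValueError(f"unknown level: {level}")

print(label_for_level("PESSOA", "INDIVIDUAL", level="types"))  # INDIVIDUAL
```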

Results

Considering only the categories level, the results, ordered by F-measure, were:

  • Stanford CoreNLP: 56.10%
  • OpenNLP: 53.63%
  • spaCy: 46.81%
  • NLTK: 30.97%

Results for categories:

Tool              Precision  Recall  F-measure
Stanford CoreNLP  58.84%     53.60%  56.10%
OpenNLP           55.43%     51.94%  53.63%
spaCy             51.21%     43.10%  46.81%
NLTK              30.58%     31.38%  30.97%
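
The F-measure column is the harmonic mean of precision and recall; a quick check in Python against the Stanford CoreNLP row:

```python
# F-measure (F1) as the harmonic mean of precision and recall,
# checked against the Stanford CoreNLP row above.
def f_measure(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(58.84, 53.60), 2))  # 56.1
```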

F-measure for all levels:

Tool              Categories  Types   Subtypes  Filtered
Stanford CoreNLP  56.10%      -       -         61.10%
OpenNLP           53.63%      48.53%  50.74%    57.44%
spaCy             46.81%      44.04%  37.86%    49.22%
NLTK              30.97%      28.82%  21.91%    32.12%

Performance

Average training time:

Tool              Categories          Types               Subtypes            Filtered            All
Stanford CoreNLP  11m40s              -                   -                   5m09s               11h13m
OpenNLP           22s                 52s                 44s                 16s                 1h30m
spaCy             3m17s               5m19s               5m20s               2m55s               11h14m
NLTK              2s + 1m56s + 5m55s  2s + 5m23s + 5m54s  2s + 4m25s + 5m52s  2s + 1m12s + 5m58s  24h30m

Notes: the All column is the total training time over every fold and repetition, combined across all levels. Stanford CoreNLP ran only at the categories and filtered levels. NLTK ran three different algorithms at each level (the three summed terms in each cell), hence its high All value.
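
Since each NLTK cell sums the training times of its three algorithms, a small helper (hypothetical, not from the thesis) can convert such a cell to total seconds:

```python
# Hypothetical helper: total seconds for a cell such as NLTK's
# "2s + 1m56s + 5m55s", where the terms are per-algorithm training times.
import re

def to_seconds(cell):
    total = 0
    for term in cell.split("+"):
        m = re.fullmatch(r"\s*(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?\s*", term)
        h, mn, s = (int(g) if g else 0 for g in m.groups())
        total += h * 3600 + mn * 60 + s
    return total

print(to_seconds("2s + 1m56s + 5m55s"))  # 473
```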
