Home
This wiki documents the development process for my master's thesis, titled "Entity and relation extraction from web content".
First, named entity recognition (NER) was performed on the HAREM dataset using four available tools: Stanford NER, NLTK, OpenNLP and spaCy.
- brat: annotation tool and annotated SIGARRA news
- datasets: keeps the datasets used
- scripts:
  - extra: scripts not yet used
  - evaluation: scripts to compute the evaluation of all tools, using the conlleval script
  - filter-harem: scripts to manipulate the HAREM dataset
    - harem-to-opennlp: transform HAREM into the OpenNLP input format
    - harem-to-standoff: transform HAREM into standoff format, used by spaCy
    - harem-to-stanford: transform HAREM into CoNLL format, used by Stanford NER
    - src: source files for the scripts
    - run-scripts: commands to run the scripts
- tools:
  - nltk: folder to keep NLTK-related data/scripts
  - open-nlp: folder to keep OpenNLP-related data/scripts
  - spacy: folder to keep spaCy-related data/scripts
  - stanford-ner: folder to keep Stanford NER-related data/scripts
All tools were intended to be run on HAREM at four different annotation levels:
- Categories: use only categories
- Types: use only types
- Subtypes: use only subtypes
- Filtered: use filtered categories
Taking into account only the categories, the results, ordered by F-measure, were:
- Stanford CoreNLP: 56.10%
- OpenNLP: 53.63%
- spaCy: 46.81%
- NLTK: 30.97%
Results for categories:
Tool | Precision | Recall | F-measure |
---|---|---|---|
Stanford CoreNLP | 58.84% | 53.60% | 56.10% |
OpenNLP | 55.43% | 51.94% | 53.63% |
spaCy | 51.21% | 43.10% | 46.81% |
NLTK | 30.58% | 31.38% | 30.97% |
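
The F-measure column can be checked against the precision and recall columns. The sketch below assumes the reported F-measure is the balanced F1 score (the harmonic mean of precision and recall, which is what conlleval reports as FB1); the function name `f_measure` and the `results` dictionary are illustrative and are not taken from the repository's scripts.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when nothing was predicted or found
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in %) copied from the categories table above.
results = {
    "Stanford CoreNLP": (58.84, 53.60),
    "OpenNLP": (55.43, 51.94),
    "spaCy": (51.21, 43.10),
    "NLTK": (30.58, 31.38),
}

for tool, (precision, recall) in results.items():
    print(f"{tool}: {f_measure(precision, recall):.2f}%")
```

Under that assumption, the computed values match the F-measure column to two decimal places.
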
F-measure for all levels:
Tool | Categories | Types | Subtypes | Filtered |
---|---|---|---|---|
Stanford CoreNLP | 56.10% | - | - | 61.10% |
OpenNLP | 53.63% | 48.53% | 50.74% | 57.44% |
spaCy | 46.81% | 44.04% | 37.86% | 49.22% |
NLTK | 30.97% | 28.82% | 21.91% | 32.12% |
Average training time:
Tool | Categories | Types | Subtypes | Filtered | All |
---|---|---|---|---|---|
Stanford CoreNLP | 11m40s | - | - | 5m09s | 11h13m |
OpenNLP | 22s | 52s | 44s | 16s | 1h30m |
spaCy | 3m17s | 5m19s | 5m20s | 2m55s | 11h14m |
NLTK | 2s + 1m56s + 5m55s | 2s + 5m23s + 5m54s | 2s + 4m25s + 5m52s | 2s + 1m12s + 5m58s | 24h30m |
Notes: The All column is the total training time over every fold and repeat, combined for all levels. Stanford CoreNLP was only run at the categories and filtered levels. NLTK was run with three different algorithms at each level, which is why each NLTK cell lists three times and why its value in the All column is so high.
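
To compare NLTK with the other tools at a single level, the three per-algorithm times in a cell can be added up. The following is a minimal sketch of that arithmetic; the helper names (`parse_duration`, `total_seconds`, `format_duration`) are hypothetical, and summing the values assumes the three algorithms were trained one after another.

```python
import re

# Seconds per unit suffix used in the training-time table.
UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def parse_duration(text: str) -> int:
    """Convert a duration such as '1m56s' or '11h13m' to seconds."""
    parts = re.findall(r"(\d+)([hms])", text)
    if not parts:
        raise ValueError(f"no duration found in {text!r}")
    return sum(int(value) * UNIT_SECONDS[unit] for value, unit in parts)


def total_seconds(cell: str) -> int:
    """Sum every '+'-separated duration in a table cell."""
    return sum(parse_duration(part) for part in cell.split("+"))


def format_duration(seconds: int) -> str:
    """Render seconds as minutes and seconds, e.g. 473 -> '7m53s'."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes}m{secs:02d}s"


# NLTK, categories level: three algorithms.
nltk_categories = total_seconds("2s + 1m56s + 5m55s")
print(format_duration(nltk_categories))  # 7m53s
```
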