André Pires edited this page Apr 21, 2017 · 36 revisions

This wiki documents the development process for my master's thesis, titled *Entity and relation extraction from web content*.

First, named entity recognition (NER) was performed on the HAREM dataset using available tools, namely Stanford NER, NLTK, OpenNLP and spaCy.

Main repository folders

All programs were intended to be run over HAREM at four different annotation levels:

  • Categories: use only categories
  • Types: use only types
  • Subtypes: use only subtypes
  • Filtered: use filtered categories
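The four levels can be seen as different projections of each HAREM annotation. A minimal sketch of that idea (the entity fields and the filtered category set below are illustrative assumptions, not the thesis code):

```python
# Illustrative sketch: project a HAREM-style annotation onto one of the
# four evaluation levels. Field names and FILTERED are assumptions.
FILTERED = {"PESSOA", "LOCAL", "ORGANIZACAO"}  # hypothetical filtered subset

def label_for_level(entity, level):
    """Return the label used when scoring at the given level."""
    if level == "categories":
        return entity["category"]
    if level == "types":
        return entity["type"]
    if level == "subtypes":
        return entity["subtype"]
    if level == "filtered":
        # Keep the category only if it belongs to the filtered subset.
        return entity["category"] if entity["category"] in FILTERED else None
    raise ValueError(f"unknown level: {level}")

ent = {"category": "PESSOA", "type": "INDIVIDUAL", "subtype": "POVO"}
print(label_for_level(ent, "categories"))  # PESSOA
print(label_for_level(ent, "types"))       # INDIVIDUAL
```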

Results

Taking into account only the categories, the results, ordered by F-measure, were:

  • OpenNLP: 53.63%
  • Stanford CoreNLP: 53.22%
  • spaCy: 46.81%
  • NLTK: 28.33%

Results for categories:

| Tool | Precision | Recall | F-measure |
|------|-----------|--------|-----------|
| Stanford CoreNLP | 55.67% | 51.05% | 53.22% |
| OpenNLP | 55.43% | 51.94% | 53.63% |
| spaCy | 51.21% | 43.10% | 46.81% |
| NLTK | 28.16% | 28.64% | 28.33% |
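The F-measure column is the harmonic mean of precision and recall. A quick sanity check with values from the table (small differences from the reported column come from rounding the inputs):

```python
def f_measure(precision, recall):
    """F1: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision/recall values (in %) taken from the categories table.
print(round(f_measure(55.43, 51.94), 2))  # 53.63 (OpenNLP)
print(round(f_measure(51.21, 43.10), 2))  # 46.81 (spaCy)
```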

F-measure for all levels:

| Tool | Categories | Types | Subtypes | Filtered |
|------|------------|-------|----------|----------|
| Stanford CoreNLP | 53.22% | - | - | 58.82% |
| OpenNLP | 53.63% | 48.53% | 50.74% | 57.44% |
| spaCy | 46.81% | 44.04% | 37.86% | 49.22% |
| NLTK | 28.33% | 24.88% | 20.08% | 30.32% |