Results overview

Results for HAREM

Considering only the category level, the tools ranked as follows by F-measure:

Stanford CoreNLP: 56.10%
OpenNLP: 53.63%
SpaCy: 46.81%
NLTK: 30.97%

Results for categories:

Tool	Precision	Recall	F-measure
Stanford CoreNLP	58.84%	53.60%	56.10%
OpenNLP	55.43%	51.94%	53.63%
SpaCy	51.21%	43.10%	46.81%
NLTK	30.58%	31.38%	30.97%
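
The F-measure column can be checked against the precision and recall columns. The sketch below assumes the reported F-measure is the balanced F1 score (the harmonic mean of precision and recall). The function name `f_measure` is illustrative and is not taken from the project's evaluation scripts, which may compute the score per fold and then average it.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in percent) copied from the categories table above.
category_scores = {
    "Stanford CoreNLP": (58.84, 53.60),
    "OpenNLP": (55.43, 51.94),
    "SpaCy": (51.21, 43.10),
    "NLTK": (30.58, 31.38),
}

for tool, (precision, recall) in category_scores.items():
    print(f"{tool}: {f_measure(precision, recall):.2f}%")
```

Under that assumption, the computed values agree with the table to two decimal places.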

F-measure for all levels:

Tool	Categories	Types	Subtypes	Filtered
Stanford CoreNLP	56.10%	-	-	61.10%
OpenNLP	53.63%	48.53%	50.74%	57.44%
SpaCy	46.81%	44.04%	37.86%	49.22%
NLTK	30.97%	28.82%	21.91%	32.12%

Performance

Average training time:

Tool	Categories	Types	Subtypes	Filtered	All
Stanford CoreNLP	11m40s	-	-	5m09s	11h13m
OpenNLP	22s	52s	44s	16s	1h30m
SpaCy	3m17s	5m19s	5m20s	2m55s	11h14m
NLTK	2s + 1m56s + 5m55s	2s + 5m23s + 5m54s	2s + 4m25s + 5m52s	2s + 1m12s + 5m58s	24h30m

Notes: The All column is the total training time across every fold and repeat, combined over all levels. Stanford CoreNLP was run only on the categories and filtered levels. NLTK ran three different algorithms at each level, so each NLTK cell lists three times, which also explains its high value in the All column.
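
To compare the NLTK cells with the other tools, the three times in each cell can be added up. This is a minimal sketch that assumes durations are written with explicit `h`, `m`, and `s` suffixes, as in most cells of the table. The helper names (`parse_duration`, `total_seconds`, `format_duration`) are hypothetical and not part of the project.

```python
import re

_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}
_PART = re.compile(r"(\d+)([hms])")


def parse_duration(text: str) -> int:
    """Convert a duration such as '11m40s' or '11h13m' to seconds."""
    return sum(int(value) * _UNIT_SECONDS[unit] for value, unit in _PART.findall(text))


def total_seconds(cell: str) -> int:
    """Sum a cell that lists several durations separated by '+'."""
    return sum(parse_duration(part) for part in cell.split("+"))


def format_duration(seconds: int) -> str:
    """Render seconds back in the table's compact style, e.g. '7m53s'."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    parts = [(hours, "h"), (minutes, "m"), (secs, "s")]
    return "".join(f"{value}{unit}" for value, unit in parts if value) or "0s"


# NLTK, categories level: three algorithms trained one after another.
print(format_duration(total_seconds("2s + 1m56s + 5m55s")))  # 7m53s
```

Summed this way, NLTK's average training time at the categories level is 7m53s, which can be compared directly with the 11m40s reported for Stanford CoreNLP.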

Hyperparameter study results

Tool	Default F-measure	Best configurations	Best F-measure
Stanford CoreNLP	54.14%	tolerance=1e-3	54.31%
OpenNLP	50.90%	cutoff=4	52.38%
OpenNLP	50.90%	iterations=170	51.52%
SpaCy	54.70%	iterations=110	46.60%
NLTK DT	26.14%	entropy_cutoff=0.08	26.63%
NLTK DT	26.14%	support_cutoff=16	26.18%
NLTK ME	1.11%	min_lldelta=0, iterations=100	35.24%