Home
This wiki documents the development process for my master's thesis, titled "Named entity extraction from Portuguese web text".
First, the HAREM dataset was used to perform NER with the available tools, namely Stanford CoreNLP, NLTK, OpenNLP and spaCy. Repeated 10-fold cross-validation was used to evaluate all tools; all results are present in this wiki. More info on the HAREM collection is available on its page.
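The repeated 10-fold cross-validation scheme can be sketched as follows, assuming the corpus is held as a list of annotated sentences (the per-tool training and evaluation calls are omitted):

```python
import random

def repeated_kfold(sentences, k=10, repeats=3, seed=42):
    """Yield (train, test) splits: k folds per repetition, reshuffled each time."""
    rng = random.Random(seed)
    for _ in range(repeats):
        shuffled = sentences[:]
        rng.shuffle(shuffled)
        # Slice the shuffled corpus into k interleaved folds.
        folds = [shuffled[i::k] for i in range(k)]
        for i in range(k):
            test = folds[i]
            train = [s for j, fold in enumerate(folds) if j != i for s in fold]
            yield train, test

# With 3 repetitions of 10 folds, each tool is trained and scored 30 times.
splits = list(repeated_kfold(list(range(100))))
```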
After evaluating all tools with the baseline configuration, I performed a hyperparameter study for each tool, this time using repeated holdout validation.
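The hyperparameter study can be sketched as a grid search scored with repeated holdout splits; the parameter names below are illustrative placeholders, not the actual options of any of the tools:

```python
import itertools
import random

GRID = {"iterations": [100, 300], "cutoff": [1, 5]}  # hypothetical parameters

def repeated_holdout(data, test_ratio=0.2, repeats=5, seed=0):
    """Yield (train, test) splits from freshly shuffled copies of the data."""
    rng = random.Random(seed)
    n_test = int(len(data) * test_ratio)
    for _ in range(repeats):
        shuffled = data[:]
        rng.shuffle(shuffled)
        yield shuffled[n_test:], shuffled[:n_test]

def grid(params):
    """Enumerate every combination of the parameter grid as a dict."""
    keys = list(params)
    for values in itertools.product(*(params[k] for k in keys)):
        yield dict(zip(keys, values))

# Each configuration would be trained/evaluated once per holdout split.
configs = list(grid(GRID))
```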
I manually annotated a subset of SIGARRA news, producing a Portuguese corpus of 905 annotated news articles. Finally, I trained models with each tool on this dataset. More info on the SIGARRA News Corpus is available on its page.
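The brat annotations live in its standoff format, where each text-bound entity occupies one tab-separated line of a `.ann` file. A minimal reader for such lines (handling only simple, non-discontinuous `T` annotations; the entity labels in the sample are illustrative) might look like:

```python
def parse_ann(ann_text):
    """Parse brat text-bound annotations ("T" lines) into tuples."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, events, notes, etc.
        _tid, info, surface = line.split("\t")
        label, start, end = info.split(" ")
        entities.append((label, int(start), int(end), surface))
    return entities

sample = "T1\tPessoa 0 5\tMaria\nT2\tLocal 14 19\tPorto"
print(parse_ann(sample))
# [('Pessoa', 0, 5, 'Maria'), ('Local', 14, 19, 'Porto')]
```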
Repository structure:
- brat: annotation tool and annotated SIGARRA news
- datasets: keeps the datasets used
- scripts:
  - extra: some extra scripts not directly used
  - evaluation: scripts to compute the evaluation of all tools, using the conlleval script
  - filter-harem: scripts to manipulate the HAREM dataset
    - harem-to-opennlp: transform HAREM into the OpenNLP input format
    - harem-to-standoff: transform HAREM into standoff format, used by spaCy
    - harem-to-stanford: transform HAREM into CoNLL format, used by Stanford CoreNLP
    - src: source files for scripts
    - run-scripts: commands to run scripts
  - filter-sigarra: scripts to manipulate the SIGARRA dataset
    - sigarra-to-opennlp: transform SIGARRA into the OpenNLP input format
    - sigarra-to-standoff: transform SIGARRA into standoff format, used by spaCy
    - src: source files for scripts
    - run-scripts: commands to run scripts
- tools:
  - nltk: NLTK-related data/scripts
  - open-nlp: OpenNLP-related data/scripts
  - spacy: spaCy-related data/scripts
  - stanford-ner: Stanford CoreNLP-related data/scripts
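The harem-to-stanford conversion boils down to emitting one token/tag pair per line (CoNLL style), with a blank line between sentences. A sketch, assuming tokens and tags already come as parallel lists:

```python
def to_conll(sentences):
    """sentences: list of (tokens, tags) pairs -> tab-separated CoNLL text."""
    blocks = []
    for tokens, tags in sentences:
        assert len(tokens) == len(tags), "one tag per token"
        blocks.append("\n".join(f"{tok}\t{tag}"
                                for tok, tag in zip(tokens, tags)))
    return "\n\n".join(blocks)  # blank line separates sentences

print(to_conll([(["Maria", "vive", "no", "Porto"],
                 ["PESSOA", "O", "O", "LOCAL"])]))
```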
All tools were intended to be run across HAREM at four different entity levels:
- Categories: use only categories
- Types: use only types
- Subtypes: use only subtypes
- Filtered: use filtered categories (subset of categories)
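Selecting one of the four levels amounts to picking the right component of a hierarchical HAREM label such as PESSOA.INDIVIDUAL; the FILTERED set below is an illustrative assumption, not the exact subset used in the thesis:

```python
FILTERED = {"PESSOA", "LOCAL", "ORGANIZACAO"}  # assumed subset, for illustration

def select_level(label, level):
    """Map a dotted HAREM label to the requested entity level (or None)."""
    parts = label.split(".")  # [category, type, subtype]
    if level == "categories":
        return parts[0]
    if level == "types":
        return parts[1] if len(parts) > 1 else None
    if level == "subtypes":
        return parts[2] if len(parts) > 2 else None
    if level == "filtered":
        return parts[0] if parts[0] in FILTERED else None
    raise ValueError(f"unknown level: {level}")
```

Entities whose label lacks the requested component (or falls outside the filtered subset) map to None and would simply be dropped at that level.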