Home
This wiki documents the development process for my master's thesis, titled "Named entity extraction from Portuguese web text".
First, the HAREM dataset was used to perform NER with the available tools, namely Stanford CoreNLP, NLTK, OpenNLP and spaCy. Repeated 10-fold cross-validation was used to evaluate all tools; all results are present in this wiki. More info on the HAREM collection is available on its page.
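The repeated 10-fold cross-validation scheme can be sketched as follows, assuming the corpus is held as a list of annotated sentences (the per-tool training and evaluation calls are omitted):

```python
import random

def repeated_kfold(sentences, k=10, repeats=3, seed=42):
    """Yield (train, test) splits: k folds per repetition, reshuffled each time."""
    rng = random.Random(seed)
    for _ in range(repeats):
        shuffled = sentences[:]
        rng.shuffle(shuffled)
        # Slice the shuffled corpus into k interleaved folds.
        folds = [shuffled[i::k] for i in range(k)]
        for i in range(k):
            test = folds[i]
            train = [s for j, fold in enumerate(folds) if j != i for s in fold]
            yield train, test

# With 3 repetitions of 10 folds, each tool is trained and scored 30 times.
splits = list(repeated_kfold(list(range(100))))
```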
After evaluating all tools with the baseline configuration, I performed a hyperparameter study for each tool, this time using repeated holdout validation.
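The hyperparameter study can be sketched as a grid search scored with repeated holdout splits; the parameter names below are illustrative placeholders, not the actual options of any of the tools:

```python
import itertools
import random

GRID = {"iterations": [100, 300], "cutoff": [1, 5]}  # hypothetical parameters

def repeated_holdout(data, test_ratio=0.2, repeats=5, seed=0):
    """Yield (train, test) splits from freshly shuffled copies of the data."""
    rng = random.Random(seed)
    n_test = int(len(data) * test_ratio)
    for _ in range(repeats):
        shuffled = data[:]
        rng.shuffle(shuffled)
        yield shuffled[n_test:], shuffled[:n_test]

def grid(params):
    """Enumerate every combination of the parameter grid as a dict."""
    keys = list(params)
    for values in itertools.product(*(params[k] for k in keys)):
        yield dict(zip(keys, values))

# Each configuration would be trained/evaluated once per holdout split.
configs = list(grid(GRID))
```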
I manually annotated a subset of SIGARRA news, producing a Portuguese corpus of 905 annotated news articles. Finally, I trained models with each tool on this dataset. More info on the SIGARRA News Corpus is available on its page.
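The brat annotations live in its standoff format, where each text-bound entity occupies one tab-separated line of a `.ann` file. A minimal reader for such lines (handling only simple, non-discontinuous `T` annotations; the entity labels in the sample are illustrative) might look like:

```python
def parse_ann(ann_text):
    """Parse brat text-bound annotations ("T" lines) into tuples."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, events, notes, etc.
        _tid, info, surface = line.split("\t")
        label, start, end = info.split(" ")
        entities.append((label, int(start), int(end), surface))
    return entities

sample = "T1\tPessoa 0 5\tMaria\nT2\tLocal 14 19\tPorto"
print(parse_ann(sample))
# [('Pessoa', 0, 5, 'Maria'), ('Local', 14, 19, 'Porto')]
```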
Repository structure:
- brat: annotation tool and annotated SIGARRA news
- datasets: keeps the datasets used
- scripts:
  - extra: some extra scripts not directly used
  - evaluation: scripts to compute the evaluation of all tools, using the conlleval script
  - filter-harem: scripts to manipulate the HAREM dataset
    - harem-to-opennlp: transform HAREM into the OpenNLP input format
    - harem-to-standoff: transform HAREM into standoff format, used by spaCy
    - harem-to-stanford: transform HAREM into CoNLL format, used by Stanford CoreNLP
    - src: source files for scripts
    - run-scripts: commands to run scripts
  - filter-sigarra: scripts to manipulate the SIGARRA dataset
    - sigarra-to-opennlp: transform SIGARRA into the OpenNLP input format
    - sigarra-to-standoff: transform SIGARRA into standoff format, used by spaCy
    - src: source files for scripts
    - run-scripts: commands to run scripts
- tools:
  - nltk: NLTK-related data/scripts
  - open-nlp: OpenNLP-related data/scripts
  - spacy: spaCy-related data/scripts
  - stanford-ner: Stanford CoreNLP-related data/scripts
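The harem-to-stanford conversion boils down to emitting one token/tag pair per line (CoNLL style), with a blank line between sentences. A sketch, assuming tokens and tags already come as parallel lists:

```python
def to_conll(sentences):
    """sentences: list of (tokens, tags) pairs -> tab-separated CoNLL text."""
    blocks = []
    for tokens, tags in sentences:
        assert len(tokens) == len(tags), "one tag per token"
        blocks.append("\n".join(f"{tok}\t{tag}"
                                for tok, tag in zip(tokens, tags)))
    return "\n\n".join(blocks)  # blank line separates sentences

print(to_conll([(["Maria", "vive", "no", "Porto"],
                 ["PESSOA", "O", "O", "LOCAL"])]))
```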
All tools were intended to be run across HAREM at four different entity levels:
- Categories: use only categories
- Types: use only types
- Subtypes: use only subtypes
- Filtered: use filtered categories (subset of categories)
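Selecting one of the four levels amounts to picking the right component of a hierarchical HAREM label such as PESSOA.INDIVIDUAL; the FILTERED set below is an illustrative assumption, not the exact subset used in the thesis:

```python
FILTERED = {"PESSOA", "LOCAL", "ORGANIZACAO"}  # assumed subset, for illustration

def select_level(label, level):
    """Map a dotted HAREM label to the requested entity level (or None)."""
    parts = label.split(".")  # [category, type, subtype]
    if level == "categories":
        return parts[0]
    if level == "types":
        return parts[1] if len(parts) > 1 else None
    if level == "subtypes":
        return parts[2] if len(parts) > 2 else None
    if level == "filtered":
        return parts[0] if parts[0] in FILTERED else None
    raise ValueError(f"unknown level: {level}")
```

Entities whose label lacks the requested component (or falls outside the filtered subset) map to None and would simply be dropped at that level.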