This repository has been archived by the owner on Jan 29, 2024. It is now read-only.

Perform Error Analysis of NER model predictions #607

Closed
3 tasks done
FrancescoCasalegno opened this issue Jul 12, 2022 · 8 comments

Comments

@FrancescoCasalegno
Contributor

FrancescoCasalegno commented Jul 12, 2022

Context

Actions

  • Analyse the errors produced by our NER model.
  • Are the discrepancies between y_true and y_pred actual errors, or are they mainly subjective differences (e.g. whether "cell" counts as a CELL_TYPE)?
  • What causes CELL_TYPE and CELL_COMPARTMENT in particular to have such a poor F1-score?
@EmilieDel
Contributor

EmilieDel commented Jul 12, 2022

Results of bluesearch.mining.eval.ner_errors():

(Note that the false negative and false positive results are sets, so a mistake detected more than once appears only once.)

Entity mode

BRAIN_REGION
{'false_neg': {'Cortico - thalamic', 'cortical', 'cortico - cortico', 'cortico - striatal', 'cortico - thalamic', 'retinas'},
 'false_pos': {'cortico', 'dentate gyrus', 'dorsal', 'dorsal telencephalic', 'hippocampal', 'pons', 'pontine', 'striatal', 'thalamic'}}
CELL_COMPARTMENT
{'false_neg': {'axo - somato - dendritic', 'axonal'},
 'false_pos': {'axo', 'axonal guidance', 'dendr', 'dendritic', 'mitochondrial', 'somato'}}
CELL_TYPE
{'false_neg': {'DGGCs', 'GSC', 'PC', 'TWIK-1^−/−', 'cell', 'dentate gyrus granule cells', 'glioma stem cell', 'oligodendrocyte precursor cells'},
 'false_pos': {'- cells', 'AMs', 'Aδ', 'Aδ fibers', 'RGC', 'RGCs', 'astrocytoma', 'brush cells', 'caveolated cells', 'cell', 'cells', 
'fibrillovesicular', 'granule cells', 'multivesicular', 'neural progenitor', 'oligodendrocyte', 'tuft'}}
GENE
{'false_neg': {'ClC-2', 'Wnt'},
 'false_pos': {'- gated sodium channel', 'BDNF', 'EZH2', 'GLI3', 'Kv1.1^mceph', 'NOTCH-1', 'TRAIL', 
'TWIK-1^−/−', 'Tph1', 'Wnt',  'mGluR6', 'voltage - gated ClC-2'}}
ORGANISM
{'false_neg': {'glioma cells', 'Wistar rats', 'mice', 'mouse'},
 'false_pos': {'mouse', 'mice', 'rats', 'rodent', 'human'}}

Token mode

BRAIN_REGION
{'false_neg': {'cortical', 'cortico', '-', 'retinas', 'Cortico'},
 'false_pos': {'dentate', 'dorsal', 'gyrus', 'hippocampal', 'pons', 'pontine', 'telencephalic'}}
CELL_COMPARTMENT
{'false_neg': {'-'}, 
'false_pos': {'guidance', 'dendr', 'mitochondrial'}}
CELL_TYPE
{'false_neg': {'DGGCs', 'GSC', 'PC', 'TWIK-1^−/−', 'cell', 'dentate', 'glioma', 'gyrus', 'precursor', 'stem'},
 'false_pos': {'-', 'AMs', 'Aδ', 'RGC', 'RGCs', 'astrocytoma', 'brush', 'caveolated', 'cell', 'cells', 'fibers', 'fibrillovesicular',
 'multivesicular', 'neural', 'progenitor', 'tuft'}}
GENE
{'false_neg': {'Wnt'},
 'false_pos': {'-', 'BDNF', 'EZH2', 'GLI3', 'Kv1.1^mceph', 'NOTCH-1', 'TRAIL', 'TWIK-1^−/−', 'Tph1', 'Wnt', 'channel',
'gated', 'mGluR6', 'sodium', 'voltage'}}
ORGANISM
{'false_neg': {'mouse', 'mice', 'cells', 'Wistar', 'glioma'},
 'false_pos': {'rodent', 'human', 'mice', 'mouse'}}
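For readers without access to `bluesearch`, the set-based error report above can be approximated as follows. Note that `ner_error_sets` is a hypothetical helper written for illustration, not the actual `bluesearch.mining.eval.ner_errors()` implementation, and it assumes gold and predicted entities are given as (text, label) pairs.

```python
from collections import defaultdict


def ner_error_sets(y_true, y_pred):
    """Collect unique false negatives / false positives per entity type.

    y_true, y_pred: iterables of (entity_text, entity_type) pairs.
    Results are sets, so a repeated mistake is reported only once.
    """
    errors = defaultdict(lambda: {"false_neg": set(), "false_pos": set()})
    true_set, pred_set = set(y_true), set(y_pred)
    # Gold entities the model missed.
    for text, label in true_set - pred_set:
        errors[label]["false_neg"].add(text)
    # Predicted entities absent from the gold annotations.
    for text, label in pred_set - true_set:
        errors[label]["false_pos"].add(text)
    return dict(errors)
```

Entity types with no errors simply do not appear in the returned dictionary.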

@jankrepl
Contributor

Maybe we should write a check to make sure that once an expert annotates a given word as a specific entity type, every occurrence of that word is annotated with exactly the same entity type.

Of course, some words can have multiple meanings (lab mouse vs. Mickey Mouse), but IMO we don't have to worry about that since the context is always very narrow.

@FrancescoCasalegno
Contributor Author

FrancescoCasalegno commented Jul 26, 2022

2022-07-26 Planning

  • Track the current NER annotations (= original annotations from the expert) with DVC.
  • Use a regex to check that occurrences are annotated consistently, i.e. that if "mice" is annotated once as ORGANISM it is always annotated as such. If inconsistencies are found, double-check them by hand and fix the original annotations where needed. This will help reduce the false negatives in our evaluation.
  • Use k-fold out-of-sample predictions followed by manual verification (with the help of Google/Wikipedia) to see where our model predictions differ from the human annotations, and fix the original annotations accordingly. This will help reduce both false positives and false negatives in our evaluation.
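A minimal version of the regex consistency check in the second bullet could look like the sketch below. The function name and the input format (per-document character-span annotations) are assumptions for illustration, not the project's actual tooling.

```python
import re
from collections import defaultdict


def find_inconsistencies(docs):
    """Flag surface forms annotated with an entity type in one place but
    unannotated (None) or annotated differently elsewhere.

    docs: list of (text, spans), where spans = [(start, end, label), ...].
    """
    seen = defaultdict(set)
    # Pass 1: record every label each annotated surface form receives.
    for text, spans in docs:
        for start, end, label in spans:
            seen[text[start:end]].add(label)
    # Pass 2: regex-scan all documents for occurrences of each annotated
    # form that fall outside every annotated span.
    for text, spans in docs:
        covered = [(s, e) for s, e, _ in spans]
        for form in list(seen):
            pattern = r"\b" + re.escape(form) + r"\b"
            for m in re.finditer(pattern, text):
                if not any(s <= m.start() and m.end() <= e for s, e in covered):
                    seen[form].add(None)  # unannotated occurrence
    # Only forms with more than one label (or a missing one) are suspects.
    return {form: labels for form, labels in seen.items() if len(labels) > 1}
```

Each flagged form would then be double-checked by hand, as the plan describes.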

@EmilieDel
Contributor

EmilieDel commented Jul 29, 2022

Experiment

The idea was to take all the entities annotated by GK and build an entity ruler from them (using a lemmatizer so that, e.g., both mouse and mice are detected). Once this entity ruler is constructed, the resulting model is run again on all the annotations, and its predictions are compared against the GK annotations.

What was done?

  • Several models (en_core_web_sm and en_core_sci_lg) were used, meaning different tokenizers and lemmatizers.
  • Conflicts of entity types for the same entity (= lemma) were resolved manually (to avoid any randomness in the results). However, this is not scalable.

Results

The results shown here were created with the en_core_web_sm model instantiated as follows:

nlp = spacy.load("en_core_web_sm", disable=["ner"])
nlp.remove_pipe("lemmatizer")
nlp.add_pipe("lemmatizer", config={"mode": "lookup"}).initialize()
  • 914 distinct patterns were detected (plus 2014 duplicates, i.e. the same lemma appearing several times):
GENE                529
CELL_TYPE           156
BRAIN_REGION        138
CELL_COMPARTMENT     46
ORGANISM             45
  • The comparison between the GK annotations and the entity ruler predictions (on all annotated paragraphs):
                    precision  recall  f1-score  support
BRAIN_REGION             0.74    0.97      0.84      345
CELL_COMPARTMENT         0.58    0.94      0.72      177
CELL_TYPE                0.48    0.86      0.62      677
GENE                     0.87    0.99      0.93     1469
ORGANISM                 0.61    0.98      0.75      279
  • Here are some results: (screenshots omitted)

  • Some false positives also appear with this method: (screenshot omitted)
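The lemma-based entity ruler described above can be sketched with spaCy's EntityRuler. Since en_core_web_sm plus the lookup-lemmatizer data may not be available everywhere, this minimal version uses a blank pipeline and approximates the mouse/mice matching with an `IN` pattern on `LOWER` instead of a `LEMMA` pattern; the two patterns shown are toy stand-ins for the 914 patterns derived from the GK annotations.

```python
import spacy

# Blank English pipeline: tokenizer only, no model download required.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Toy stand-ins for the patterns built from the GK annotations.
patterns = [
    {"label": "ORGANISM", "pattern": [{"LOWER": {"IN": ["mouse", "mice"]}}]},
    {"label": "GENE", "pattern": [{"LOWER": "wnt"}]},
]
ruler.add_patterns(patterns)

doc = nlp("Wnt signalling was studied in mice.")
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```

With the full lookup lemmatizer in place (as in the snippet above this comment), the ORGANISM pattern would instead be `[{"LEMMA": "mouse"}]`, covering both surface forms through a single lemma.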

@EmilieDel
Contributor

EmilieDel commented Aug 2, 2022

Here are the results of:

  • Spacy model trained on the annotations from GK
  • Spacy model trained on the entity ruler annotations

The models are evaluated respectively against the annotations from GK and the entity ruler annotations. The train and test splits are kept the same.

(screenshots of the evaluation results omitted)

@FrancescoCasalegno
Contributor Author

@EmilieDel Awesome, so the test score seems to improve significantly!
Just one thing – after #602, didn't we decide to switch to 🤗 Transformers rather than spaCy? Is it possible to see those results?

@FrancescoCasalegno
Contributor Author

FrancescoCasalegno commented Aug 3, 2022

Planning 2022-08-02

  • Find out how to export the corrected annotations in a format compatible with Prodigy's.
  • Save with DVC (remote on GPFS) the original annotations, the annotations after the entity ruler, and the annotations after the entity ruler plus manual correction.
  • Re-train and evaluate the model (k-fold cross-validation) before/after correction. Also inspect which errors are now made (as in Perform Error Analysis of NER model predictions #607 (comment)).
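The k-fold out-of-sample prediction step from the planning above can be sketched generically; `train_fn` and `predict_fn` are placeholders for whatever NER training and inference routines are actually used, not part of the project's API.

```python
import random


def out_of_fold_predictions(docs, labels, train_fn, predict_fn, k=5, seed=0):
    """Give every document a prediction from a model that never saw it.

    train_fn(train_docs, train_labels) -> model
    predict_fn(model, docs) -> list of predictions (one per doc)
    """
    idx = list(range(len(docs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal held-out folds
    preds = [None] * len(docs)
    for fold in folds:
        held_out = set(fold)
        train_idx = [i for i in idx if i not in held_out]
        model = train_fn([docs[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        for i, p in zip(fold, predict_fn(model, [docs[i] for i in fold])):
            preds[i] = p
    # Positions where preds[i] != labels[i] are the candidates for manual
    # verification (and possible fixes to the original annotations).
    return preds
```

Disagreements between these out-of-fold predictions and the expert labels are exactly the spots the plan proposes to verify by hand.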

@FrancescoCasalegno
Contributor Author

See plot in #608 (comment)
