
Collect annotations for RE model #606

Open · FrancescoCasalegno opened this issue Jun 22, 2022 · 5 comments
Labels: 🔀 relation-extraction

Comments


FrancescoCasalegno commented Jun 22, 2022

Actions

  • Get an exhaustive list of all the relations that we want to extract, e.g.
    • (GENE, is-in, BRAIN_REGION)
    • (GENE, is-in, CELL_COMPARTMENT)
    • ...
  • Define the possible "relation types"¹ for each (SUBJ_ENT_TYPE, OBJ_ENT_TYPE) pair:
    • is-in: "Purkinje cells are a class of GABAergic inhibitory neurons located in the cerebellum".
    • is-not-in: "Pyramidal neurons are found in forebrain structures such as the cerebral cortex but not in the olfactory bulbs".
    • no info on is-in: "Purkinje cells are a class of GABAergic inhibitory neurons located in the cerebellum, which is smaller than the cerebrum".
  • Run our NER model and, for each pair of entities in a paragraph that could match the subject and object of one of the relations of interest, create a training sample to annotate (see the sketch after this list).
  • Ask the expert to annotate the paragraphs (this should be simple sentence classification). Mask the subject and object with their entity types.
  • Evaluate co-occurrence model performance vs. sentence classification for this RE problem.
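
A minimal sketch of the candidate-generation and masking step referenced above, assuming a spaCy-style NER pipeline; the model path, the RELATION_SCHEMA mapping, and the helper name re_candidates are hypothetical:

```python
# Sketch only: create one masked RE training sample per (subject, object) candidate pair.
import itertools

import spacy

# Hypothetical path to our trained NER model.
nlp = spacy.load("data/ner_model")

# Relations of interest, keyed by (subject entity type, object entity type).
RELATION_SCHEMA = {
    ("GENE", "BRAIN_REGION"): "is-in",
    ("GENE", "CELL_COMPARTMENT"): "is-in",
}


def re_candidates(paragraph):
    """Yield one sample to annotate per entity pair matching the relation schema."""
    doc = nlp(paragraph)
    for subj, obj in itertools.permutations(doc.ents, 2):
        if (subj.label_, obj.label_) not in RELATION_SCHEMA:
            continue
        # Mask the subject with [[ENT_TYPE]] and the object with <<ENT_TYPE>>,
        # replacing the later span first so character offsets stay valid.
        masked = paragraph
        for span, mask in sorted(
            [(subj, f"[[{subj.label_}]]"), (obj, f"<<{obj.label_}>>")],
            key=lambda pair: -pair[0].start_char,
        ):
            masked = masked[: span.start_char] + mask + masked[span.end_char :]
        yield {
            "text": masked,
            "relation": RELATION_SCHEMA[(subj.label_, obj.label_)],
            "meta": {"subj_type": subj.label_, "obj_type": obj.label_},
        }
```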

Dependencies

  • Before we do this, we need to have a well-performing NER model.

Footnotes

  1. relation types = classes of the sequence classification problem


FrancescoCasalegno commented Jul 26, 2022

Planning 2022-07-26

We have taken the following decisions.

  • Annotating a first batch of texts for RE (= this Issue) takes higher priority than annotating a third batch for NER (which we will still have to do anyway, but after getting a first version of RE).
  • We will start collecting annotations just for one relation type, e.g. (GENE, is-in, BRAIN_REGION), so we can test our process for collecting RE annotations and train RE models. We will then collect annotations for the other relation types at a later stage.

We also defined the following action items for this Issue.

  • Decide whether it is better to show the human annotator masked subject and object spans – e.g. replacing the spans with [[GENE]] and <<BRAIN_REGION>> – or to just highlight those spans. Note that the masking approach would be consistent with what happens during training/evaluation of an ML model for RE.
  • Collect a set of paragraphs, from several relevant scientific articles, where our NER model predicts the occurrence of both a GENE and a BRAIN_REGION entity. Verify that these NER-predicted spans are indeed correct¹ and, if so, add those paragraphs to the set of paragraphs to be annotated for the RE task by the expert (see the filtering sketch after this list).
  • Set up Prodigy and let the human expert start annotating the sentences for RE.
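
A minimal sketch of the paragraph-filtering step referenced above, assuming a spaCy-style NER model; the model path and helper name are hypothetical, and the predicted spans would still be verified by a human before annotation:

```python
# Sketch only: keep paragraphs where the NER model predicts both entity types of interest.
import spacy

# Hypothetical path to our trained NER model.
nlp = spacy.load("data/ner_model")


def cooccurrence_paragraphs(paragraphs, subj_type="GENE", obj_type="BRAIN_REGION"):
    """Select candidate paragraphs for RE annotation based on predicted co-occurrence."""
    for paragraph in paragraphs:
        labels = {ent.label_ for ent in nlp(paragraph).ents}
        if subj_type in labels and obj_type in labels:
            yield paragraph
```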

Footnotes

  1. If the NER predictions are wrong, then it does not make sense to collect annotations for the RE task. This is related to our earlier discussions about multiplicative error propagation in our NLP pipeline (Retrieval -> NER -> RE -> Entity Linking).


FrancescoCasalegno commented Aug 3, 2022

Planning 2022-08-02

  • No updates on the action items from the previous planning (see above).
  • For the paragraphs to annotate, we agreed that a possible solution is to consider the same paragraphs used to collect NER annotations. By doing so, we should be sure (up to human errors in the NER annotations!) that the NER stage is already correct and therefore we only have to worry about RE.
  • More precisely, we should use the "corrected" ground-truth NER annotations (see Perform Error Analysis of NER model predictions #607) rather than the "raw" ones provided by the human expert, since they should be a bit more accurate.


FrancescoCasalegno commented Aug 16, 2022

Planning 2022-08-16

Decisions

  • The approach using "ground-truth" NER annotations for the RE annotations would not be enough, because only a small subset (?) of those paragraphs contains, e.g., both a GENE and a BRAIN_REGION.
  • Instead, we should use an NER model trained on all our available NER annotations (this model overfits, and is therefore pretty much equivalent to human annotations on this data!) and feed the RE annotation process with both the sentences used for NER and new sentences.
  • Keep the subject and object content highlighted by [[...]] and <<...>>, respectively. In this way, the expert can REJECT the RE training sample if the NER annotations are incorrect.
  • Use textcat.manual from Prodigy to collect NLI-style annotations (see the sketch of the task source file after this list).
    • Does the context imply that [[subject]] is-in <<object>>?
    • Options are ENTAILMENT (is-in), CONTRADICTION (is-not-in), NEUTRAL (no info on the relation).
    • Also, click on X (REJECT) if the NER annotations are wrong and/or the text is gibberish and the sample should be discarded.
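
A minimal sketch of what the source file for textcat.manual could look like; the file name and example sentence are made up, and in practice these records would come from the NER-based candidate generation:

```python
# Sketch only: write the RE candidates to a JSONL source file for Prodigy's textcat.manual.
import json

# Hypothetical example; subject and object content kept, highlighted by [[...]] and <<...>>.
samples = [
    {
        "text": "[[Pvalb]] is strongly expressed in the <<cerebellum>> of adult mice.",
        "meta": {"subj_type": "GENE", "obj_type": "BRAIN_REGION"},
    },
]

with open("re_is_in_tasks.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```

Assuming a standard Prodigy setup, the annotation session could then be launched with something like `prodigy textcat.manual re_is_in_batch1 re_is_in_tasks.jsonl --label ENTAILMENT,CONTRADICTION,NEUTRAL --exclusive` (dataset and file names are hypothetical).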

Actions

  • Re-fit an NER model on all our NER-annotated data.
  • Perform inference on the paragraphs used for the NER annotations. How many RE training samples are obtained this way?
  • Ask the human expert to annotate this first batch of RE training samples.
  • If we have < 500 RE samples, collect more paragraphs (where our NER model predicts co-occurrence of the given entity types) and ask the human expert to annotate this second batch as well.

@FrancescoCasalegno

Planning 2022-08-16

Discussion

  • Feedback from the scientist tasked with RE annotation is as follows:
    • re.manual (= draw arrows) is better than textcat.manual (= click a button: is-in, is-not-in, no-relation)
    • the is-in relation can occur for more entity type pairs than just (subj=GENE, obj=BRAIN_REGION)
    • the following table summarizes which (subj, obj) pairs are possible for this relation type:

      | subj \ obj | BRAIN_REGION | CELL_COMPARTMENT | CELL_TYPE | GENE | ORGANISM |
      | --- | --- | --- | --- | --- | --- |
      | BRAIN_REGION | X (?) | X | X | X | INTERESTED (cannot be IS NOT IN) |
      | CELL_COMPARTMENT | X | X | INTERESTED (cannot be IS NOT IN) | X | X |
      | CELL_TYPE | INTERESTED | X | X | X | INTERESTED |
      | GENE | INTERESTED | INTERESTED | INTERESTED | X | INTERESTED |
      | ORGANISM | X | X | X | X | X |
    • While the NER annotations required expert domain knowledge, RE annotation for a simple relation like is-in can perhaps also be done by someone who is not a domain expert.

Action Items

  • Set up annotations for RE using textcat.manual, including only samples where the subj and obj entity types form one of the acceptable pairs (see table above).
  • Each person in the ML team should annotate 30 paragraphs. Out of these, 5 paragraphs will be shared among all annotators, to evaluate inter-rater agreement and as a sanity check (see the agreement sketch after this list).
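
A minimal sketch of the agreement check on the shared paragraphs, using Cohen's kappa from scikit-learn; annotator names and labels are made up:

```python
# Sketch only: pairwise inter-rater agreement on the 5 shared paragraphs.
from itertools import combinations

from sklearn.metrics import cohen_kappa_score

# Hypothetical labels, one list per annotator, aligned on the same 5 shared paragraphs.
annotations = {
    "annotator_a": ["ENTAILMENT", "NEUTRAL", "ENTAILMENT", "CONTRADICTION", "NEUTRAL"],
    "annotator_b": ["ENTAILMENT", "NEUTRAL", "NEUTRAL", "CONTRADICTION", "NEUTRAL"],
    "annotator_c": ["ENTAILMENT", "ENTAILMENT", "ENTAILMENT", "CONTRADICTION", "NEUTRAL"],
}

for (name_a, labels_a), (name_b, labels_b) in combinations(annotations.items(), 2):
    kappa = cohen_kappa_score(labels_a, labels_b)
    print(f"{name_a} vs {name_b}: Cohen's kappa = {kappa:.2f}")
```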

@FrancescoCasalegno

Update 2022-09-13

Actions

  • Use HTML to color-highlight the SUBJ and OBJ candidates in the Prodigy GUI (see the sketch after this list).
  • Limit the number of samples per paragraph to 10 (currently 16, which seemed too many).
  • Collect 200 samples from the domain expert.
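
A minimal sketch of the highlighting step; it assumes the SUBJ and OBJ character offsets are available from the NER predictions and that the resulting string can be shown as HTML in Prodigy (colors, example text, and offsets are arbitrary):

```python
# Sketch only: wrap the SUBJ and OBJ candidates in colored <span> tags for the Prodigy GUI.
def highlight(text, subj_span, obj_span):
    """subj_span and obj_span are (start, end) character offsets into `text`."""
    spans = [(subj_span, "#ffd54f"), (obj_span, "#81d4fa")]  # arbitrary SUBJ / OBJ colors
    # Replace the later span first so that earlier character offsets stay valid.
    for (start, end), color in sorted(spans, key=lambda item: -item[0][0]):
        text = (
            text[:start]
            + f'<span style="background-color: {color}">{text[start:end]}</span>'
            + text[end:]
        )
    return text


task = {
    "html": highlight("Pvalb is strongly expressed in the cerebellum.", (0, 5), (35, 45)),
    "meta": {"subj_type": "GENE", "obj_type": "BRAIN_REGION"},
}
```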
