
Collect annotations for RE model #606

Open · FrancescoCasalegno opened this issue Jun 22, 2022 · 5 comments
Labels: 🔀 relation-extraction

Comments


FrancescoCasalegno commented Jun 22, 2022

Actions

  • Get an exhaustive list of all the relations that we want to extract, e.g.
    • (GENE, is-in, BRAIN_REGION)
    • (GENE, is-in, CELL_COMPARTMENT)
    • ...
  • Define the possible "relation types"¹ for each (SUBJ_ENT_TYPE, OBJ_ENT_TYPE) pair:
    • is-in: "Purkinje cells are a class of GABAergic inhibitory neurons located in the cerebellum".
    • is-not-in: "Pyramidal neurons are found in forebrain structures such as the cerebral cortex but not in the olfactory bulbs".
    • no info on is-in: "Purkinje cells are a class of GABAergic inhibitory neurons located in the cerebellum, which is smaller than the cerebrum".
  • Run our NER model and, for each pair of entities in a paragraph that could match the subject and object of one of the relations of interest, create a training sample to annotate (see the sketch after this list).
  • Ask the expert to annotate the paragraphs (this should be simple sentence classification). Mask the subject and object with their entity types.
  • Evaluate co-occurrence model performance vs. sentence classification for this RE problem.
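
A minimal sketch of the candidate-generation and masking step referenced above, assuming a spaCy-style NER pipeline; the model path, the RELATION_SCHEMA mapping, and the helper name re_candidates are hypothetical:

```python
# Sketch only: create one masked RE training sample per (subject, object) candidate pair.
import itertools

import spacy

# Hypothetical path to our trained NER model.
nlp = spacy.load("data/ner_model")

# Relations of interest, keyed by (subject entity type, object entity type).
RELATION_SCHEMA = {
    ("GENE", "BRAIN_REGION"): "is-in",
    ("GENE", "CELL_COMPARTMENT"): "is-in",
}


def re_candidates(paragraph):
    """Yield one sample to annotate per entity pair matching the relation schema."""
    doc = nlp(paragraph)
    for subj, obj in itertools.permutations(doc.ents, 2):
        if (subj.label_, obj.label_) not in RELATION_SCHEMA:
            continue
        # Mask the subject with [[ENT_TYPE]] and the object with <<ENT_TYPE>>,
        # replacing the later span first so character offsets stay valid.
        masked = paragraph
        for span, mask in sorted(
            [(subj, f"[[{subj.label_}]]"), (obj, f"<<{obj.label_}>>")],
            key=lambda pair: -pair[0].start_char,
        ):
            masked = masked[: span.start_char] + mask + masked[span.end_char :]
        yield {
            "text": masked,
            "relation": RELATION_SCHEMA[(subj.label_, obj.label_)],
            "meta": {"subj_type": subj.label_, "obj_type": obj.label_},
        }
```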

Dependencies

  • Before we do this, we need to have a well-performing NER model.

Footnotes

  1. relation types = classes of the sequence classification problem


FrancescoCasalegno commented Jul 26, 2022

Planning 2022-07-26

We have taken the following decisions.

  • Annotating a first batch of texts for RE (= this Issue) takes higher priority than annotating a third batch for NER (which we will still have to do anyway, but after getting a first version of RE).
  • We will start collecting annotations just for one relation type, e.g. (GENE, is-in, BRAIN_REGION), so we can test our process for collecting RE annotations and train RE models. We will then collect annotations for the other relation types at a later stage.

We also defined the following action items for this Issue.

  • Decide whether it is better to show the human annotator masked subject and object spans – e.g. replacing the spans with [[GENE]] and <<BRAIN_REGION>> – or to just highlight those spans. Note that the masking approach would be consistent with what happens during training/evaluation of an ML model for RE.
  • Collect a set of paragraphs, from several relevant scientific articles, where our NER model predicts the occurrence of both a GENE and a BRAIN_REGION entity. Verify that these NER-predicted spans are indeed correct¹ and, if so, add those paragraphs to the set of paragraphs to be annotated for the RE task by the expert (see the filtering sketch after this list).
  • Set up Prodigy and let the human expert start annotating the sentences for RE.
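
A minimal sketch of the paragraph-filtering step referenced above, assuming a spaCy-style NER model; the model path and helper name are hypothetical, and the predicted spans would still be verified by a human before annotation:

```python
# Sketch only: keep paragraphs where the NER model predicts both entity types of interest.
import spacy

# Hypothetical path to our trained NER model.
nlp = spacy.load("data/ner_model")


def cooccurrence_paragraphs(paragraphs, subj_type="GENE", obj_type="BRAIN_REGION"):
    """Select candidate paragraphs for RE annotation based on predicted co-occurrence."""
    for paragraph in paragraphs:
        labels = {ent.label_ for ent in nlp(paragraph).ents}
        if subj_type in labels and obj_type in labels:
            yield paragraph
```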

Footnotes

  1. If the NER predictions are wrong, then it does not make sense to collect annotations for the RE task. This is related to our earlier discussions about multiplicative error propagation in our NLP pipeline (Retrieval -> NER -> RE -> Entity Linking).


FrancescoCasalegno commented Aug 3, 2022

Planning 2022-08-02

  • No updates on the action items from the previous planning (see above).
  • For the paragraphs to annotate, we agreed that a possible solution is to consider the same paragraphs used to collect NER annotations. By doing so, we should be sure (up to human errors in the NER annotations!) that the NER stage is already correct and therefore we only have to worry about RE.
  • More precisely, we should use the "corrected" ground-truth NER annotations (see Perform Error Analysis of NER model predictions #607) rather than the "raw" ones provided by the human expert, since they should be a bit more accurate.


FrancescoCasalegno commented Aug 16, 2022

Planning 2022-08-16

Decisions

  • The approach using "ground-truth" NER annotations for the RE annotations would not be enough, because only a small subset (?) of those paragraphs contains, e.g., both a GENE and a BRAIN_REGION.
  • Instead, we should use an NER model trained on all our available NER annotations (this model overfits, and is therefore pretty much equivalent to human annotations on this data!) and feed the RE annotation process with both the sentences used for NER and new sentences.
  • Keep the subject and object content highlighted by [[...]] and <<...>>, respectively. In this way, the expert can REJECT the RE training sample if the NER annotations are incorrect.
  • Use textcat.manual from Prodigy to collect NLI-style annotations (see the sketch of the task source file after this list).
    • Does the context imply that [[subject]] is-in <<object>>?
    • Options are ENTAILMENT (is-in), CONTRADICTION (is-not-in), NEUTRAL (no info on the relation).
    • Also, click on X (REJECT) if the NER annotations are wrong and/or the text is gibberish and the sample should be discarded.
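
A minimal sketch of what the source file for textcat.manual could look like; the file name and example sentence are made up, and in practice these records would come from the NER-based candidate generation:

```python
# Sketch only: write the RE candidates to a JSONL source file for Prodigy's textcat.manual.
import json

# Hypothetical example; subject and object content kept, highlighted by [[...]] and <<...>>.
samples = [
    {
        "text": "[[Pvalb]] is strongly expressed in the <<cerebellum>> of adult mice.",
        "meta": {"subj_type": "GENE", "obj_type": "BRAIN_REGION"},
    },
]

with open("re_is_in_tasks.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```

Assuming a standard Prodigy setup, the annotation session could then be launched with something like `prodigy textcat.manual re_is_in_batch1 re_is_in_tasks.jsonl --label ENTAILMENT,CONTRADICTION,NEUTRAL --exclusive` (dataset and file names are hypothetical).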

Actions

  • Re-fit an NER model on all our NER-annotated data.
  • Perform inference on the paragraphs used for the NER annotations. How many RE training samples are obtained this way?
  • Ask the human expert to annotate this first batch of RE training samples.
  • If we have < 500 RE samples, collect more paragraphs (where our NER model predicts co-occurrence of the given entity types) and ask the human expert to annotate this second batch as well.

@FrancescoCasalegno

Planning 2022-08-16

Discussion

  • Feedback from the scientist tasked with RE annotation is as follows:
    • re.manual (= draw arrows) is better than textcat.manual (= click a button: is-in, is-not-in, no-relation)
    • the is-in relation can occur for more entity type pairs than just (subj=GENE, obj=BRAIN_REGION)
    • the following table summarizes which (subj, obj) pairs are possible for this relation type:

      | subj \ obj | BRAIN_REGION | CELL_COMPARTMENT | CELL_TYPE | GENE | ORGANISM |
      | --- | --- | --- | --- | --- | --- |
      | BRAIN_REGION | X (?) | X | X | X | INTERESTED (cannot be IS NOT IN) |
      | CELL_COMPARTMENT | X | X | INTERESTED (cannot be IS NOT IN) | X | X |
      | CELL_TYPE | INTERESTED | X | X | X | INTERESTED |
      | GENE | INTERESTED | INTERESTED | INTERESTED | X | INTERESTED |
      | ORGANISM | X | X | X | X | X |
    • While the NER annotations required expert domain knowledge, RE annotation for a simple relation like is-in can perhaps also be done by someone who is not a domain expert.

Action Items

  • Set up annotations for RE using textcat.manual, including only samples where the subj and obj entity types form one of the acceptable pairs (see table above).
  • Each person in the ML team should annotate 30 paragraphs. Out of these, 5 paragraphs will be shared among all annotators, to evaluate inter-rater agreement and as a sanity check (see the agreement sketch after this list).
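
A minimal sketch of the agreement check on the shared paragraphs, using Cohen's kappa from scikit-learn; annotator names and labels are made up:

```python
# Sketch only: pairwise inter-rater agreement on the 5 shared paragraphs.
from itertools import combinations

from sklearn.metrics import cohen_kappa_score

# Hypothetical labels, one list per annotator, aligned on the same 5 shared paragraphs.
annotations = {
    "annotator_a": ["ENTAILMENT", "NEUTRAL", "ENTAILMENT", "CONTRADICTION", "NEUTRAL"],
    "annotator_b": ["ENTAILMENT", "NEUTRAL", "NEUTRAL", "CONTRADICTION", "NEUTRAL"],
    "annotator_c": ["ENTAILMENT", "ENTAILMENT", "ENTAILMENT", "CONTRADICTION", "NEUTRAL"],
}

for (name_a, labels_a), (name_b, labels_b) in combinations(annotations.items(), 2):
    kappa = cohen_kappa_score(labels_a, labels_b)
    print(f"{name_a} vs {name_b}: Cohen's kappa = {kappa:.2f}")
```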

@FrancescoCasalegno

Update 2022-09-13

Actions

  • Use HTML to color-highlight the SUBJ and OBJ candidates in the Prodigy GUI (see the sketch after this list).
  • Limit the number of samples per paragraph to 10 (currently 16, which seemed too many).
  • Collect 200 samples from the domain expert.
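
A minimal sketch of the highlighting step; it assumes the SUBJ and OBJ character offsets are available from the NER predictions and that the resulting string can be shown as HTML in Prodigy (colors, example text, and offsets are arbitrary):

```python
# Sketch only: wrap the SUBJ and OBJ candidates in colored <span> tags for the Prodigy GUI.
def highlight(text, subj_span, obj_span):
    """subj_span and obj_span are (start, end) character offsets into `text`."""
    spans = [(subj_span, "#ffd54f"), (obj_span, "#81d4fa")]  # arbitrary SUBJ / OBJ colors
    # Replace the later span first so that earlier character offsets stay valid.
    for (start, end), color in sorted(spans, key=lambda item: -item[0][0]):
        text = (
            text[:start]
            + f'<span style="background-color: {color}">{text[start:end]}</span>'
            + text[end:]
        )
    return text


task = {
    "html": highlight("Pvalb is strongly expressed in the cerebellum.", (0, 5), (35, 45)),
    "meta": {"subj_type": "GENE", "obj_type": "BRAIN_REGION"},
}
```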
