This repository has been archived by the owner on Jan 29, 2024. It is now read-only.

Add more than one acceptable answer in our Question-Answering dataset #617

Open
2 tasks
FrancescoCasalegno opened this issue Aug 24, 2022 · 1 comment
Labels
↩️ question-answering Attribute values extraction using QA models

Comments

@FrancescoCasalegno
Contributor

FrancescoCasalegno commented Aug 24, 2022

Context

  • In #612 (Question-Answering: collect example questions + run first analysis with pre-trained QA models) we started creating a Question-Answering dataset for evaluating and training Extractive QA models.
  • This was done following the style of popular datasets for Extractive QA like SQuAD.
  • However, for the sake of simplicity, we only annotated one ground-truth answer for each sample.
  • In SQuAD, more than one ground-truth answer is annotated, and we should do the same. Otherwise our evaluation (the EM score in particular, but also F1) is biased: a prediction that is correct but worded differently from the single annotated answer gets penalized.
  • For instance
    • Question: "What is the population size of Geneva?"
    • Context: "It is estimated that about 200,000 people live in Geneva."
    • Acceptable Answers: ["200,000", "about 200,000", "200,000 people", "about 200,000 people"]
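
To make the bias concrete, here is a minimal sketch of how EM and F1 could be scored against a set of acceptable answers, taking the best score over the set. It loosely follows the normalization used in SQuAD-style evaluation (lowercasing, stripping punctuation and English articles); the function names `normalize`, `exact_match`, `f1` and `best_score` are hypothetical and not taken from our code base.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1(prediction: str, truth: str) -> float:
    """Token-level F1 between the normalized prediction and ground truth."""
    pred_tokens = normalize(prediction).split()
    truth_tokens = normalize(truth).split()
    if not pred_tokens or not truth_tokens:
        # Both empty counts as a match, only one empty as a miss.
        return float(pred_tokens == truth_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


def best_score(metric, prediction: str, acceptable_answers: list) -> float:
    """Score a prediction against every acceptable answer and keep the best."""
    return max(metric(prediction, answer) for answer in acceptable_answers)


prediction = "about 200,000 people"
single = ["200,000"]
multiple = ["200,000", "about 200,000", "200,000 people", "about 200,000 people"]

for name, metric in [("EM", exact_match), ("F1", f1)]:
    print(name, best_score(metric, prediction, single), best_score(metric, prediction, multiple))
```

With the single annotated answer, the prediction "about 200,000 people" gets EM = 0.0 and F1 = 0.5; with the four acceptable answers, both become 1.0.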

Actions

  • Add to our QA dataset more than one acceptable answer for each sample.
  • Re-run evaluation and compare results. We should expect equal or higher scores for all models, since each prediction is now compared against a larger set of acceptable answers.
@FrancescoCasalegno
Contributor Author

FrancescoCasalegno commented Sep 12, 2022

We can use the same JSON format as SQuAD, as follows. Note that the sample with id=0 has more than one answer, while the sample with id=1 has none.

{
    "data": [
        {
            "id": "0",
            "title": "henry-markram-0",
            "context": "The complexity of the central nervous system is amazing: there are approximately 100 billion neurons in the brain and spinal cord combined. As many as 10,000 different subtypes of neurons have been identified, each specialized to send and receive certain types of information. Each neuron is made up of a cell body, which houses the nucleus. Axons and dendrites form extensions from the cell body.",
            "question": "What is the number of neurons of the central nervous system?",
            "answers": {
                "text": [
                    "100 billion",
                    "approximately 100 billion",
                    "100 billion neurons",
                    "approximately 100 billion neurons"
                ],
                "answer_start": [
                    81,
                    67,
                    81,
                    67
                ]
            }
        },
        {
            "id": "1",
            "title": "henry-markram-1",
            "context": "The complexity of the central nervous system is amazing: there are approximately 100 billion neurons in the brain and spinal cord combined. As many as 10,000 different subtypes of neurons have been identified, each specialized to send and receive certain types of information. Each neuron is made up of a cell body, which houses the nucleus. Axons and dendrites form extensions from the cell body.",
            "question": "What is the number of neurons of the spinal cord?",
            "answers": {
                "text": [],
                "answer_start": []
            }
        },
...
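
Since every entry in "answer_start" has to be the character offset of the matching "text" entry inside "context", a small sanity check may be useful when adding answers by hand. The sketch below is only an illustration: `find_misaligned_answers` is a hypothetical helper, and the sample is a shortened copy of the id=0 entry above.

```python
def find_misaligned_answers(sample: dict) -> list:
    """Return the answer texts whose answer_start does not point at them in the context."""
    context = sample["context"]
    answers = sample["answers"]
    return [
        text
        for text, start in zip(answers["text"], answers["answer_start"])
        if context[start:start + len(text)] != text
    ]


# Shortened copy of the id=0 sample (context truncated after the first sentence).
sample = {
    "id": "0",
    "context": (
        "The complexity of the central nervous system is amazing: there are "
        "approximately 100 billion neurons in the brain and spinal cord combined."
    ),
    "answers": {
        "text": [
            "100 billion",
            "approximately 100 billion",
            "100 billion neurons",
            "approximately 100 billion neurons",
        ],
        "answer_start": [81, 67, 81, 67],
    },
}

# Unanswerable samples (empty lists, like id=1) trivially pass the check.
unanswerable = {"id": "1", "context": sample["context"], "answers": {"text": [], "answer_start": []}}

print(find_misaligned_answers(sample))
print(find_misaligned_answers(unanswerable))
```

Both calls print an empty list for the data shown here; an offset that is off by even one character would cause the corresponding answer text to be reported.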
