DDXPlus: A New Dataset For Automatic Medical Diagnosis

Appearing in the NeurIPS 2022 Datasets and Benchmarks track

We are releasing a new large-scale dataset for Automatic Symptom Detection (ASD) and Automatic Diagnosis (AD) systems in the medical domain under the CC-BY licence.

The dataset contains patients synthesized using a proprietary medical knowledge base and a commercial rule-based ASD system. Each patient is characterized by socio-demographic data, the pathology they are suffering from, a set of symptoms and antecedents related to this pathology, and a differential diagnosis. The symptoms and antecedents can be binary, categorical or multi-choice, which has the potential to make interactions between ASD/AD systems and patients more efficient and natural.

To the best of our knowledge, this is the first large-scale dataset that includes the differential diagnosis, and non-binary symptoms and antecedents.


Availability

Our paper is available on arXiv.
The dataset in French is hosted on figshare.
- This is the original version of DDXPlus, on which all results in our paper were obtained.
Since 9 May 2023, the dataset has also been available in English for easier use. This version is also hosted on figshare.
- The English version of DDXPlus contains the same data in the same format as the French version.
- Wherever possible, English names or non-semantic codes are used instead of French names.
- Using the English version should lead to the same performance as using the French version.

Dataset documentation

In what follows, we use the term evidence as a general term to refer to a symptom or an antecedent. The dataset contains the following files:

release_evidences.json: a JSON file describing all possible evidences considered in the dataset.
release_conditions.json: a JSON file describing all pathologies considered in the dataset.
release_train_patients.zip: a zipped CSV file containing the patients of the training set.
release_validate_patients.zip: a zipped CSV file containing the patients of the validation set.
release_test_patients.zip: a zipped CSV file containing the patients of the test set.
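
As a minimal sketch, a patient file could be read with Python's standard library as shown below. It assumes that each archive holds a single CSV file with a header row, which is not stated explicitly above. The helper name read_patients is illustrative and not part of the release.

```python
import csv
import io
import zipfile


def read_patients(zip_source):
    """Return the rows of the first CSV member of a zip archive as dicts.

    Assumption: the archive contains one CSV file with a header row.
    `zip_source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
        if not csv_names:
            raise ValueError("no CSV file found in the archive")
        with archive.open(csv_names[0]) as raw:
            # Decode the binary stream as UTF-8 text for the csv module.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))
```

For example, read_patients("release_test_patients.zip") would return one dictionary per patient. Every column value is returned as a string, so fields such as AGE need to be converted by the caller.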

Evidence description

Each evidence in the release_evidences.json file is described using the following entries:

name: name of the evidence.
- In the English version, this is replaced with a unique, non-semantic code starting with E.
code_question: a code identifying which evidences are related. Evidences with the same code_question form a group of related symptoms. The value of code_question is the name of the evidence that needs to be simulated/activated before the other members of the group can be simulated.
question_fr: the query, in French, associated with the evidence.
question_en: the query, in English, associated with the evidence.
is_antecedent: a flag indicating whether the evidence is an antecedent or a symptom.
data_type: the type of the evidence. We use "B" for binary, "C" for categorical, and "M" for multi-choice.
default_value: the default value of the evidence. An evidence that takes this value is treated as if it had not been synthesized.
possible-values: the possible values for the evidence. Only valid for categorical and multi-choice evidences.
- In the English version, every value is replaced with a unique, non-semantic code starting with V.
value_meaning: the meaning, in French and English, of each code listed in the possible-values field. Only valid for categorical and multi-choice evidences.

Example

English

{
    "name": "E_130",
    "code_question": "E_129",
    "question_fr": "De quelle couleur sont les lésions?",
    "question_en": "What color is the rash?",
    "is_antecedent": false,
    "default_value": "V_11",
    "value_meaning": {
        "V_11": {"fr": "NA", "en": "NA"},
        "V_86": {"fr": "foncée", "en": "dark"},
        "V_107": {"fr": "jaune", "en": "yellow"},
        "V_138": {"fr": "pâle", "en": "pale"},
        "V_156": {"fr": "rose", "en": "pink"},
        "V_157": {"fr": "rouge", "en": "red"}
    },
    "possible-values": [
        "V_11",
        "V_86",
        "V_107",
        "V_138",
        "V_156",
        "V_157"
    ],
    "data_type": "C"
}

French

{
    "name": "lesions_peau_couleur",
    "code_question": "lesions_peau",
    "question_fr": "De quelle couleur sont les lésions?",
    "question_en": "What color is the rash?",
    "is_antecedent": false,
    "default_value": "NA",
    "value_meaning": {
        "NA": {"fr": "NA", "en": "NA"},
        "foncee": {"fr": "foncée", "en": "dark"},
        "jaune": {"fr": "jaune", "en": "yellow"},
        "pale": {"fr": "pâle", "en": "pale"},
        "rose": {"fr": "rose", "en": "pink"},
        "rouge": {"fr": "rouge", "en": "red"}
    },
    "possible-values": [
        "NA",
        "foncee",
        "jaune",
        "pale",
        "rose",
        "rouge"
    ],
    "data_type": "C"
}
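
The entries above can be used to recover groups of related evidences and to translate value codes into readable labels. The following is a minimal sketch that operates on evidence entries shaped like the two examples. The function names are illustrative, and how the entries are laid out at the top level of release_evidences.json (list or mapping) is left to the caller, since it is not described here.

```python
from collections import defaultdict


def group_by_code_question(evidences):
    """Map each code_question to the names of the evidences in its group.

    `evidences` is an iterable of evidence entries (dicts) that each
    carry at least the "name" and "code_question" keys.
    """
    groups = defaultdict(list)
    for entry in evidences:
        groups[entry["code_question"]].append(entry["name"])
    return dict(groups)


def value_label(entry, value, lang="en"):
    """Return the meaning of a categorical or multi-choice value code.

    `lang` is "en" or "fr", matching the keys used in value_meaning.
    """
    return entry["value_meaning"][value][lang]
```

With the English example entry above, value_label(entry, "V_86") returns "dark" and value_label(entry, "V_86", lang="fr") returns "foncée".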

Pathology description

The file release_conditions.json contains information about the pathologies that patients in the dataset may suffer from. Each pathology has the following attributes:

condition_name: name of the pathology.
- In the English version, the English name is used instead of the French name.
cond-name-fr: name of the pathology in French.
cond-name-eng: name of the pathology in English.
icd10-id: ICD-10 code of the pathology.
severity: the severity associated with the pathology. The lower the value, the more severe the pathology.
symptoms: data structure describing the set of symptoms characterizing the pathology. Each symptom is represented by its corresponding name entry in the release_evidences.json file.
antecedents: data structure describing the set of antecedents characterizing the pathology. Each antecedent is represented by its corresponding name entry in the release_evidences.json file.

Example

English

{
    "condition_name": "Myasthenia gravis",
    "cond-name-fr": "Myasthénie grave",
    "cond-name-eng": "Myasthenia gravis",
    "icd10-id": "G70.0",
    "symptoms": {
        "E_65": {},
        "E_63": {},
        "E_52": {},
        "E_172": {},
        "E_84": {},
        "E_66": {},
        "E_90": {},
        "E_38": {},
        "E_176": {}
    },
    "antecedents": {
        "E_28": {},
        "E_204": {}
    },
    "severity": 3
}

French

{
    "condition_name": "Myasthénie grave",
    "cond-name-fr": "Myasthénie grave",
    "cond-name-eng": "Myasthenia gravis",
    "icd10-id": "G70.0",
    "symptoms": {
        "dysphagie": {},
        "dysarthrie": {},
        "diplopie": {},
        "ptose": {},
        "faiblesse_msmi": {},
        "dyspn": {},
        "fatigabilité_msk": {},
        "claud_mâchoire": {},
        "rds_paralys_gen": {}
    },
    "antecedents": {
        "atcdfam_mg": {},
        "trav1": {}
    },
    "severity": 3
}
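
Because each pathology lists its symptoms and antecedents by evidence name, it is straightforward to build a reverse lookup from an evidence to the pathologies it characterizes. The sketch below assumes pathology entries shaped like the examples above. The function name conditions_by_evidence is illustrative.

```python
from collections import defaultdict


def conditions_by_evidence(conditions):
    """Map each evidence name to the sorted condition names that list it.

    `conditions` is an iterable of pathology entries, each with a
    "condition_name" and optional "symptoms" and "antecedents" mappings
    keyed by evidence name.
    """
    index = defaultdict(set)
    for cond in conditions:
        for field in ("symptoms", "antecedents"):
            for evidence_name in cond.get(field, {}):
                index[evidence_name].add(cond["condition_name"])
    return {name: sorted(names) for name, names in index.items()}
```

Applied to the English example alone, the result maps both "E_65" (a symptom) and "E_204" (an antecedent) to ["Myasthenia gravis"].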

Patient description

Each patient in each of the 3 sets has the following attributes:

AGE: the age of the synthesized patient.
SEX: the sex of the synthesized patient.
PATHOLOGY: name of the ground truth pathology (cf. the condition_name property in the release_conditions.json file) that the synthesized patient is suffering from.
EVIDENCES: list of evidences experienced by the patient. An evidence can either be binary, categorical or multi-choice. A categorical or multi-choice evidence is represented in the format [evidence-name]_@_[evidence-value] where [evidence-name] is the name of the evidence (name entry in the release_evidences.json file) and [evidence-value] is a value from the possible-values entry. Note that for a multi-choice evidence, it is possible to have several [evidence-name]_@_[evidence-value] items in the evidence list, with each item being associated with a different evidence value. A binary evidence is represented as [evidence-name].
INITIAL_EVIDENCE: the evidence provided by the patient to kick-start an interaction with an ASD/AD system. This is useful during model evaluation for a fair comparison of ASD/AD systems as they will all begin an interaction with a given patient from the same starting point. The initial evidence is randomly selected from the evidence list mentioned above (i.e., EVIDENCES) and it is part of this list.
DIFFERENTIAL_DIAGNOSIS: the ground truth differential diagnosis for the patient. It is represented as a list of pairs of the form [[patho_1, proba_1], [patho_2, proba_2], ...] where patho_i is the pathology name (condition_name entry in the release_conditions.json file) and proba_i is its associated probability.
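
The EVIDENCES encoding described above can be decoded by splitting each item on the _@_ separator. The following is a minimal sketch with illustrative function names. It assumes that the EVIDENCES field has already been decoded into a list of strings; how the list is serialized inside the CSV file is not covered here.

```python
SEPARATOR = "_@_"


def parse_evidence(item):
    """Split an EVIDENCES item into (name, value); value is None if binary."""
    name, sep, value = item.partition(SEPARATOR)
    return (name, value) if sep else (name, None)


def collect_evidences(items):
    """Group a patient's EVIDENCES list into {name: [values]}.

    Binary evidences map to an empty list. Multi-choice evidences may
    accumulate several values. Values are kept as strings.
    """
    collected = {}
    for item in items:
        name, value = parse_evidence(item)
        values = collected.setdefault(name, [])
        if value is not None:
            values.append(value)
    return collected
```

For instance, parse_evidence("E_54_@_V_161") returns ("E_54", "V_161"), while parse_evidence("E_48") returns ("E_48", None).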

Example

English

{
    "AGE": 18,
    "DIFFERENTIAL_DIAGNOSIS": [["Bronchitis", 0.19171203430383882], ["Pneumonia", 0.17579340398940366], ["URTI", 0.1607809719801254], ["Bronchiectasis", 0.12429044460990353], ["Tuberculosis", 0.11367177304035844], ["Influenza", 0.11057936110639896], ["HIV (initial infection)", 0.07333003867293564], ["Chagas", 0.04984197229703562]],
    "SEX": "M",
    "PATHOLOGY": "URTI",
    "EVIDENCES": ["E_48", "E_50", "E_53", "E_54_@_V_161", "E_54_@_V_183", "E_55_@_V_89", "E_55_@_V_108", "E_55_@_V_167", "E_56_@_4", "E_57_@_V_123", "E_58_@_3", "E_59_@_3", "E_77", "E_79", "E_91", "E_97", "E_201", "E_204_@_V_10", "E_222"],
    "INITIAL_EVIDENCE": "E_91"
}

French

{
    "AGE": 18, 
    "DIFFERENTIAL_DIAGNOSIS": [["Bronchite", 0.19171203430383882], ["Pneumonie", 0.17579340398940366], ["IVRS ou virémie", 0.1607809719801254], ["Bronchiectasies", 0.12429044460990353], ["Tuberculose", 0.11367177304035844], ["Possible influenza ou syndrome virémique typique", 0.11057936110639896], ["VIH (Primo-infection)", 0.07333003867293564], ["Chagas", 0.04984197229703562]],
    "SEX": "M", 
    "PATHOLOGY": "IVRS ou virémie", 
    "EVIDENCES": ["crowd", "diaph", "douleurxx", "douleurxx_carac_@_sensible", "douleurxx_carac_@_une_lourdeur_ou_serrement", "douleurxx_endroitducorps_@_front", "douleurxx_endroitducorps_@_joue_D_", "douleurxx_endroitducorps_@_tempe_G_", "douleurxx_intens_@_4", "douleurxx_irrad_@_nulle_part", "douleurxx_precis_@_3", "douleurxx_soudain_@_3", "expecto", "f17.210", "fievre", "gorge_dlr", "toux", "trav1_@_N", "z77.22"], 
    "INITIAL_EVIDENCE": "fievre"
}
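
When working with the DIFFERENTIAL_DIAGNOSIS field, two common operations are locating the ground truth pathology within the differential and extracting the most probable pathologies. The sketch below shows one way to do both. The function names are illustrative, and the input is assumed to be a list of [pathology, probability] pairs as in the examples above.

```python
def rank_in_differential(differential, pathology):
    """Return the 1-based rank of `pathology` by descending probability.

    Returns None if the pathology is absent from the differential.
    """
    ordered = sorted(differential, key=lambda pair: pair[1], reverse=True)
    for rank, (name, _proba) in enumerate(ordered, start=1):
        if name == pathology:
            return rank
    return None


def top_k(differential, k):
    """Return the names of the k most probable pathologies."""
    ordered = sorted(differential, key=lambda pair: pair[1], reverse=True)
    return [name for name, _proba in ordered[:k]]
```

In the English example patient above, the ground truth pathology "URTI" has rank 3 in the differential, and the eight probabilities sum to approximately 1.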

Dataset statistics

Pathology statistics

Socio-demographic statistics

Distribution of the evidence types

	Binary	Categorical	Multi-choice	Total
Evidences	208	10	5	223
Symptoms	96	9	5	110
Antecedents	112	1	0	113

Number of evidences of the synthesized patients

	Avg	Std dev	Min	1st quartile	Median	3rd quartile	Max
Evidences	13.56	5.06	1	10	13	17	36
Symptoms	10.07	4.69	1	8	10	12	25
Antecedents	3.49	2.23	0	2	3	5	12

Differential diagnosis statistics

Experiments

Code for reproducing the results in the paper can be found in the code directory.

In our paper, we report results for two methods: AARLC, an RL-based method, and BASD, a supervised method adapted from ASD. For instructions on how to run them, see here for AARLC and here for BASD.
