notebooks/full_pipeline_example.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Example PrivacyFingerprint Workflow\n",
    "\n",
    "This notebook walks you through a potential end-to-end workflow, to introduce a user to how each component can be loaded and how they can be configured"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import sys\n",
    "from spacy import displacy\n",
    "from pycanon.anonymity import k_anonymity, t_closeness, l_diversity\n",
    "\n",
    "path_root = os.path.dirname(os.getcwd())\n",
    "\n",
    "if path_root not in sys.path:\n",
    "    sys.path.append(path_root)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from src.generate.synthea import GenerateSynthea\n",
    "from src.generate.llm import GenerateLLM\n",
    "from src.extraction.extraction import Extraction\n",
    "from src.standardise_extraction.standardise_extraction import (\n",
    "    StandardiseExtraction,\n",
    ")\n",
    "from src.privacy_risk_scorer.privacy_risk_scorer import PrivacyRiskScorer\n",
    "from src.privacy_risk_explainer.privacy_risk_explainer import (\n",
    "    PrivacyRiskExplainer,\n",
    ")\n",
    "\n",
    "from src.config.experimental_config import load_experimental_config\n",
    "from src.config.global_config import load_global_config"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Importing and Loading Global and Experimental Config\n",
    "\n",
    "**global_config_path**  this is the location of the global config path and then the output folder name is redefined to ensure the example experiments are out in the open. (Normally the default output folder should be used for your own experiments.)\n",
    "\n",
    "**default_config_path** is given so the user can point to the default experimental config values. Currently the pipeline copies the original experimental config down into the folder, and if this exists, only uses the experimental config defined in that folder."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Loads global config and redefines the outputs to go to an example_output (normally default to an outputs folder which is git ignored in this repo.)\n",
    "global_config_path = \"../config/global_config.yaml\"\n",
    "global_config = load_global_config(global_config_path)\n",
    "global_config.output_paths.output_folder = \"../example_output\"\n",
    "\n",
    "\n",
    "default_config_path = \"../config/experimental_config.yaml\"\n",
    "experimental_config = load_experimental_config(default_config_path)\n",
    "experimental_config.outputs.experiment_name = \"example_pipeline_05_08_24\"\n",
    "experimental_folder = f\"{global_config.output_paths.output_folder}/{experimental_config.outputs.experiment_name}\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Privacy Fingerprint End-to-End Overview\n",
    "\n",
    "The Pipeline has been broken down into four components:\n",
    "1. **GenerateSynthea**: This generates a list of dictionary of synthetic patient records.\n",
    "2. **GenerateLLM**: This generates medical notes using the outputs created from **GenerateSynthea**.\n",
    "3. **Extraction**: This currently uses an LLM that is specialised to extract given entities from the synthetic medical notes produced by **GenerativeLLM**\n",
    "4. **StandardiseExtraction**: This standardises the results extracted from the medical text.\n",
    "5. **PrivacyRiskScorer**: This scores the uniqueness of standardised entity values extracted.\n",
    "6. **PrivacyRiskExplainer**: Takes in the predicted transformed values, and transformed dataset generater from the gaussian copula, and calculates shapley values. \n",
    "\n",
    "Additionally each class will also take a path for the input required to create their output. This allows the user to break-up the pipeline and run from specific points in the pipeline."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. GenerateSynthea: Generating Synthetic Patient Data using Synthea "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Synthea-international is an expansion of Synthea, which is an open-source synthetic patient generator that produces de-identified health records for synthetic patients.\n",
    "\n",
    "GenerateSynthea is a class used to run Synthea. You will need to follow the instructions on the README to ensure Synthea is installed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "experimental_config.synthea.path_output = f\"{experimental_folder}/synthea.json\"\n",
    "experimental_config.synthea.population_num = \"10\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "output_synthea = GenerateSynthea(\n",
    "    global_config=global_config, syntheaconfig=experimental_config.synthea\n",
    ").run_or_load()\n",
    "output_synthea"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. GenerateLLM: Generating Synthetic Patient Medical Notes "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This component uses a LLM model either hosted via Ollama to generate synthetic patient medical notes depending on the prompt template given below."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create Prompt Templates used in the Experimental Pipeline.\n",
    "\n",
    "This defines the template you want the generate component to use. In this example we use Llama2, and this is a prompt template that can be used to support this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Functions needed to create prompt templates and save them for the experiments.\n",
    "from src.config.prompt_template_handler import (\n",
    "    save_generate_template_to_json,\n",
    "    load_and_validate_generate_prompt_template,\n",
    ")\n",
    "\n",
    "# We are using llama 3.1 8B, which uses a llama 3 prompt template.\n",
    "# Defines the path of where llama3 template lives in the generate folder.\n",
    "# A llama2 template can also be used. However, you will need to change the experimental config.\n",
    "generate_template_path = (\n",
    "    f\"{global_config.output_paths.generate_template}llama3_template.json\"\n",
    ")\n",
    "\n",
    "# This defines a template used by LLama3\n",
    "generate_template = \"\"\"<|begin_of_text|>\n",
    "<|start_header_id|>system<|end_header_id|>\n",
    "You are a medical student answering an exam question about writing clinical notes for patients.\n",
    "<|eot_id|>\n",
    "\n",
    "<|start_header_id|>user<|end_header_id|>\n",
    "Keep in mind that your answer will be assessed based on incorporating all the provided information and the quality of prose.\n",
    "\n",
    "1. Use prose to write an example clinical note for this patient's doctor.\n",
    "2. Use less than three sentences.\n",
    "3. Do not provide recommendations.\n",
    "4. Use the following information:\n",
    "\n",
    "{data}\n",
    "<|eot_id|>\n",
    "\n",
    "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n",
    "\"\"\"\n",
    "\n",
    "# Saves the template to the path defined.\n",
    "save_generate_template_to_json(\n",
    "    template_str=generate_template, file_path=generate_template_path\n",
    ")\n",
    "\n",
    "# Loads the template so the user can inspect the template saved.\n",
    "loaded_generate_template = load_and_validate_generate_prompt_template(\n",
    "    filename=generate_template_path\n",
    ")\n",
    "print(loaded_generate_template)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This runs GenerateLLM using the synthea output from the previous run, and saves the LLM output to the given path_output_llm.\n",
    "\n",
    "You can set **verbose** to true or false depending on whether you want outputs to print to the screen on run. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "experimental_config.generate.synthea_path = (\n",
    "    experimental_config.synthea.path_output\n",
    ")\n",
    "experimental_config.generate.path_output = f\"{experimental_folder}/llm.json\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "output_llm = GenerateLLM(\n",
    "    global_config=global_config, generateconfig=experimental_config.generate\n",
    ").run_or_load()\n",
    "output_llm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Extraction: Re-extracting Entities from the Patient Medical Notes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This uses a package called GliNER to extract out entities from the synthetic medical notes.\n",
    "\n",
    "Changing inputs to the **experimental_config.extraction.entity_list** allows you to look for more entities"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "experimental_config.extraction.llm_path = (\n",
    "    experimental_config.generate.path_output\n",
    ")\n",
    "experimental_config.extraction.path_output = (\n",
    "    f\"{experimental_folder}/extraction.json\"\n",
    ")\n",
    "experimental_config.extraction.entity_list = [\n",
    "    \"nhs number\",\n",
    "    \"person\",\n",
    "    \"date of birth\",\n",
    "    \"diagnosis\",\n",
    "]\n",
    "experimental_config.extraction.server_model_type = \"gliner\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "output_extraction = Extraction(\n",
    "    global_config=global_config,\n",
    "    extractionconfig=experimental_config.extraction,\n",
    ").run_or_load()\n",
    "output_extraction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Visualising Entities using DisplaCy\n",
    "\n",
    "This visualises the entities in an example clinical note using DisplaCy.\n",
    "\n",
    "We format the extracted entities into a dictionary compatable with DisplaCy, and display the string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for string_id in ids:\n",
    "\n",
    "    ents_dict = {\n",
    "        \"text\": output_llm[string_id],\n",
    "        \"ents\": output_extraction[string_id][\"Entities\"],\n",
    "    }\n",
    "\n",
    "    displacy.render(ents_dict, manual=True, style=\"ent\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. StandardiseExtraction: Normalising Entities Extracted for Scoring"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This takes in the above List of Dictionary entities and begins to normalise the responses into a dataframe format.\n",
    "\n",
    "The standardisation process is broken down into many parts:\n",
    "1. Entities are extracted from the object created from **Extraction**, and a set of functions can be applied to clean them during this process.\n",
    "2. This creates a list of cleaned entities. Multiple entities can be extracted from the same person for a given entity type, for example diagnosis. Currently the codebase only takes the first entity given.\n",
    "3. Next the outputs are normalised i.e. Dates can be written in multiple formats but have the same meaning.\n",
    "4. Lastly the data is encoded and formatted as a numpy array ready for PyCorrectMatch\n",
    "\n",
    "In the src/config.py file:\n",
    "\n",
    "extra_preprocess_functions_per_entity defines how entities are cleaned while extracted from the extraction_output.\n",
    "\n",
    "```\n",
    "extra_preprocess_functions_per_entity = {\"person\": [clean_name.remove_titles]}\n",
    "```\n",
    "\n",
    "standardise_functions_per_entity defines how entities are extracted, and defines any normalisation process you may want on a column of entities.\n",
    "```\n",
    "standardise_functions_per_entity = {\n",
    "    \"person\": [extract_first_entity_from_list],\n",
    "    \"nhs number\": [extract_first_entity_from_list],\n",
    "    \"date of birth\": [\n",
    "        extract_first_entity_from_list,\n",
    "        normalise_columns.normalise_date_column,\n",
    "    ],\n",
    "    \"diagnosis\": [extract_first_entity_from_list],\n",
    "}\n",
    "```\n",
    "\n",
    "This uses the output_extraction value created by the **Extraction** class and saves the outputs of the normalisation process as a .csv to the given path."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "path_output_standardisation = f\"{experimental_folder}/standardisation.csv\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "output_standards = StandardiseExtraction(\n",
    "    extraction_input=output_extraction,\n",
    "    path_output=path_output_standardisation,\n",
    "    save_output=True,\n",
    ").run()\n",
    "output_standards"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This loads an extraction input from the extraction_path provided, and creates the output_standards."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "output_standards = StandardiseExtraction(\n",
    "    extraction_path=experimental_config.extraction.path_output,\n",
    "    path_output=path_output_standardisation,\n",
    "    save_output=False,\n",
    ").run()\n",
    "output_standards"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This loads a pre-saved output_standards from the given path provided."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "output_standards = StandardiseExtraction(\n",
    "    path_output=path_output_standardisation\n",
    ").load()\n",
    "output_standards"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. PrivacyRiskScorer: This scores the uniqueness of standardised entity values extracted."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "try:\n",
    "    scorer = PrivacyRiskScorer(\n",
    "        scorer_config=experimental_config.pycorrectmatch\n",
    "    )\n",
    "except ValueError as e:\n",
    "    print(e)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Here we fit the model, this has to happen first before calculating scores or transforming\n",
    "scorer.fit(output_standards)\n",
    "# This is the transformed dataset from the real record values to the marginal values\n",
    "transformed_dataset = scorer.map_records_to_copula(output_standards)\n",
    "N_FEATURES = output_standards.shape[1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. PrivacyRiskExplainer: Takes in the predicted transformed values, and transformed dataset generater from the gaussian copula, and calculates shapley values. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# SHAP takes a while to run - a progress bar appears when running SHAP\n",
    "try:\n",
    "    explainer = PrivacyRiskExplainer(\n",
    "        scorer.predict_transformed,\n",
    "        N_FEATURES,\n",
    "        explainer_config=experimental_config.explainers,\n",
    "    )\n",
    "except ValueError as e:\n",
    "    print(e)\n",
    "\n",
    "# Calculating shapley values using the transformed_dataset\n",
    "local_shapley_df, global_shap, exp_obj = explainer.explain(\n",
    "    transformed_dataset[[\"person\", \"nhs number\", \"date of birth\", \"diagnosis\"]]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "global_shap"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot the mean shap values - global explanation\n",
    "explainer.plot_global_explanation(exp_obj)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot the local shap values for a particular record\n",
    "explainer.plot_local_explanation(exp_obj, 49)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. PyCanon \n",
    "\n",
    "Pycanon asseses the values of common privacy measuring metrics, such as K-Anonymity, I-Diversity and t-Closeness. \n",
    "\n",
    "For more information on these metrics see `docs/pycanon/pycanon_and_privacy_metrics.md`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each entity used in extraction should be added to `config/experimental_config.yaml` under `pycanon` in order to be used for the following analysis. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [],
   "source": [
    "identifiers = experimental_config.pycanon.identifiers\n",
    "quasi_identifiers = experimental_config.pycanon.quasi_identifiers\n",
    "sensitive_attributes = experimental_config.pycanon.sensitive_attributes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"K-Anonymity: \", k_anonymity(output_standards, quasi_identifiers))\n",
    "print(\n",
    "    \"t-Closeness: \",\n",
    "    t_closeness(output_standards, quasi_identifiers, sensitive_attributes),\n",
    ")\n",
    "print(\n",
    "    \"l-Diversity: \",\n",
    "    l_diversity(output_standards, quasi_identifiers, sensitive_attributes),\n",
    ")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "privfp-experiments",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}