Add PII Replay Notebook (#482)

* Add notebook * Add Colab button
gretelai · Nov 22, 2024 · 4c0467d · 4c0467d
1 parent 7a8267b
commit 4c0467d
Showing 1 changed file with 307 additions and 0 deletions.
diff --git a/docs/notebooks/evaluate/evaluate_with_pii_replay.ipynb b/docs/notebooks/evaluate/evaluate_with_pii_replay.ipynb
@@ -0,0 +1,307 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<a target=\"_blank\" href=\"https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/docs/notebooks/evaluate/evaluate_with_pii_replay.ipynb\">\n",
+    "  <img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/>\n",
+    "</a>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "wC0gdO0MBB6Z"
+   },
+   "source": [
+    "# PII Replay Notebook"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "gEshiwnUBB6b"
+   },
+   "source": [
+    "## 💾 Install Gretel SDK"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "kNjlJmekBB6c"
+   },
+   "outputs": [],
+   "source": [
+    "%%capture\n",
+    "!pip install -U gretel-client"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "yEaV4lufBB6c"
+   },
+   "source": [
+    "## 🌐 Configure your Gretel Session"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "vr2AuZdB0lpu"
+   },
+   "outputs": [],
+   "source": [
+    "from gretel_client import Gretel\n",
+    "\n",
+    "gretel = Gretel(api_key=\"prompt\", validate=True, project_name=\"pii-replay-project\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "38PJc6Q5BB6d"
+   },
+   "source": [
+    "## 🔬 Preview input data\n",
+    "Dataset is taken from https://www.kaggle.com/datasets/ravindrasinghrana/employeedataset"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "h-VME1ijBB6d"
+   },
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "datasource = \"https://gretel-public-website.s3.us-west-2.amazonaws.com/datasets/kaggle/employee_data.csv\"\n",
+    "df = pd.read_csv(datasource)\n",
+    "test_df = None\n",
+    "\n",
+    "# Drop columns to simplify example\n",
+    "df = df.drop(columns=[\"Supervisor\", \"BusinessUnit\", \"EmployeeType\", \"PayZone\", \"EmployeeClassificationType\", \"TerminationType\", \"TerminationDescription\", \"DepartmentType\", \"JobFunctionDescription\", \"DOB\", \"LocationCode\", \"RaceDesc\", \"MaritalDesc\"])\n",
+    "\n",
+    "df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "Kvir2xgr_vp9"
+   },
+   "source": [
+    "## ✂ Split train and test\n",
+    "In order to run [Membership Inference Protection](https://docs.gretel.ai/optimize-synthetic-data/evaluate/synthetic-data-quality-report#membership-inference-protection) in Evaluate, we separate out test_df separately from df:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "5Iaxcqxr0lpw"
+   },
+   "outputs": [],
+   "source": [
+    "# Shuffle the dataset randomly to ensure a random test set\n",
+    "# Set random_state to ensure reproducibility\n",
+    "shuffled_df = df.sample(frac=1, random_state=42).reset_index(drop=True)\n",
+    "\n",
+    "# Split into test (5% holdout) and train\n",
+    "split_index = int(len(shuffled_df) * 0.05)\n",
+    "test_df = shuffled_df.iloc[:split_index]\n",
+    "train_df = shuffled_df.iloc[split_index:]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "spU-n98d0lpw"
+   },
+   "source": [
+    "## 🏋️‍♂️ Train a generative model\n",
+    "\n",
+    "- The [navigator-ft](https://github.com/gretelai/gretel-blueprints/blob/main/config_templates/gretel/synthetics/navigator-ft.yml) base config tells Gretel we want to train with **Navigator Fine Tuning** using its default parameters.\n",
+    "\n",
+    "- **Navigator Fine Tuning** is an LLM under the hood. Before training begins, information about how the input data was tokenized and assembled into examples will be logged in the cell output (as well as in Gretel's Console).\n",
+    "\n",
+    "- Generation of a dataset for evaluation will begin immediately after the model completes training. The rate at which the model produces valid records will be logged to help assess how well the model is performing."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "SeiHsaq_0lpw"
+   },
+   "outputs": [],
+   "source": [
+    "nav_ft_trained = gretel.submit_train(\"navigator-ft\", data_source=train_df, evaluate={\"skip\": True}, generate={\"num_records\": 1000})\n",
+    "nav_ft_result = nav_ft_trained.fetch_report_synthetic_data()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "Rg8O12Ae0lpx"
+   },
+   "source": [
+    "## 󠁘🟰 Evaluate PII Replay for Model result without Transform"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "crzAGFdw0lpx"
+   },
+   "outputs": [],
+   "source": [
+    "EVALUATE_CONFIG = \"\"\"\n",
+    "schema_version: \"1.0\"\n",
+    "\n",
+    "name: \"evaluate-config\"\n",
+    "models:\n",
+    "  - evaluate:\n",
+    "      data_source: \"__tmp__\"\n",
+    "      pii_replay:\n",
+    "        skip: false\n",
+    "        entities: [\"first_name\",\"last_name\",\"email\",\"state\"]\n",
+    "\"\"\"\n",
+    "evaluate_report = gretel.submit_evaluate(EVALUATE_CONFIG, data_source=nav_ft_result, ref_data=train_df, test_data=test_df).evaluate_report\n",
+    "evaluate_report.display_in_notebook()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "VUjeSIx_BB6d"
+   },
+   "source": [
+    "## 󠁘🔀 Define Transform Configuration and Train Transform Model\n",
+    "\n",
+    "- The training data is passed in using the `data_source` argument. Its type can be a file path or `DataFrame`.\n",
+    "\n",
+    "- **Tip:** Click the printed Console URL to monitor your job's progress in the Gretel Console."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "yn0Z0Z2qBB6e"
+   },
+   "outputs": [],
+   "source": [
+    "TRANSFORM_CONFIG = \"\"\"\n",
+    "schema_version: \"1.0\"\n",
+    "name: transform-config\n",
+    "models:\n",
+    "  - transform_v2:\n",
+    "      globals:\n",
+    "        locales:\n",
+    "          - en_US\n",
+    "        classify:\n",
+    "          enable: true\n",
+    "          entities:\n",
+    "            - first_name\n",
+    "            - last_name\n",
+    "            - email\n",
+    "            - state\n",
+    "          auto_add_entities: true\n",
+    "          num_samples: 3\n",
+    "      steps:\n",
+    "        - rows:\n",
+    "            update:\n",
+    "              - name: FirstName\n",
+    "                value: fake.first_name_male() if row['GenderCode'] == 'Male' else\n",
+    "                  fake.first_name_female()\n",
+    "              - name: LastName\n",
+    "                value: fake.last_name()\n",
+    "              - name: ADEmail\n",
+    "                value: row[\"FirstName\"] + \".\" + row[\"LastName\"] + \"@bilearner.com\"\n",
+    "              - name: State\n",
+    "                value: fake.state_abbr()\n",
+    "\"\"\"\n",
+    "transform_result = gretel.submit_transform(TRANSFORM_CONFIG, data_source=train_df).transformed_df"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "wVCzIKcL0lpx"
+   },
+   "source": [
+    "## 🏋️‍♂️ Train a generative model"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "_BCXWOoX0lpx"
+   },
+   "outputs": [],
+   "source": [
+    "tr_nav_ft_trained = gretel.submit_train(\"navigator-ft\", data_source=transform_result, evaluate={\"skip\": True}, generate={\"num_records\": 1000})\n",
+    "tr_nav_ft_result = tr_nav_ft_trained.fetch_report_synthetic_data()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "oKb3HNuCBB6e"
+   },
+   "source": [
+    "## 󠁘🟰 Evaluate PII Replay for Transform + Model result\n",
+    "In general, we expect that running Transform prior to Synthetics should decrease PII replay. We can see this by comparing the results below to the results running Synthetics without Transform earlier in the notebook. Note that given the stochastic nature of the algorithm, this could vary each time.\n",
+    "\n",
+    "Note that there are many cases where we should not necessarily expect (or often even want) PII Replay of 0 across the board, even when running Transform first.\n",
+    "\n",
+    "You should consider each column in context, both of the data and the real world. In general, you should expect entities that are rarer, like full address or full name, to have lower amounts of PII replay than entities that are more common, like first name or US state."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "e0thLoOS0lpx"
+   },
+   "outputs": [],
+   "source": [
+    "evaluate_report = gretel.submit_evaluate(EVALUATE_CONFIG, data_source=tr_nav_ft_result, ref_data=train_df, test_data=test_df).evaluate_report\n",
+    "evaluate_report.display_in_notebook()"
+   ]
+  }
+ ],
+ "metadata": {
+  "colab": {
+   "provenance": []
+  },
+  "kernelspec": {
+   "display_name": "Python 3 with Fil",
+   "language": "python",
+   "name": "filprofile"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.10"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}