diff --git a/gretel/gc-nlp_text_analysis/README.md b/gretel/gc-nlp_text_analysis/README.md new file mode 100644 index 00000000..b2982365 --- /dev/null +++ b/gretel/gc-nlp_text_analysis/README.md @@ -0,0 +1,5 @@ +# Work Safely with Sensitive Free Text Using Gretel + +Using Gretel.ai's [NER and NLP features](https://gretel.ai/platform/data-catalog), we analyze and label chat logs looking for PII and other potentially sensitive information. After labeling the dataset, we build a transformation pipeline that replaces sensitive strings in chat messages with fake values. + +At the end of the notebook we'll have a dataset that is safe to share without compromising a user's personal information. diff --git a/gretel/gc-nlp_text_analysis/blueprint.ipynb b/gretel/gc-nlp_text_analysis/blueprint.ipynb new file mode 100644 index 00000000..4e699263 --- /dev/null +++ b/gretel/gc-nlp_text_analysis/blueprint.ipynb @@ -0,0 +1,272 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install -Uqq spacy gretel-client # we install spacy for their visualization helper, displacy" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Work Safely with Sensitive Free Text Using Gretel\n", + "\n", + "Using Gretel.ai's [NER and NLP features](https://gretel.ai/platform/data-catalog), we analyze and label chat logs looking for PII and other potentially sensitive information. After labeling the dataset, we build a transformation pipeline that replaces sensitive strings in chat messages with fake values.\n", + "\n", + "At the end of the notebook we'll have a dataset that is safe to share without compromising a user's personal information." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setup" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "from gretel_client import get_cloud_client\n", + "\n", + "pd.set_option('max_colwidth', None)\n", + "\n", + "client = get_cloud_client(prefix=\"api\", api_key=\"prompt\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "client.install_packages()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Load the dataset\n", + "\n", + "For this blueprint, we use a modified dataset from the Ubuntu Chat Corpus. It represents an archived set of IRC logs from Ubuntu's technical support channel. This data primarily contains free-form text that we will pass through an NER pipeline for labeling and PII discovery." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "source_df = pd.read_csv(\"https://gretel-public-website.s3.us-west-2.amazonaws.com/blueprints/nlp_text_analysis/chat_logs_sampled.csv\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "source_df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Label the source text\n", + "\n", + "With the data loaded into the notebook, we now create a Gretel Project and upload the records to the project for labeling." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "project = client.get_project(create=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`detection_mode` configures which models the NER pipeline uses for labeling. With `detection_mode=\"all\"`, records are labeled using all available models." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "project.send_dataframe(source_df, detection_mode=\"all\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For extra credit, you can navigate to the project's console view to better inspect and visualize the source dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "project.get_console_url()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Inspect labeled data\n", + "\n", + "In this next cell, we download the labeled records and inspect each chat message to see what entities were detected. Gretel uses a combination of NLP models, regex, and custom heuristics to detect named entities in structured and unstructured data.\n", + "\n", + "For a list of entities that Gretel can detect, [click here](https://gretel.ai/gretel-cloud-faqs/what-types-of-entities-can-gretel-identify)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from gretel_helpers.spacy import display_entities\n", + "\n", + "TEXT_FIELD = \"text\"\n", + "\n", + "for record in project.iter_records(direction=\"backward\", record_limit=100):\n", + " display_entities(record, TEXT_FIELD)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Build a transformation pipeline\n", + "\n", + "After labeling the dataset, we've identified chats that contain PII, such as names and emails. The final step in this blueprint is to build a transformation pipeline that will replace names and other identifying information with fake representations of the data.\n", + "\n", + "We make a point to replace rather than redact sensitive information. 
This preservation ensures the dataset remains valuable for downstream use cases such as machine learning, where the structure and contents of the data are essential.\n", + "\n", + "To learn more about data transformation pipelines with Gretel, check our [website](https://gretel.ai/platform/transform) or [SDK documentation](https://gretel-client.readthedocs.io/en/stable/transformers/api_ref.html)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import uuid\n", + "\n", + "from gretel_client.transformers import DataPath, DataTransformPipeline\n", + "from gretel_client.transformers import FakeConstantConfig\n", + "\n", + "SEED = uuid.uuid1().int" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Configure the pipeline. `FakeConstantConfig` will replace any entities configured under `labels` with a fake version of the entity." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "fake_xf = FakeConstantConfig(seed=SEED, labels=[\"person_name\", \"email_address\", \"phone_number\"])\n", + "\n", + "paths = [\n", + " DataPath(input=TEXT_FIELD, xforms=[fake_xf]),\n", + " DataPath(input=\"*\"),\n", + "]\n", + "\n", + "pipeline = DataTransformPipeline(paths)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Run the pipeline to replace any sensitive strings with fake values." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "xf_records = [\n", + " pipeline.transform_record(record)[\"record\"]\n", + " for record in \n", + " project.iter_records(direction=\"backward\")\n", + "]\n", + "\n", + "xf_df = pd.DataFrame(xf_records)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Inspect the transformed version of the dataset." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "xf_df[[TEXT_FIELD]]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that you've completed this notebook, you've seen how it's possible to take a corpus of free text, label it using Gretel's NER pipeline, and safely anonymize the dataset while retaining its utility." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.9" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/gretel/gc-nlp_text_analysis/manifest.json b/gretel/gc-nlp_text_analysis/manifest.json new file mode 100644 index 00000000..4d51f5fd --- /dev/null +++ b/gretel/gc-nlp_text_analysis/manifest.json @@ -0,0 +1,7 @@ +{ + "name": "Work Safely with Sensitive Free Text Using Gretel", + "description": "Label and anonymize free text chat logs using Gretel NER and NLP pipelines.", + "tags": ["ner", "nlp", "transformers"], + "language": "python", + "featured": false +}