Add blueprint for analyzing free text using Gretel.ai #17

Merged: 9 commits, Dec 3, 2020
Changes from 6 commits
5 changes: 5 additions & 0 deletions gretel/gc-nlp_text_analysis/README.md
@@ -0,0 +1,5 @@
# Working Safely with Sensitive Free Text Using Gretel Cloud and NLP

Using Gretel.ai's [NER and NLP features](https://gretel.ai/platform/data-catalog), we analyze and label chat logs, looking for PII and other potentially sensitive information. After labeling the dataset, we build a transformation pipeline that replaces any sensitive strings in the chat messages with fake values.

At the end of the notebook we'll have a dataset that is safe to share without compromising a user's personal information.
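
The replace step described above can be sketched without the Gretel SDK. This is only an illustration of the idea, not Gretel's implementation: it assumes each message already carries labeled entity spans as `(start, end, label)` character offsets, and the `FAKES` table, `fake_for`, and `replace_entities` are hypothetical names introduced here. The notebook itself uses `FakeConstantConfig` from `gretel_client.transformers`.

```python
import hashlib

# Hypothetical pools of stand-in values, keyed by entity label.
FAKES = {
    "person_name": ["Alex Doe", "Sam Roe", "Kim Poe"],
    "email_address": ["user1@example.com", "user2@example.com"],
}


def fake_for(label, value, seed):
    """Pick a stand-in deterministically, so the same value and seed
    always map to the same fake."""
    pool = FAKES[label]
    digest = hashlib.sha256(f"{seed}:{label}:{value}".encode()).digest()
    return pool[int.from_bytes(digest[:4], "big") % len(pool)]


def replace_entities(text, entities, seed):
    """Replace each labeled (start, end, label) span with a fake value.

    Spans are processed right to left so that earlier offsets stay valid
    while the text changes length. Labels without a pool are left as-is.
    """
    for start, end, label in sorted(entities, reverse=True):
        if label in FAKES:
            text = text[:start] + fake_for(label, text[start:end], seed) + text[end:]
    return text


message = "ping bob@corp.example if Bob Smith is away"
spans = [(5, 21, "email_address"), (25, 34, "person_name")]
safe = replace_entities(message, spans, seed=42)
```

Because the fake is derived from the seed and the original value, repeated mentions of the same person map to the same stand-in, which keeps conversations readable after anonymization.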
272 changes: 272 additions & 0 deletions gretel/gc-nlp_text_analysis/blueprint.ipynb
@@ -0,0 +1,272 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install -Uqq spacy gretel-client # spacy is installed for its visualization helper, displacy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Working Safely with Sensitive Free Text Using Gretel.ai and NLP\n",
"\n",
"Using Gretel.ai's [NER and NLP features](https://gretel.ai/platform/data-catalog), we analyze and label chat logs, looking for PII and other potentially sensitive information. After labeling the dataset, we build a transformation pipeline that replaces any sensitive strings in the chat messages with fake values.\n",
"\n",
"At the end of the notebook we'll have a dataset that is safe to share without compromising a user's personal information."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from gretel_client import get_cloud_client\n",
"\n",
"pd.set_option('display.max_colwidth', None)\n",
"\n",
"client = get_cloud_client(prefix=\"api\", api_key=\"prompt\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"client.install_packages()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the dataset\n",
"\n",
"For this blueprint, we use a modified dataset from the Ubuntu Chat Corpus, an archived set of IRC logs from Ubuntu's technical support channel. The data consists primarily of free-form text, which we pass through an NER pipeline for labeling and PII discovery."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"source_df = pd.read_csv(\"https://gretel-public-website.s3.us-west-2.amazonaws.com/blueprints/nlp_text_analysis/chat_logs_sampled.csv\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"source_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Label the source text\n",
"\n",
"With the data loaded into the notebook, we now create a Gretel project and upload the records to it for labeling."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"project = client.get_project(create=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`detection_mode` selects which models the NER pipeline uses for labeling. Setting `detection_mode=\"all\"` labels the records with all available models."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"project.send_dataframe(source_df, detection_mode=\"all\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For extra credit, you can navigate to the project's console view to better inspect and visualize the source dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"project.get_console_url()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Inspect labeled data\n",
"\n",
"In this next cell, we download the labeled records and inspect each chat message to see what entities were detected. Gretel uses a combination of NLP models, regex, and custom heuristics to detect named entities in structured and unstructured data.\n",
"\n",
"For a list of entities that Gretel can detect, [click here](https://gretel.ai/gretel-cloud-faqs/what-types-of-entities-can-gretel-identify)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from gretel_helpers.spacy import display_entities\n",
"\n",
"TEXT_FIELD = \"text\"\n",
"\n",
"for record in project.iter_records(direction=\"backward\", record_limit=100):\n",
"    display_entities(record, TEXT_FIELD)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Build a transformation pipeline\n",
"\n",
"After labeling the dataset, we've identified chats that contain PII, such as names and emails. The final step in this blueprint is to build a transformation pipeline that will replace names and other identifying information with fake representations of the data.\n",
"\n",
"We make a point of replacing sensitive information rather than redacting it. Replacement keeps each message's structure intact, so the dataset remains valuable for downstream use cases such as machine learning, where the structure and contents of the data are essential.\n",
"\n",
"To learn more about data transformation pipelines with Gretel, check our [website](https://gretel.ai/platform/transform) or [SDK documentation](https://gretel-client.readthedocs.io/en/stable/transformers/api_ref.html)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import uuid\n",
"\n",
"from gretel_client.transformers import DataPath, DataTransformPipeline\n",
"from gretel_client.transformers import FakeConstantConfig\n",
"\n",
"SEED = uuid.uuid1().int"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Configure the pipeline. `FakeConstantConfig` will replace any entities configured under `labels` with a fake version of the entity."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fake_xf = FakeConstantConfig(seed=SEED, labels=[\"person_name\", \"email_address\", \"phone_number\"])\n",
"\n",
"paths = [\n",
"    DataPath(input=TEXT_FIELD, xforms=[fake_xf]),\n",
"    DataPath(input=\"*\"),\n",
"]\n",
"\n",
"pipeline = DataTransformPipeline(paths)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run the pipeline to replace any sensitive strings with fake values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"xf_records = [\n",
"    pipeline.transform_record(record)[\"record\"]\n",
"    for record in project.iter_records(direction=\"backward\")\n",
"]\n",
"\n",
"xf_df = pd.DataFrame(xf_records)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Inspect the transformed version of the dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"xf_df[[TEXT_FIELD]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that you've completed this notebook, you've seen how it's possible to take a corpus of free text, label it using Gretel's NER pipeline, and safely anonymize the dataset while retaining its utility."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.9"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
7 changes: 7 additions & 0 deletions gretel/gc-nlp_text_analysis/manifest.json
@@ -0,0 +1,7 @@
{
"name": "Working Safely with Sensitive Free Text Using Gretel.ai and NLP",
Contributor

Suggested change
- "name": "Working Safely with Sensitive Free Text Using Gretel.ai and NLP",
+ "name": "Work Safely with Sensitive Free Text Using Gretel and NLP",

Suggesting the change "Working" -> "Work" to match the naming of the other blueprints, whose titles mostly start with a verb.

Contributor Author

Fixed. I also dropped the NLP callout and moved it to the description.

Work Safely with Sensitive Free Text Using Gretel

"description": "Label and anonymize free text chat logs.",
"tags": ["ner", "nlp", "transformers"],
"language": "python",
"featured": false
}