Add blueprint for analyzing free text using Gretel.ai #17
Merged
Changes from 6 commits

Commits (9):
2241d9a  wip (drew)
dc266b3  add output (drew)
18af489  manifiest + sample data (drew)
b5f46df  fix mainifest (drew)
7932e0d  misc changes (drew)
f31be79  remove dev identifier (drew)
d104e6d  Update README.md (drew)
814a1f3  Update manifest.json (drew)
279c5ef  update nb title (drew)
Diff view
@@ -0,0 +1,5 @@
# Working Safely with Sensitive Free Text Using Gretel Cloud and NLP

Using Gretel.ai's [NER and NLP features](https://gretel.ai/platform/data-catalog), we analyze and label chat logs looking for PII and other potentially sensitive information. After labeling the dataset, we build a transformation pipeline that will redact and replace any sensitive strings from chat messages.

At the end of the notebook we'll have a dataset that is safe to share without compromising a user's personal information.
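The notebook added in this PR implements the flow the README describes end to end. As a quick orientation, here is a condensed, hedged sketch of that label-then-transform loop, using only the `gretel_client` calls that appear in the notebook diff below; the `text` field name and entity labels come from the notebook, while the local CSV path and fixed seed are illustrative stand-ins.

```python
# Condensed sketch of the blueprint's label-then-transform flow (see the notebook diff below).
import pandas as pd

from gretel_client import get_cloud_client
from gretel_client.transformers import DataPath, DataTransformPipeline, FakeConstantConfig

client = get_cloud_client(prefix="api", api_key="prompt")  # prompts for a Gretel API key
project = client.get_project(create=True)

# Label: upload the chat logs so Gretel's NER pipeline can tag PII entities.
source_df = pd.read_csv("chat_logs_sampled.csv")  # illustrative local copy of the sample data
project.send_dataframe(source_df, detection_mode="all")

# Transform: replace detected entities with fake values rather than redacting them outright.
fake_xf = FakeConstantConfig(seed=1234, labels=["person_name", "email_address", "phone_number"])
pipeline = DataTransformPipeline([
    DataPath(input="text", xforms=[fake_xf]),
    DataPath(input="*"),
])

xf_records = [
    pipeline.transform_record(record)["record"]
    for record in project.iter_records(direction="backward")
]
anonymized_df = pd.DataFrame(xf_records)
```

Replacing entities with fake values rather than blanking them out is what keeps the output usable for downstream analysis.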
@@ -0,0 +1,272 @@
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -Uqq spacy gretel-client # we install spacy for their visualization helper, displacy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Working Safely with Sensitive Free Text Using Gretel.ai and NLP\n",
    "\n",
"Using Gretel.ai's [NER and NLP features](https://gretel.ai/platform/data-cataloghttps://gretel.ai/platform/data-catalog), we analyze and label chat logs looking for PII and other potentially sensitive information. After labeling the dataset, we build a transformation pipeline that will redact and replace any sensitive strings from chat messages.\n", | ||
"\n", | ||
"At the end of the notebook we'll have a dataset that is safe to share without compromising a user's personal information." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Setup" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import pandas as pd\n", | ||
"from gretel_client import get_cloud_client\n", | ||
"\n", | ||
"pd.set_option('max_colwidth', None)\n", | ||
"\n", | ||
"client = get_cloud_client(prefix=\"api\", api_key=\"prompt\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"client.install_packages()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Load the dataset\n", | ||
"\n", | ||
"For this blueprint, we use a modified dataset from the Ubuntu Chat Corpus. It represents an archived set of IRC logs from Ubuntu's technical support channel. This data primarily contains free form text that we will pass through a NER pipeline for labeling and PII discovery." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"source_df = pd.read_csv(\"https://gretel-public-website.s3.us-west-2.amazonaws.com/blueprints/nlp_text_analysis/chat_logs_sampled.csv\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"source_df.head()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Label the source text\n", | ||
"\n", | ||
"With the data loaded into the notebook, we now create a Gretel Project, and upload the records to the project for labeling." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"project = client.get_project(create=True)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"`detection_mode` configures what models the NER pipeline uses for labeling. Using `detection_mode=all` we configure records to be labeled using all available models." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"project.send_dataframe(source_df, detection_mode=\"all\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"For extra credit, you can navigate to the project's console view to better inspect and visualize the source dataset." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"project.get_console_url()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Inspect labeled data\n", | ||
"\n", | ||
"In this next cell, we download the labeled records and inspect each chat message to see what entities were detected. Gretel uses a combination of NLP models, regex, and custom heuristics to detect named entities in structured and unstructured data.\n", | ||
"\n", | ||
"For a list of entities that Gretel can detect, [click here](https://gretel.ai/gretel-cloud-faqs/what-types-of-entities-can-gretel-identify)." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from gretel_helpers.spacy import display_entities\n", | ||
"\n", | ||
"TEXT_FIELD = \"text\"\n", | ||
"\n", | ||
"for record in project.iter_records(direction=\"backward\", record_limit=100):\n", | ||
" display_entities(record, TEXT_FIELD)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Build a transformation pipeline\n", | ||
"\n", | ||
"After labeling the dataset, we've identified chats that contain PII, such as names and emails. The final step in this blueprint is to build a transformation pipeline that will replace names and other identifying information with fake representations of the data.\n", | ||
"\n", | ||
"We make a point to replace rather than redact sensitive information. This preservation ensures the dataset remains valuable for downstream use cases such as machine learning, where the structure and contents of the data are essential.\n", | ||
"\n", | ||
"To learn more about data transformation pipelines with Gretel, check our [website](https://gretel.ai/platform/transform) or [SDK documentation](https://gretel-client.readthedocs.io/en/stable/transformers/api_ref.html)." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import uuid\n", | ||
"\n", | ||
"from gretel_client.transformers import DataPath, DataTransformPipeline\n", | ||
"from gretel_client.transformers import FakeConstantConfig\n", | ||
"\n", | ||
"SEED = uuid.uuid1().int" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Configure the pipeline. `FakeConstantConfig` will replace any entities configured under `labels` with a fake version of the entity." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"fake_xf = FakeConstantConfig(seed=SEED, labels=[\"person_name\", \"email_address\", \"phone_number\"])\n", | ||
"\n", | ||
"paths = [\n", | ||
" DataPath(input=TEXT_FIELD, xforms=[fake_xf]),\n", | ||
" DataPath(input=\"*\"),\n", | ||
"]\n", | ||
"\n", | ||
"pipeline = DataTransformPipeline(paths)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Run the pipeline to redact any sensitive strings" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"xf_records = [\n", | ||
" pipeline.transform_record(record)[\"record\"]\n", | ||
" for record in \n", | ||
" project.iter_records(direction=\"backward\")\n", | ||
"]\n", | ||
"\n", | ||
"xf_df = pd.DataFrame(xf_records)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Inspect the transformed version of the dataset." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"xf_df[[TEXT_FIELD]]" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Now that you've completed this notebook, you've seen how it's possible to take a corpus of free text, label it using Gretel's NER pipeline, and safely anonymize the dataset while retaining its utility." | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.7.9" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 4 | ||
} |
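One natural follow-up the notebook stops short of is persisting the anonymized records and spot-checking them before sharing. A minimal, hedged sketch, assuming the `xf_df` dataframe and `TEXT_FIELD` constant defined in the notebook above; the output filename is illustrative:

```python
# Assumes xf_df and TEXT_FIELD from the notebook above; the filename is illustrative.
xf_df.to_csv("chat_logs_anonymized.csv", index=False)

# Spot-check a sample of transformed messages before sharing the file.
print(xf_df[TEXT_FIELD].head(10).to_string())
```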
@@ -0,0 +1,7 @@
{
  "name": "Working Safely with Sensitive Free Text Using Gretel.ai and NLP",
  "description": "Label and anonymize free text chat logs.",
  "tags": ["ner", "nlp", "transformers"],
  "language": "python",
  "featured": false
}
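The manifest above registers the blueprint's name, description, tags, and language. A small, hedged sanity check for a file like this, where the required-key set is inferred from this manifest rather than from any documented blueprint schema:

```python
import json

# Keys inferred from the manifest in this PR, not from a documented schema.
REQUIRED_KEYS = {"name", "description", "tags", "language", "featured"}

with open("manifest.json") as fh:
    manifest = json.load(fh)

missing = REQUIRED_KEYS - manifest.keys()
if missing:
    raise ValueError(f"manifest.json is missing keys: {sorted(missing)}")
print("manifest.json looks well-formed:", manifest["name"])
```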
Review comment: Suggesting the change "Working" -> "Work" to follow the semantic language of the other blueprints, which use primarily verbs.
Reply: Fixed. I also dropped the NLP callout and moved it to the description.