Update dataset (#18)
drew authored Dec 8, 2020
1 parent 73bbc25 commit 09d9c86
Showing 2 changed files with 29 additions and 14 deletions.
6 changes: 3 additions & 3 deletions gretel/gc-nlp_text_analysis/README.md
@@ -1,5 +1,5 @@
-# Work Safely with Sensitive Free Text Using Gretel
+# Work Safely with Free Text Using Gretel
 
-Using Gretel.ai's [NER and NLP features](https://gretel.ai/platform/data-catalog), we analyze and label chat logs looking for PII and other potentially sensitive information. After labeling the dataset, we build a transformation pipeline that will redact and replace any sensitive strings from chat messages.
+Using Gretel.ai's [NER and NLP features](https://gretel.ai/platform/data-catalog), we analyze and label a set of email dumps looking for PII and other potentially sensitive information. After labeling the dataset, we build a transformation pipeline that will redact and replace any sensitive strings from the email messages.
 
-At the end of the notebook we'll have a dataset that is safe to share without compromising a user's personal information.
+At the end of the notebook we'll have a dataset that is safe to share and analyze without compromising a user's personal information.
37 changes: 26 additions & 11 deletions gretel/gc-nlp_text_analysis/blueprint.ipynb
@@ -6,18 +6,18 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"!pip install -Uqq spacy gretel-client # we install spacy for their visualization helper, displacy"
+"!pip install -Uqq spacy gretel-client datasets # spacy for its visualization helper, displacy; datasets for loading the demo corpus"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"# Work Safely with Sensitive Free Text Using Gretel\n",
+"# Work Safely with Free Text Using Gretel\n",
 "\n",
-"Using Gretel.ai's [NER and NLP features](https://gretel.ai/platform/data-catalog), we analyze and label chat logs looking for PII and other potentially sensitive information. After labeling the dataset, we build a transformation pipeline that will redact and replace any sensitive strings from chat messages.\n",
+"Using Gretel.ai's [NER and NLP features](https://gretel.ai/platform/data-catalog), we analyze and label a set of email dumps looking for PII and other potentially sensitive information. After labeling the dataset, we build a transformation pipeline that will redact and replace any sensitive strings from the email messages.\n",
 "\n",
-"At the end of the notebook we'll have a dataset that is safe to share without compromising a user's personal information."
+"At the end of the notebook we'll have a dataset that is safe to share and analyze without compromising a user's personal information."
 ]
 },
 {
@@ -34,6 +34,7 @@
 "outputs": [],
 "source": [
 "import pandas as pd\n",
+"import datasets\n",
 "from gretel_client import get_cloud_client\n",
 "\n",
 "pd.set_option('max_colwidth', None)\n",
@@ -47,7 +48,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"client.install_packages()"
+"client.install_packages(version=\"dev\")"
 ]
 },
 {
@@ -56,7 +57,7 @@
 "source": [
 "## Load the dataset\n",
 "\n",
-"For this blueprint, we use a modified dataset from the Ubuntu Chat Corpus. It represents an archived set of IRC logs from Ubuntu's technical support channel. This data primarily contains free form text that we will pass through a NER pipeline for labeling and PII discovery."
+"Using Hugging Face's [datasets](https://github.com/huggingface/datasets) library, we load a dataset containing a dump of [Enron emails](https://huggingface.co/datasets/aeslc). This data contains unstructured emails that we will pass through a NER pipeline for labeling and PII discovery."
 ]
 },
 {
@@ -65,7 +66,8 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"source_df = pd.read_csv(\"https://gretel-public-website.s3.us-west-2.amazonaws.com/blueprints/nlp_text_analysis/chat_logs_sampled.csv\")"
+"source_dataset = datasets.load_dataset(\"aeslc\")\n",
+"source_df = pd.DataFrame(source_dataset[\"train\"]).sample(n=300, random_state=99)"
 ]
 },
 {
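The new cell draws a fixed 300-email sample with `random_state=99`, which makes the subset reproducible across runs. A self-contained sketch of that property, using a toy frame in place of the `aeslc` download (the real notebook fetches it over the network via `datasets.load_dataset("aeslc")`):

```python
import pandas as pd

# Stand-in for the aeslc "train" split; the column name matches
# the notebook, the contents are invented for illustration.
toy = pd.DataFrame({"email_body": [f"message {i}" for i in range(1000)]})

# A fixed random_state makes the 300-row draw deterministic:
sample_a = toy.sample(n=300, random_state=99)
sample_b = toy.sample(n=300, random_state=99)
assert sample_a.equals(sample_b)
```

Pinning the seed means every reader labels and transforms the same 300 emails, so downstream cells are comparable run to run.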
@@ -146,9 +148,9 @@
 "source": [
 "from gretel_helpers.spacy import display_entities\n",
 "\n",
-"TEXT_FIELD = \"text\"\n",
+"TEXT_FIELD = \"email_body\"\n",
 "\n",
-"for record in project.iter_records(direction=\"backward\", record_limit=100):\n",
+"for record in project.iter_records(direction=\"backward\", record_limit=5):\n",
 " display_entities(record, TEXT_FIELD)"
 ]
 },
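`display_entities` renders the NER labels Gretel attached to each record (via spaCy's displacy, per the import above). As a rough, dependency-free illustration of the same idea — inline annotation of labeled character spans — one might write the following; the `annotate` helper and its span format are hypothetical, not part of the Gretel or spaCy APIs:

```python
def annotate(text, spans):
    """Render labeled character spans inline, e.g. 'Jane Doe [PERSON]'.

    spans: list of (start, end, label) tuples, sorted and non-overlapping.
    """
    out, prev = [], 0
    for start, end, label in spans:
        out.append(text[prev:start])          # text before the entity
        out.append(f"{text[start:end]} [{label}]")  # entity plus its label
        prev = end
    out.append(text[prev:])                   # trailing text
    return "".join(out)

line = "Meet Jane Doe in Houston."
spans = [(5, 13, "PERSON"), (17, 24, "LOCATION")]
print(annotate(line, spans))  # Meet Jane Doe [PERSON] in Houston [LOCATION].
```

displacy does the same span-plus-label rendering, just as styled HTML instead of plain text.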
@@ -228,7 +230,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Inspect the transformed version of the dataset."
+"Inspect a transformed email from the dataset."
 ]
 },
 {
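Gretel's transformation pipeline performs the redact-and-replace step as a managed, NER-driven service; the snippet below is only a minimal stdlib stand-in for the concept, swapping a naive regex for real entity detection (the pattern and the `[EMAIL]` token are illustrative choices, not Gretel behavior):

```python
import re

# Crude pattern for things that look like email addresses; a real NER
# pipeline would catch names, phone numbers, and other PII as well.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact_emails(text: str) -> str:
    """Replace anything that looks like an email address with a token."""
    return EMAIL_RE.sub("[EMAIL]", text)

body = "Please reply to jeff.skilling@enron.com by Friday."
print(redact_emails(body))  # Please reply to [EMAIL] by Friday.
```

The transformed frame `xf_df` in the notebook is the result of applying this kind of substitution, per detected entity type, to every email body.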
@@ -237,7 +239,20 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"xf_df[[TEXT_FIELD]]"
+"from gretel_client.demo_helpers import show_record_diff\n",
+"\n",
+"\n",
+"# Lookup the comparison email by subject line.\n",
+"c_key = \"subject_line\"\n",
+"c_value = \"Confidentiality Agreement-Human Code\"\n",
+"\n",
+"# The comparison email contains multiple lines. For this\n",
+"# demonstration we only want to examine the first line\n",
+"# so we strip any extraneous newlines.\n",
+"orig = source_df[source_df[c_key] == c_value][TEXT_FIELD].iloc[0].split(\"\\n\")[0]\n",
+"xf = xf_df[xf_df[c_key] == c_value][TEXT_FIELD].iloc[0].split(\"\\n\")[0]\n",
+"\n",
+"show_record_diff({\"\": orig}, {\"\": xf})"
 ]
 },
 {
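`show_record_diff` compares an original record against its transformed counterpart. A rough stdlib analogue of that comparison, built on `difflib.ndiff` (the example strings are invented, and this is not the helper's actual implementation):

```python
import difflib

orig = "Per our conversation, call John Smith at 713-555-0101."
xf = "Per our conversation, call [PERSON] at [PHONE]."

# Word-level diff between the original and transformed line; "- " marks
# words only in the original, "+ " words only in the transformed copy.
diff = list(difflib.ndiff(orig.split(), xf.split()))
changed = [d for d in diff if d.startswith(("- ", "+ "))]
print(changed)
```

Seeing exactly which tokens were replaced is a quick sanity check that the pipeline redacted the PII and left the rest of the message intact.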
