From 3e6ec09b549283b568d4a90e3b557053f055a932 Mon Sep 17 00:00:00 2001 From: Vadim Lyakhovich Date: Tue, 28 Jun 2022 15:49:42 -0700 Subject: [PATCH 1/6] Added link to the blog --- ingest_data/sagemaker-keyspaces/README.md | 24 +++++++++++++---------- 1 file changed, 14 insertions(+), 10 deletions(-) diff --git a/ingest_data/sagemaker-keyspaces/README.md b/ingest_data/sagemaker-keyspaces/README.md index adbfacd458..f04c363bc4 100644 --- a/ingest_data/sagemaker-keyspaces/README.md +++ b/ingest_data/sagemaker-keyspaces/README.md @@ -1,24 +1,28 @@ # Train Machine Learning Models using Amazon Keyspaces as a Data Source -In this notebook we will provide step-by-step instruction to use SageMaker to ingest customer data from Amazon Keyspaces and train a clustering model that allowed you to segment customers. You could use this information for targeted marketing, greatly improving your business KPI. +Please read [Train machine learning models using Amazon Keyspaces as a data source](https://aws.amazon.com/blogs/machine-learning/train-machine-learning-models-using-amazon-keyspaces-as-a-data-source/) blog for more detailed instructions to run this notebook. + + +We will provide step-by-step instructions to use SageMaker to ingest customer data from Amazon Keyspaces and train a clustering model that would enable you to segment customers. This information can be used for targeted marketing, greatly improving your business KPI. 1. First, we install Sigv4 driver to connect to Amazon Keyspaces > The Amazon Keyspaces SigV4 authentication plugin for Cassandra client drivers enables you to authenticate calls to Amazon Keyspaces ***using IAM access keys instead of user name and password***. 
To learn more about how the Amazon Keyspaces SigV4 plugin enables [`IAM users, roles, and federated identities`](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html) to authenticate in Amazon Keyspaces API requests, see [`AWS Signature Version 4 process (SigV4)`](https://docs.aws.amazon.com/general/latest/gr/signature-version-4.html) -2. Next, we establish a connection to Amazon Keyspaces -3. Next, we create new Keyspace ***blog_(yyyymmdd)*** and a new table ***online_retail*** -3. Next, we will download retail data about customers. -3. Next, we will ingest retail data about customers into Keyspaces. -3. Next, we use a notebook available within SageMaker Studio to collect data from Keyspaces database, and prepare data for training using KNN Algorithm. Most of our customers use SageMaker Studio for end to end development of ML Use Cases. They could use this notebook as a base and customize it quickly for their use case. Additionally, they will be able to share this with other collaborators without requiring them to install any additional software. -3. Next, we will train the data for clustering. -3. After the training is complete, we can view the mapping between customer and their associated cluster. -3. And finally, Cleanup Step to drop Keyspaces table to avoid future charges. +2. Next, we establish a connection to Amazon Keyspaces +3. Next, we create new Keyspace ***blog_(yyyymmdd)*** and a new table ***online_retail*** +3. Next, we download retail data about customers. +3. Next, we ingest retail data about customers into Keyspaces. +3. Next, we use a notebook available within SageMaker Studio to collect data from the Keyspaces database, and prepare data for training using KNN Algorithm. Most of our customers use SageMaker Studio for end-to-end development of ML Use Cases. They use this notebook as a starting point and customize it for their use case. 
Also, they are able to share this with other collaborators without requiring them to install any additional software. +3. Next, we train the data for clustering. +3. When the training is completed, we can view the mapping between customers and their associated clusters. +3. And finally, we run a Cleanup Step to drop Keyspaces table to avoid future charges. -Contributers +Contributors - `Vadim Lyakhovich (AWS)` - `Ram Pathangi (AWS)` - `Parth Patel (AWS)` +- `Arvind Jain (AWS)` ### Note The notebook execution role must include permissions to access Amazon Keyspaces and Assume the role. From c4eee6de65dfb7225593be8198adbd734e759610 Mon Sep 17 00:00:00 2001 From: Vadim Lyakhovich Date: Tue, 28 Jun 2022 15:55:20 -0700 Subject: [PATCH 2/6] Improved message and provided link to the blog --- ingest_data/sagemaker-keyspaces/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ingest_data/sagemaker-keyspaces/README.md b/ingest_data/sagemaker-keyspaces/README.md index f04c363bc4..c25ed9d07a 100644 --- a/ingest_data/sagemaker-keyspaces/README.md +++ b/ingest_data/sagemaker-keyspaces/README.md @@ -1,4 +1,4 @@ -# Train Machine Learning Models using Amazon Keyspaces as a Data Source +# Train Machine Learning Models using Amazon Keyspaces as a Data Source Please read [Train machine learning models using Amazon Keyspaces as a data source](https://aws.amazon.com/blogs/machine-learning/train-machine-learning-models-using-amazon-keyspaces-as-a-data-source/) blog for more detailed instructions to run this notebook. 
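The segmentation workflow the README steps above describe (ingest customer data, prepare features, train a clustering model) can be illustrated with a short, self-contained sketch. This is not the notebook's exact code: the toy RFM values, column names, and customer IDs are invented for illustration, and clustering is done with scikit-learn's KMeans after min-max scaling, matching the approach used later in the notebook.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Toy recency/frequency/monetary features (illustrative values only)
rfm = pd.DataFrame(
    {
        "recency": [2, 45, 10, 90, 5, 60],          # days since last purchase
        "frequency": [30, 3, 12, 1, 25, 2],         # number of purchases
        "monetary": [1500, 80, 400, 20, 1200, 50],  # total spend
    },
    index=["c1", "c2", "c3", "c4", "c5", "c6"],
)

# Scale each feature to [0, 1] so no single feature dominates the distance metric
scaled = MinMaxScaler().fit_transform(rfm)

# Group customers into segments; n_clusters is a tunable choice
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)
segments = pd.Series(kmeans.labels_, index=rfm.index, name="segment")
print(segments.to_dict())
```

High-value customers (recent, frequent, high spend) land in one segment and lapsed customers in another; the integer segment labels themselves are arbitrary and would typically be mapped to business-friendly names afterwards.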
From 52497a767d243cbe1b0d77cdb38efbabf5ab2352 Mon Sep 17 00:00:00 2001 From: Vadim Lyakhovich Date: Tue, 28 Jun 2022 16:08:13 -0700 Subject: [PATCH 3/6] Add optinal cells to the notebook from the blog --- .../SageMaker_Keyspaces_ml_example.ipynb | 106 ++++++++++++------ 1 file changed, 72 insertions(+), 34 deletions(-) diff --git a/ingest_data/sagemaker-keyspaces/SageMaker_Keyspaces_ml_example.ipynb b/ingest_data/sagemaker-keyspaces/SageMaker_Keyspaces_ml_example.ipynb index 71885ca5f5..a43401ed5e 100644 --- a/ingest_data/sagemaker-keyspaces/SageMaker_Keyspaces_ml_example.ipynb +++ b/ingest_data/sagemaker-keyspaces/SageMaker_Keyspaces_ml_example.ipynb @@ -2,15 +2,15 @@ "cells": [ { "cell_type": "markdown", - "id": "bf23f259", "metadata": {}, "source": [ - "# Train Machine Learning Models using Amazon Keyspaces as a Data Source\n", + "## Train Machine Learning Models using Amazon Keyspaces as a Data Source. \n", "\n", "Contributors\n", "- `Vadim Lyakhovich (AWS)`\n", "- `Ram Pathangi (AWS)`\n", "- `Parth Patel (AWS)`\n", + "- `Arvind Jain (AWS)`\n", "\n", "\n", "*Copyright Amazon.com, Inc. or its affiliates. 
All Rights Reserved* \n", @@ -19,7 +19,6 @@ }, { "cell_type": "markdown", - "id": "9fec995d", "metadata": {}, "source": [ "### Prerequisites\n", @@ -55,7 +54,6 @@ }, { "cell_type": "markdown", - "id": "698592f9", "metadata": {}, "source": [ "In this notebook, \n", @@ -78,7 +76,6 @@ { "cell_type": "code", "execution_count": null, - "id": "fbfbf162", "metadata": {}, "outputs": [], "source": [ @@ -127,7 +124,6 @@ { "cell_type": "code", "execution_count": null, - "id": "92884fc8", "metadata": {}, "outputs": [], "source": [ @@ -163,7 +159,6 @@ }, { "cell_type": "markdown", - "id": "835508e5", "metadata": {}, "source": [ "## Download Sample data \n", @@ -176,7 +171,6 @@ { "cell_type": "code", "execution_count": null, - "id": "cafb2242", "metadata": {}, "outputs": [], "source": [ @@ -187,7 +181,6 @@ }, { "cell_type": "markdown", - "id": "8212936b", "metadata": {}, "source": [ "## In this step we will create a new ***blog_(yyyymmdd)*** keyspace and ***online_retail*** table\n", @@ -216,7 +209,6 @@ { "cell_type": "code", "execution_count": null, - "id": "4e9a6058", "metadata": {}, "outputs": [], "source": [ @@ -251,7 +243,6 @@ }, { "cell_type": "markdown", - "id": "90e2c6a8", "metadata": {}, "source": [ "## Loading Data\n", @@ -261,7 +252,6 @@ { "cell_type": "code", "execution_count": null, - "id": "d423e632", "metadata": {}, "outputs": [], "source": [ @@ -339,7 +329,6 @@ }, { "cell_type": "markdown", - "id": "f3984022", "metadata": {}, "source": [ "## ML Code" @@ -347,18 +336,22 @@ }, { "cell_type": "markdown", - "id": "f81e86eb", "metadata": {}, "source": [ "Now that we have data in Keyspace, let's read the data from Keyspace into the data frame. Once you have data into data frame you can perform the data cleanup to make sure it’s ready to train the modal. \n", "\n", - "In this example we will also group the data based on Recency, Frequency and Monetary value to generate RFM Matrix." 
+ "In this example we will also group the data based on Recency, Frequency and Monetary value to generate the RFM Matrix. Our business objective given the data set is to cluster the customers using this specific metric called RFM. The RFM model is based on three quantitative factors:\n", + "\n", + " - Recency: How recently a customer has made a purchase.\n", + " - Frequency: How often a customer makes a purchase.\n", + " - Monetary Value: How much money a customer spends on purchases.\n", + "\n", + "RFM analysis numerically ranks a customer in each of these three categories, generally on a scale of 1 to 5 (the higher the number, the better the result). The “best” customer would receive a top score in every category. We’ll use pandas’s Quantile-based discretization function (qcut). It helps discretize values into equal-sized buckets based on sample quantiles. At the end we see predicted clusters/segments for customers, described like \"New Customers\", \"Hibernating\", \"Promising\" etc. " ] }, { "cell_type": "code", "execution_count": null, - "id": "876e4ed3", "metadata": {}, "outputs": [], "source": [ @@ -446,16 +439,21 @@ }, { "cell_type": "markdown", - "id": "435c1499", "metadata": {}, "source": [ - "Now that we have our final dataset, we will start our training." + "Now that we have our final dataset, we will start our training. \n", + "\n", + "Here you will notice that we are doing feature scaling by using a transform function that scales each feature to a given range. The MinMaxScaler() function scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one in our case.\n", + "\n", + "Next we do fit (calculates the scaling parameters from the training data) and transform (applies them to convert the data points) on the input data in a single step. 
This fit_transform() method used below is basically a combination of fit method and transform method, it is equivalent to fit(). transform(). \n", + "\n", + "\n", + "Next, the KMeans algorithm creates the clusters (customers grouped together based on various attributes in the data set). This cluster information (a.k.a segments) can be used for targeted marketing campaigns." ] }, { "cell_type": "code", "execution_count": null, - "id": "6715aa6a", "metadata": {}, "outputs": [], "source": [ @@ -468,12 +466,11 @@ "kmeans = KMeans(n_clusters=6).fit(df)\n", "\n", "# Result\n", - "kumeler = kmeans.labels_" + "segment = kmeans.labels_" ] }, { "cell_type": "markdown", - "id": "0e9da533", "metadata": {}, "source": [ "Let’s visualize the data to see how records are distributed in different clusters." @@ -482,38 +479,70 @@ { "cell_type": "code", "execution_count": null, - "id": "62dfba86", "metadata": {}, "outputs": [], "source": [ "# Visualize the clusters\n", "import matplotlib.pyplot as plt\n", "\n", - "final_df = pd.DataFrame({\"customer_id\": rfm.index, \"Kumeler\": kumeler})\n", - "bucket_deta = final_df.groupby(\"Kumeler\").agg({\"customer_id\": \"count\"}).head()\n", - "index_deta = final_df.groupby(\"Kumeler\").agg({\"Kumeler\": \"max\"}).head()\n", - "index_deta[\"Kumeler\"] = index_deta[\"Kumeler\"].astype(int)\n", - "dataFrame = pd.DataFrame(data=bucket_deta[\"customer_id\"], index=index_deta[\"Kumeler\"])\n", + "final_df = pd.DataFrame({\"customer_id\": rfm.index, \"Segment\": segment})\n", + "bucket_data = final_df.groupby(\"Segment\").agg({\"customer_id\": \"count\"}).head()\n", + "index_data = final_df.groupby(\"Segment\").agg({\"Segment\": \"max\"}).head()\n", + "index_data[\"Segment\"] = index_data[\"Segment\"].astype(int)\n", + "dataFrame = pd.DataFrame(data=bucket_data[\"customer_id\"], index=index_data[\"Segment\"])\n", "dataFrame.rename(columns={\"customer_id\": \"Total Customers\"}).plot.bar(\n", - " rot=70, title=\"RFM clustoring\"\n", - ")\n", + " 
rot=70, title=\"RFM clustering\"\n", + ") \n", "# dataFrame.plot.bar(rot=70, title=\"RFM clustoring\");\n", "plt.show(block=True);" ] }, { "cell_type": "markdown", - "id": "95da7c87", + "metadata": {}, + "source": [ + "## (Optional)\n", + "Next, we save the customer segments that have been identified by the ML model back to an Amazon Keyspaces table for targeted marketing. A batch job could read this data and run targeted campaigns to customers in specific segments." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Create ml_clustering_results table to store results \n", + "createTable = \"\"\"CREATE TABLE IF NOT EXISTS %s.ml_clustering_results ( \n", + " run_id text,\n", + " segment int,\n", + " total_customers int,\n", + " run_date date,\n", + " PRIMARY KEY (run_id, segment));\n", + "\"\"\"\n", + "cr = session.execute(createTable % keyspaces_schema)\n", + "time.sleep(20)\n", + "print(\"Table 'ml_clustering_results' created\")\n", + " \n", + "insert_ml = (\n", + " \"INSERT INTO \"\n", + " + keyspaces_schema\n", + " + '.ml_clustering_results' \n", + " + '(\"run_id\",\"segment\",\"total_customers\",\"run_date\") ' \n", + " + 'VALUES (?,?,?,?); '\n", + ")" + ] + }, + { + "cell_type": "markdown", "metadata": {}, "source": [ "## Cleanup \n", - "In this step we will drop the Keyspaces to prevent future charges" + "Finally, we clean up the resources created during this tutorial to avoid incurring additional charges. So in this step we will drop the Keyspaces to prevent future charges" ] }, { "cell_type": "code", "execution_count": null, - "id": "b6ea31d3", "metadata": {}, "outputs": [], "source": [ @@ -526,10 +555,19 @@ ")" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It may take a few seconds to a minute to complete the deletion of keyspace and tables. 
When you delete a keyspace, the keyspace and all of its tables are deleted and you stop accruing charges from them.\n", + "\n", + "## Conclusion\n", + "This notebook showed python code that helped you to ingest customer data from Amazon Keyspaces into SageMaker and train a clustering model that allowed you to segment customers. You could use this information for targeted marketing, thus greatly improving your business KPI." + ] + }, { "cell_type": "code", "execution_count": null, - "id": "87221772", "metadata": {}, "outputs": [], "source": [] @@ -540,7 +578,7 @@ "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", - "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" + "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-west-1:742091327244:image/datascience-1.0" }, "language_info": { "codemirror_mode": { From 3daaf88be2496609b62629b471d91779126b9680 Mon Sep 17 00:00:00 2001 From: Vadim Lyakhovich Date: Tue, 28 Jun 2022 21:08:21 -0700 Subject: [PATCH 4/6] reformat using black-nb -l 100 SageMaker_Keyspaces_ml_example.ipynb --- .../SageMaker_Keyspaces_ml_example.ipynb | 36 +++++++++++++++---- 1 file changed, 30 insertions(+), 6 deletions(-) diff --git a/ingest_data/sagemaker-keyspaces/SageMaker_Keyspaces_ml_example.ipynb b/ingest_data/sagemaker-keyspaces/SageMaker_Keyspaces_ml_example.ipynb index a43401ed5e..01aa7902f4 100644 --- a/ingest_data/sagemaker-keyspaces/SageMaker_Keyspaces_ml_example.ipynb +++ b/ingest_data/sagemaker-keyspaces/SageMaker_Keyspaces_ml_example.ipynb @@ -2,6 +2,7 @@ "cells": [ { "cell_type": "markdown", + "id": "a67683f9", "metadata": {}, "source": [ "## Train Machine Learning Models using Amazon Keyspaces as a Data Source. 
\n", @@ -19,6 +20,7 @@ }, { "cell_type": "markdown", + "id": "740f14cb", "metadata": {}, "source": [ "### Prerequisites\n", @@ -54,6 +56,7 @@ }, { "cell_type": "markdown", + "id": "ce83dfe4", "metadata": {}, "source": [ "In this notebook, \n", @@ -76,6 +79,7 @@ { "cell_type": "code", "execution_count": null, + "id": "dfd67d7a", "metadata": {}, "outputs": [], "source": [ @@ -124,6 +128,7 @@ { "cell_type": "code", "execution_count": null, + "id": "b3323e0a", "metadata": {}, "outputs": [], "source": [ @@ -159,6 +164,7 @@ }, { "cell_type": "markdown", + "id": "3277843a", "metadata": {}, "source": [ "## Download Sample data \n", @@ -171,6 +177,7 @@ { "cell_type": "code", "execution_count": null, + "id": "932b078d", "metadata": {}, "outputs": [], "source": [ @@ -181,6 +188,7 @@ }, { "cell_type": "markdown", + "id": "915e0088", "metadata": {}, "source": [ "## In this step we will create a new ***blog_(yyyymmdd)*** keyspace and ***online_retail*** table\n", @@ -209,6 +217,7 @@ { "cell_type": "code", "execution_count": null, + "id": "f1724300", "metadata": {}, "outputs": [], "source": [ @@ -243,6 +252,7 @@ }, { "cell_type": "markdown", + "id": "29d2e1e4", "metadata": {}, "source": [ "## Loading Data\n", @@ -252,6 +262,7 @@ { "cell_type": "code", "execution_count": null, + "id": "0bcdb19f", "metadata": {}, "outputs": [], "source": [ @@ -329,6 +340,7 @@ }, { "cell_type": "markdown", + "id": "d45b81c3", "metadata": {}, "source": [ "## ML Code" @@ -336,6 +348,7 @@ }, { "cell_type": "markdown", + "id": "720476aa", "metadata": {}, "source": [ "Now that we have data in Keyspace, let's read the data from Keyspace into the data frame. Once you have data into data frame you can perform the data cleanup to make sure it’s ready to train the modal. 
\n", @@ -352,6 +365,7 @@ { "cell_type": "code", "execution_count": null, + "id": "4c470775", "metadata": {}, "outputs": [], "source": [ @@ -439,6 +453,7 @@ }, { "cell_type": "markdown", + "id": "b9a49cdd", "metadata": {}, "source": [ "Now that we have our final dataset, we will start our training. \n", @@ -454,6 +469,7 @@ { "cell_type": "code", "execution_count": null, + "id": "a4d21a63", "metadata": {}, "outputs": [], "source": [ @@ -471,6 +487,7 @@ }, { "cell_type": "markdown", + "id": "cefbf852", "metadata": {}, "source": [ "Let’s visualize the data to see how records are distributed in different clusters." @@ -479,6 +496,7 @@ { "cell_type": "code", "execution_count": null, + "id": "96fc6f46", "metadata": {}, "outputs": [], "source": [ @@ -492,13 +510,14 @@ "dataFrame = pd.DataFrame(data=bucket_data[\"customer_id\"], index=index_data[\"Segment\"])\n", "dataFrame.rename(columns={\"customer_id\": \"Total Customers\"}).plot.bar(\n", " rot=70, title=\"RFM clustering\"\n", - ") \n", + ")\n", "# dataFrame.plot.bar(rot=70, title=\"RFM clustoring\");\n", "plt.show(block=True);" ] }, { "cell_type": "markdown", + "id": "9d21faef", "metadata": {}, "source": [ "## (Optional)\n", @@ -508,10 +527,11 @@ { "cell_type": "code", "execution_count": null, + "id": "5970b871", "metadata": {}, "outputs": [], "source": [ - "# Create ml_clustering_results table to store results \n", + "# Create ml_clustering_results table to store results\n", "createTable = \"\"\"CREATE TABLE IF NOT EXISTS %s.ml_clustering_results ( \n", " run_id text,\n", " segment int,\n", @@ -522,18 +542,19 @@ "cr = session.execute(createTable % keyspaces_schema)\n", "time.sleep(20)\n", "print(\"Table 'ml_clustering_results' created\")\n", - " \n", + "\n", "insert_ml = (\n", " \"INSERT INTO \"\n", " + keyspaces_schema\n", - " + '.ml_clustering_results' \n", - " + '(\"run_id\",\"segment\",\"total_customers\",\"run_date\") ' \n", - " + 'VALUES (?,?,?,?); '\n", + " + \".ml_clustering_results\"\n", + " + 
'(\"run_id\",\"segment\",\"total_customers\",\"run_date\") '\n", + " + \"VALUES (?,?,?,?); \"\n", ")" ] }, { "cell_type": "markdown", + "id": "f04255a1", "metadata": {}, "source": [ "## Cleanup \n", @@ -543,6 +564,7 @@ { "cell_type": "code", "execution_count": null, + "id": "5e58f5a6", "metadata": {}, "outputs": [], "source": [ @@ -557,6 +579,7 @@ }, { "cell_type": "markdown", + "id": "ce437b8f", "metadata": {}, "source": [ "It may take a few seconds to a minute to complete the deletion of keyspace and tables. When you delete a keyspace, the keyspace and all of its tables are deleted and you stop accruing charges from them.\n", @@ -568,6 +591,7 @@ { "cell_type": "code", "execution_count": null, + "id": "6b35cc0d", "metadata": {}, "outputs": [], "source": [] From 20d21eff8785fbd2e70cddd7bc161a377b901ffd Mon Sep 17 00:00:00 2001 From: Vadim Lyakhovich Date: Tue, 28 Jun 2022 22:07:37 -0700 Subject: [PATCH 5/6] reformat using black-nb -l 100 SageMaker_Keyspaces_ml_example.ipynb --- .../SageMaker_Keyspaces_ml_example.ipynb | 74 ++++++++++++------- 1 file changed, 46 insertions(+), 28 deletions(-) diff --git a/ingest_data/sagemaker-keyspaces/SageMaker_Keyspaces_ml_example.ipynb b/ingest_data/sagemaker-keyspaces/SageMaker_Keyspaces_ml_example.ipynb index 01aa7902f4..da07ea5494 100644 --- a/ingest_data/sagemaker-keyspaces/SageMaker_Keyspaces_ml_example.ipynb +++ b/ingest_data/sagemaker-keyspaces/SageMaker_Keyspaces_ml_example.ipynb @@ -2,7 +2,7 @@ "cells": [ { "cell_type": "markdown", - "id": "a67683f9", + "id": "5f11bc6e", "metadata": {}, "source": [ "## Train Machine Learning Models using Amazon Keyspaces as a Data Source. 
\n", @@ -20,7 +20,7 @@ }, { "cell_type": "markdown", - "id": "740f14cb", + "id": "4ba54244", "metadata": {}, "source": [ "### Prerequisites\n", @@ -56,7 +56,7 @@ }, { "cell_type": "markdown", - "id": "ce83dfe4", + "id": "3299688a", "metadata": {}, "source": [ "In this notebook, \n", @@ -79,7 +79,7 @@ { "cell_type": "code", "execution_count": null, - "id": "dfd67d7a", + "id": "40527f71", "metadata": {}, "outputs": [], "source": [ @@ -128,7 +128,7 @@ { "cell_type": "code", "execution_count": null, - "id": "b3323e0a", + "id": "74968829", "metadata": {}, "outputs": [], "source": [ @@ -164,7 +164,7 @@ }, { "cell_type": "markdown", - "id": "3277843a", + "id": "b94bca48", "metadata": {}, "source": [ "## Download Sample data \n", @@ -177,7 +177,7 @@ { "cell_type": "code", "execution_count": null, - "id": "932b078d", + "id": "4cee4b1b", "metadata": {}, "outputs": [], "source": [ @@ -188,7 +188,7 @@ }, { "cell_type": "markdown", - "id": "915e0088", + "id": "b2445623", "metadata": {}, "source": [ "## In this step we will create a new ***blog_(yyyymmdd)*** keyspace and ***online_retail*** table\n", @@ -217,7 +217,7 @@ { "cell_type": "code", "execution_count": null, - "id": "f1724300", + "id": "f80c6ada", "metadata": {}, "outputs": [], "source": [ @@ -252,7 +252,7 @@ }, { "cell_type": "markdown", - "id": "29d2e1e4", + "id": "dd5322a5", "metadata": {}, "source": [ "## Loading Data\n", @@ -262,7 +262,7 @@ { "cell_type": "code", "execution_count": null, - "id": "0bcdb19f", + "id": "8d8d53bd", "metadata": {}, "outputs": [], "source": [ @@ -340,7 +340,7 @@ }, { "cell_type": "markdown", - "id": "d45b81c3", + "id": "7f27dfa2", "metadata": {}, "source": [ "## ML Code" @@ -348,7 +348,7 @@ }, { "cell_type": "markdown", - "id": "720476aa", + "id": "c89529f1", "metadata": {}, "source": [ "Now that we have data in Keyspace, let's read the data from Keyspace into the data frame. Once you have data into data frame you can perform the data cleanup to make sure it’s ready to train the modal. 
\n", @@ -365,7 +365,7 @@ { "cell_type": "code", "execution_count": null, - "id": "4c470775", + "id": "37d10662", "metadata": {}, "outputs": [], "source": [ @@ -453,7 +453,7 @@ }, { "cell_type": "markdown", - "id": "b9a49cdd", + "id": "50c8b2a7", "metadata": {}, "source": [ "Now that we have our final dataset, we will start our training. \n", @@ -469,7 +469,7 @@ { "cell_type": "code", "execution_count": null, - "id": "a4d21a63", + "id": "d1bdcfaa", "metadata": {}, "outputs": [], "source": [ @@ -487,7 +487,7 @@ }, { "cell_type": "markdown", - "id": "cefbf852", + "id": "04ca83e8", "metadata": {}, "source": [ "Let’s visualize the data to see how records are distributed in different clusters." @@ -496,7 +496,7 @@ { "cell_type": "code", "execution_count": null, - "id": "96fc6f46", + "id": "1717f156", "metadata": {}, "outputs": [], "source": [ @@ -517,7 +517,7 @@ }, { "cell_type": "markdown", - "id": "9d21faef", + "id": "b151fb27", "metadata": {}, "source": [ "## (Optional)\n", @@ -527,7 +527,7 @@ { "cell_type": "code", "execution_count": null, - "id": "5970b871", + "id": "063f4aa5", "metadata": {}, "outputs": [], "source": [ @@ -549,12 +549,30 @@ " + \".ml_clustering_results\"\n", " + '(\"run_id\",\"segment\",\"total_customers\",\"run_date\") '\n", " + \"VALUES (?,?,?,?); \"\n", - ")" + ")\n", + "\n", + "prepared = session.prepare(insert_ml)\n", + "prepared.consistency_level = ConsistencyLevel.LOCAL_QUORUM\n", + "\n", + "run_id = \"101\"\n", + "dt = datetime.now()\n", + "\n", + "for ind in dataFrame.index:\n", + " print(ind, dataFrame[\"customer_id\"][ind])\n", + " r = session.execute(\n", + " prepared,\n", + " (\n", + " run_id,\n", + " ind,\n", + " dataFrame[\"customer_id\"][ind],\n", + " dt,\n", + " ),\n", + " )" ] }, { "cell_type": "markdown", - "id": "f04255a1", + "id": "0665dcb8", "metadata": {}, "source": [ "## Cleanup \n", @@ -564,7 +582,7 @@ { "cell_type": "code", "execution_count": null, - "id": "5e58f5a6", + "id": "5e51f453", "metadata": {}, "outputs": [], 
"source": [ @@ -579,7 +597,7 @@ }, { "cell_type": "markdown", - "id": "ce437b8f", + "id": "1f680e91", "metadata": {}, "source": [ "It may take a few seconds to a minute to complete the deletion of keyspace and tables. When you delete a keyspace, the keyspace and all of its tables are deleted and you stop accruing charges from them.\n", @@ -591,7 +609,7 @@ { "cell_type": "code", "execution_count": null, - "id": "6b35cc0d", + "id": "29be2259", "metadata": {}, "outputs": [], "source": [] @@ -600,9 +618,9 @@ "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { - "display_name": "Python 3 (Data Science)", + "display_name": "conda_python3", "language": "python", - "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-west-1:742091327244:image/datascience-1.0" + "name": "conda_python3" }, "language_info": { "codemirror_mode": { @@ -614,7 +632,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.7.10" + "version": "3.8.12" } }, "nbformat": 4, From 4664b4304fd35232a81338a37ed6f6aa885d1553 Mon Sep 17 00:00:00 2001 From: Vadim Lyakhovich Date: Tue, 5 Jul 2022 15:29:09 -0700 Subject: [PATCH 6/6] commit to address reviewer feedback --- ingest_data/sagemaker-keyspaces/README.md | 6 +- .../SageMaker_Keyspaces_ml_example.ipynb | 139 ++++++++++++------ 2 files changed, 96 insertions(+), 49 deletions(-) diff --git a/ingest_data/sagemaker-keyspaces/README.md b/ingest_data/sagemaker-keyspaces/README.md index c25ed9d07a..44174614a7 100644 --- a/ingest_data/sagemaker-keyspaces/README.md +++ b/ingest_data/sagemaker-keyspaces/README.md @@ -1,9 +1,9 @@ -# Train Machine Learning Models using Amazon Keyspaces as a Data Source +# Train Machine Learning Models using Amazon Keyspaces as a Data Source Please read [Train machine learning models using Amazon Keyspaces as a data source](https://aws.amazon.com/blogs/machine-learning/train-machine-learning-models-using-amazon-keyspaces-as-a-data-source/) blog for more detailed 
instructions to run this notebook. -We will provide step-by-step instructions to use SageMaker to ingest customer data from Amazon Keyspaces and train a clustering model that would enable you to segment customers. This information can be used for targeted marketing, greatly improving your business KPI. +We provide step-by-step instructions to use SageMaker to ingest customer data from Amazon Keyspaces and train a clustering model that enables you to segment customers. This information can be used for targeted marketing, greatly improving your business KPI. 1. First, we install Sigv4 driver to connect to Amazon Keyspaces @@ -13,7 +13,7 @@ We will provide step-by-step instructions to use SageMaker to ingest customer da 3. Next, we create new Keyspace ***blog_(yyyymmdd)*** and a new table ***online_retail*** 3. Next, we download retail data about customers. 3. Next, we ingest retail data about customers into Keyspaces. -3. Next, we use a notebook available within SageMaker Studio to collect data from the Keyspaces database, and prepare data for training using KNN Algorithm. Most of our customers use SageMaker Studio for end-to-end development of ML Use Cases. They use this notebook as a starting point and customize it for their use case. Also, they are able to share this with other collaborators without requiring them to install any additional software. +3. Next, we use a notebook available within SageMaker Studio to collect data from the Keyspaces database, and prepare data for training using KNN Algorithm. Most of our customers use SageMaker Studio for end-to-end development of ML Use Cases. They use this notebook as a starting point and customize it for their use case. Also, they are able to share this with other collaborators without requiring them to install any additional software. 3. Next, we train the data for clustering. 3. When the training is completed, we can view the mapping between customers and their associated clusters. 3. 
And finally, we run a Cleanup Step to drop Keyspaces table to avoid future charges. diff --git a/ingest_data/sagemaker-keyspaces/SageMaker_Keyspaces_ml_example.ipynb b/ingest_data/sagemaker-keyspaces/SageMaker_Keyspaces_ml_example.ipynb index da07ea5494..65cdc17001 100644 --- a/ingest_data/sagemaker-keyspaces/SageMaker_Keyspaces_ml_example.ipynb +++ b/ingest_data/sagemaker-keyspaces/SageMaker_Keyspaces_ml_example.ipynb @@ -2,10 +2,10 @@ "cells": [ { "cell_type": "markdown", - "id": "5f11bc6e", + "id": "207aa954", "metadata": {}, "source": [ - "## Train Machine Learning Models using Amazon Keyspaces as a Data Source. \n", + "## Train Machine Learning Models using Amazon Keyspaces as a Data Source \n", "\n", "Contributors\n", "- `Vadim Lyakhovich (AWS)`\n", @@ -20,7 +20,7 @@ }, { "cell_type": "markdown", - "id": "4ba54244", + "id": "2d10bdcf", "metadata": {}, "source": [ "### Prerequisites\n", @@ -56,7 +56,7 @@ }, { "cell_type": "markdown", - "id": "3299688a", + "id": "29439d73", "metadata": {}, "source": [ "In this notebook, \n", @@ -67,10 +67,10 @@ "\n", "2. Next, we establish a connection to Amazon Keyspaces \n", "3. Next, we create a new `Keyspace ***blog_(yyyymmdd)***` and a new `table ***online_retail***`\n", - "3. Next, we will download retail data about customers.\n", - "3. Next, we will ingest retail data about customers into Keyspaces.\n", - "3. Next, we use a notebook available within SageMaker Studio to collect data from Keyspaces database, and prepare data for training using KNN Algorithm. Most of our customers use SageMaker Studio for end to end development of ML Use Cases. They could use this notebook as a base and customize it quickly for their use case. Additionally, they will be able to share this with other collaborators without requiring them to install any additional software. \n", - "3. Next, we will train the data for clustering.\n", + "3. Next, we download retail data about customers.\n", + "3. 
Next, we ingest retail data about customers into Keyspaces.\n", + "3. Next, we use a notebook available within SageMaker Studio to collect data from Keyspaces database, and prepare data for training using KNN Algorithm. Most of our customers use SageMaker Studio for end to end development of ML Use Cases. They could use this notebook as a base and customize it quickly for their use case. Additionally, the customers can share this with other collaborators without requiring them to install any additional software. \n", + "3. Next, we train the data for clustering.\n", "3. After the training is complete, we can view the mapping between customer and their associated cluster.\n", "3. And finally, Cleanup Step to drop Keyspaces table to avoid future charges. \n", "\n" @@ -79,7 +79,7 @@ { "cell_type": "code", "execution_count": null, - "id": "40527f71", + "id": "17cd92c7", "metadata": {}, "outputs": [], "source": [ @@ -128,7 +128,7 @@ { "cell_type": "code", "execution_count": null, - "id": "74968829", + "id": "aee7e72d", "metadata": {}, "outputs": [], "source": [ @@ -164,7 +164,7 @@ }, { "cell_type": "markdown", - "id": "b94bca48", + "id": "e14d7fa9", "metadata": {}, "source": [ "## Download Sample data \n", @@ -177,7 +177,7 @@ { "cell_type": "code", "execution_count": null, - "id": "4cee4b1b", + "id": "c3aa9ea8", "metadata": {}, "outputs": [], "source": [ @@ -188,11 +188,10 @@ }, { "cell_type": "markdown", - "id": "b2445623", + "id": "717151f6", "metadata": {}, "source": [ - "## In this step we will create a new ***blog_(yyyymmdd)*** keyspace and ***online_retail*** table\n", - "In this step we will create a new `Keyspace ***blog_yyymmdd***` and `table ***online_retail***`\n", + "#### In this step we create a new ***blog_(yyyymmdd)*** keyspace and ***online_retail*** table\n", "\n", "```\n", "CREATE KEYSPACE IF NOT EXISTS blog_yyyymmdd\n", @@ -217,7 +216,7 @@ { "cell_type": "code", "execution_count": null, - "id": "f80c6ada", + "id": "98ccf29c", "metadata": {}, 
"outputs": [], "source": [ @@ -252,7 +251,7 @@ }, { "cell_type": "markdown", - "id": "dd5322a5", + "id": "ce87792a", "metadata": {}, "source": [ "## Loading Data\n", @@ -262,7 +261,7 @@ { "cell_type": "code", "execution_count": null, - "id": "8d8d53bd", + "id": "2ce1813a", "metadata": {}, "outputs": [], "source": [ @@ -340,7 +339,7 @@ }, { "cell_type": "markdown", - "id": "7f27dfa2", + "id": "e2dc6a6a", "metadata": {}, "source": [ "## ML Code" @@ -348,34 +347,83 @@ }, { "cell_type": "markdown", - "id": "c89529f1", + "id": "79e48035", "metadata": {}, "source": [ - "Now that we have data in Keyspace, let's read the data from Keyspace into the data frame. Once you have data into data frame you can perform the data cleanup to make sure it’s ready to train the modal. \n", + "Now that we have data in Keyspaces, let's read it from Keyspaces into a data frame. Once the data is in a data frame, you can perform data cleanup to make sure it’s ready for training the model. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "433fbc4a", + "metadata": {}, + "outputs": [], + "source": [ + "# Reading Data from Keyspaces\n", + "r = session.execute(\"select * from \" + keyspaces_schema + \".online_retail\")\n", + "\n", + "df = DataFrame(r)\n", + "df.head(100)" + ] + }, + { + "cell_type": "markdown", + "id": "cdfa1734", + "metadata": {}, + "source": [ + "In this example, we use CQL to read records from the Keyspaces table.\n", + "\n", + "In some ML use-cases, you may need to read the same data from the same Keyspaces table multiple times. In this case, we recommend saving your data to an Amazon S3 bucket to avoid incurring additional costs reading from Amazon Keyspaces. Depending on your scenario, you may also use Amazon EMR to ingest a very large Amazon S3 file into SageMaker."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0e06c0f7", + "metadata": {}, + "outputs": [], + "source": [ + "## Code to save Python DataFrame to S3\n", + "import sagemaker\n", + "from io import StringIO # python3 (or BytesIO for python2)\n", + "\n", + "smclient = boto3.Session().client(\"sagemaker\")\n", + "sess = sagemaker.Session()\n", + "bucket = sess.default_bucket() # Set a default S3 bucket\n", + "print(bucket)\n", + "\n", + "sess = sagemaker.Session()\n", "\n", - "In this example we will also group the data based on Recency, Frequency and Monetary value to generate RFM Matrix. Our business objective given the data set is to cluster the customers using this specific metric call RFM. The RFM model is based on three quantitative factors:\n", + "\n", + "csv_buffer = StringIO()\n", + "df.to_csv(csv_buffer)\n", + "s3_resource = boto3.resource(\"s3\")\n", + "s3_resource.Object(bucket, \"out/saved_online_retail.csv\").put(Body=csv_buffer.getvalue())" + ] + }, + { + "cell_type": "markdown", + "id": "e7b822ce", + "metadata": {}, + "source": [ + "In this example we also group the data based on Recency, Frequency and Monetary value to generate RFM Matrix. Our business objective given the data set is to cluster the customers using this specific metric call RFM. The RFM model is based on three quantitative factors:\n", "\n", " - Recency: How recently a customer has made a purchase.\n", " - Frequency: How often a customer makes a purchase.\n", " - Monetary Value: How much money a customer spends on purchases.\n", "\n", - "RFM analysis numerically ranks a customer in each of these three categories, generally on a scale of 1 to 5 (the higher the number, the better the result). The “best” customer would receive a top score in every category. We’ll use pandas’s Quantile-based discretization function (qcut). It will help discretize values into equal-sized buckets based or based on sample quantiles. 
At end we see predicted cluster / segments for customers described like \"New Customers\", \"Hibernating\", \"Promising\" etc. " + "RFM analysis numerically ranks a customer in each of these three categories, generally on a scale of 1 to 5 (the higher the number, the better the result). The best customer receives a top score in every category. We use pandas’s quantile-based discretization function (`qcut`), which discretizes values into equal-sized buckets based on sample quantiles. At the end, we see the predicted clusters / segments for customers, described as \"New Customers\", \"Hibernating\", \"Promising\", etc. " ] }, { "cell_type": "code", "execution_count": null, - "id": "37d10662", + "id": "5199e64a", "metadata": {}, "outputs": [], "source": [ "# Prepare Data\n", - "\n", - "r = session.execute(\"select * from \" + keyspaces_schema + \".online_retail\")\n", - "\n", - "df = DataFrame(r)\n", - "df.head(100)\n", - "\n", "df.count()\n", "df[\"description\"].nunique()\n", "df[\"totalprice\"] = df[\"quantity\"] * df[\"price\"]\n", @@ -453,14 +501,14 @@ }, { "cell_type": "markdown", - "id": "50c8b2a7", + "id": "9ad3ab64", "metadata": {}, "source": [ - "Now that we have our final dataset, we will start our training. \n", + "Now that we have our final dataset, we can start our training. \n", "\n", - "Here you will notice that we are doing data engineering by using a transform function that scales each feature to a given range. MinMaxScaler() fubction scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one in our case.\n", + "Here, notice that we do data engineering by using a transform function that scales each feature to a given range. The `MinMaxScaler()` function scales and translates each feature individually such that it is in the given range on the training set, e.g. 
between zero and one in our case.\n", "\n", - "Next we do transform (analyzes the data to generate the coefficients) and fit (calculates the parameters/weights on training data) on the input data at a single time and converts the data points. This fit_transform() method used below is basically a combination of fit method and transform method, it is equivalent to fit(). transform(). \n", + "Next, we call `fit_transform()` on the input data: fit computes the scaling parameters/weights from the training data, and transform applies them to convert the data points. The `fit_transform()` method used below combines the fit and transform methods in a single step; it is equivalent to calling `fit()` followed by `transform()`. \n", "\n", "\n", "Next, the KMeans algorithm creates the clusters (customers grouped together based on various attributes in the data set). This cluster information (a.k.a segments) can be used for targeted marketing campaigns." @@ -469,7 +517,7 @@ { "cell_type": "code", "execution_count": null, - "id": "d1bdcfaa", + "id": "ea020e9f", "metadata": {}, "outputs": [], "source": [ @@ -487,7 +535,7 @@ }, { "cell_type": "markdown", - "id": "04ca83e8", + "id": "46ca68e4", "metadata": {}, "source": [ "Let’s visualize the data to see how records are distributed in different clusters." @@ -496,7 +544,7 @@ { "cell_type": "code", "execution_count": null, - "id": "1717f156", + "id": "e1258e4a", "metadata": {}, "outputs": [], "source": [ @@ -517,21 +565,20 @@ }, { "cell_type": "markdown", - "id": "b151fb27", + "id": "e822c021", "metadata": {}, "source": [ - "## (Optional)\n", "Next, we save the customer segments that have been identified by the ML model back to an Amazon Keyspaces table for targeted marketing. A batch job could read this data and run targeted campaigns to customers in specific segments."
] }, { "cell_type": "code", "execution_count": null, - "id": "063f4aa5", + "id": "bc74f01d", "metadata": {}, "outputs": [], "source": [ - "# Create ml_clustering_results table to store results\n", + "# Create ml_clustering_results table to store the results\n", "createTable = \"\"\"CREATE TABLE IF NOT EXISTS %s.ml_clustering_results ( \n", " run_id text,\n", " segment int,\n", @@ -572,17 +619,17 @@ }, { "cell_type": "markdown", - "id": "0665dcb8", + "id": "2bd56924", "metadata": {}, "source": [ "## Cleanup \n", - "Finally, we clean up the resources created during this tutorial to avoid incurring additional charges. So in this step we will drop the Keyspaces to prevent future charges" + "Finally, we clean up the resources created during this tutorial to avoid incurring additional charges. In this step, we drop the keyspace to prevent future charges." ] }, { "cell_type": "code", "execution_count": null, - "id": "5e51f453", + "id": "3b76a063", "metadata": {}, "outputs": [], "source": [ @@ -597,7 +644,7 @@ }, { "cell_type": "markdown", - "id": "1f680e91", + "id": "039a2999", "metadata": {}, "source": [ "It may take a few seconds to a minute to complete the deletion of keyspace and tables. When you delete a keyspace, the keyspace and all of its tables are deleted and you stop accruing charges from them.\n", @@ -609,7 +656,7 @@ { "cell_type": "code", "execution_count": null, - "id": "29be2259", + "id": "d9de6153", "metadata": {}, "outputs": [], "source": []
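The RFM scoring added in the notebook above relies on pandas `qcut`. The following standalone sketch shows what quantile-based discretization does with made-up customer data; the column names (`recency_days`, `frequency`, `monetary`) are hypothetical and not taken from the notebook's dataset:

```python
import pandas as pd

# Toy customer data (hypothetical values, for illustration only)
df = pd.DataFrame(
    {
        "customer_id": ["c1", "c2", "c3", "c4", "c5"],
        "recency_days": [5, 40, 90, 15, 60],           # days since last purchase
        "frequency": [20, 3, 1, 12, 2],                # number of purchases
        "monetary": [500.0, 80.0, 20.0, 300.0, 45.0],  # total spend
    }
)

# qcut assigns each value to an equal-sized bucket based on sample quantiles.
# Labels are reversed for recency because a *lower* recency (a more recent
# purchase) is better.
df["r_score"] = pd.qcut(df["recency_days"], q=5, labels=[5, 4, 3, 2, 1]).astype(int)
df["f_score"] = pd.qcut(df["frequency"], q=5, labels=[1, 2, 3, 4, 5]).astype(int)
df["m_score"] = pd.qcut(df["monetary"], q=5, labels=[1, 2, 3, 4, 5]).astype(int)

# Concatenate the three scores into a segment code such as "555" (best customer)
df["rfm"] = (
    df["r_score"].astype(str) + df["f_score"].astype(str) + df["m_score"].astype(str)
)
print(df[["customer_id", "rfm"]])
```

With five rows and `q=5`, every customer lands in its own bucket, so the most recent, most frequent, highest-spending customer (`c1`) scores `"555"` while the least active one (`c3`) scores `"111"`.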
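The notebook's scaling discussion (fit vs. transform vs. `fit_transform`) can be made concrete with a tiny stand-in for scikit-learn's `MinMaxScaler`. `MiniMinMaxScaler` is a hypothetical, single-feature toy class written here for illustration, not part of any library:

```python
class MiniMinMaxScaler:
    """Toy, single-feature stand-in for sklearn's MinMaxScaler (illustration only)."""

    def fit(self, values):
        # fit: learn the parameters (minimum and range) from the training data
        self.lo = min(values)
        self.span = (max(values) - self.lo) or 1.0  # guard against constant input
        return self

    def transform(self, values):
        # transform: apply the learned parameters to convert the data points
        return [(v - self.lo) / self.span for v in values]

    def fit_transform(self, values):
        # fit_transform: fit() followed by transform() in one step
        return self.fit(values).transform(values)


data = [1.0, 5.0, 3.0]
scaler = MiniMinMaxScaler()
print(scaler.fit_transform(data))  # → [0.0, 1.0, 0.5]
```

Calling `fit_transform(data)` gives exactly the same result as `fit(data)` followed by `transform(data)`, which is the equivalence the notebook text describes.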