From 435506384dc397daf8307f5fb74ac97ade386dcd Mon Sep 17 00:00:00 2001 From: rbbutch <110025969+rbbutch@users.noreply.github.com> Date: Fri, 29 Jul 2022 18:20:13 +0100 Subject: [PATCH 1/5] Added new example notebook to demonstrate 'risk bucketing' (#3512) * Added new example notebook to demonstrate 'risk bucketing' Added a new example notebook to demonstrate 'risk bucketing' under the /use-cases folder. This is a technique used in credit risk and is relevant to FSI customers * Fixed spelling issues, added markup and removed dependency on local file Fixed spelling issues, added markup and removed dependency on local file * Corrected spelling mistakes Corrected spelling mistakes * Fixed issue with CI check not passing due to string concatenation Fixed issue with CI check not passing due to string concatenation * Split string across multiple lines Split string across multiple lines * Fixed formatting issue with pip install Fixed formatting issue with pip install * Fixed version incompatibility issue between s3fs and pandas Fixed version incompatibility issue between s3fs and pandas Co-authored-by: atqy <95724753+atqy@users.noreply.github.com> --- use-cases/credit_risk/risk_bucketing.ipynb | 477 +++++++++++++++++++++ use-cases/index.rst | 9 + 2 files changed, 486 insertions(+) create mode 100644 use-cases/credit_risk/risk_bucketing.ipynb diff --git a/use-cases/credit_risk/risk_bucketing.ipynb b/use-cases/credit_risk/risk_bucketing.ipynb new file mode 100644 index 0000000000..614fc5362f --- /dev/null +++ b/use-cases/credit_risk/risk_bucketing.ipynb @@ -0,0 +1,477 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Risk Bucketing\n", + "One of the most common use cases for machine learning in financial services is estimating the probability of default on a loan.\n", + "\n", + "Risk bucketing refers to the process of grouping borrowers with similar creditworthiness. Treating all borrowers equally will generally result in poor predictions, as the model cannot capture entirely different characteristics of the data all at once. By dividing borrowers into different groups based on risk characteristics, risk bucketing enables us to make accurate predictions.\n", + "\n", + "Risk bucketing is a good example of an unsupervised clustering problem. The K-means algorithm (which we use here) is one way we can perform risk bucketing.\n", + "\n", + "However, there is one major issue: how do we know the optimal number of risk buckets (clusters) to use for a given set of data/borrowers? This notebook demonstrates a number of techniques for calculating the optimal number of clusters:\n", + "- The Elbow Method\n", + "- Silhouette Scores\n", + "- Gap Analysis" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import boto3" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Our data source is the well-known German Credit Risk data." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "s3 = boto3.resource(\"s3\")\n", + "s3_sample = s3.Object(\n", + " \"sagemaker-sample-files\",\n", + " \"datasets/tabular/uci_statlog_german_credit_data/german_credit_data.csv\",\n", + ").get()[\"Body\"]\n", + "credit = pd.read_csv(s3_sample)\n", + "credit.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We drop the numeric columns *dependents* and *existingcredits* as we are not going to use them." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "credit.drop([\"dependents\", \"existingcredits\"], inplace=True, axis=1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Convert the *job* column, which contains categorical values, into a numerical one that contains ordinal values." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "credit[\"job\"].replace(\n", + " [\n", + " \"unemployed\",\n", + " \"unskilled\",\n", + " \"skilled employee / official\",\n", + " \"management / highly skilled\",\n", + " ],\n", + " [0, 1, 2, 3],\n", + " inplace=True,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "\n", + "sns.set()\n", + "plt.rcParams[\"figure.figsize\"] = (10, 6)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Drop all columns except for the following (numeric) ones:\n", + "- Age\n", + "- Job (ordinal ranging from 0=unemployed to 3=management / highly skilled)\n", + "- Credit amount\n", + "- Duration (the duration of the loan in months)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "numerical_credit = credit.select_dtypes(exclude=\"O\")\n", + "numerical_credit.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Plotting histograms of these four features, we can see that they are all positively skewed." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plt.figure(figsize=(10, 8))\n", + "k = 0\n", + "cols = numerical_credit.columns\n", + "for i, j in zip(range(len(cols)), cols):\n", + " k += 1\n", + " plt.subplot(2, 2, k)\n", + " plt.hist(numerical_credit.iloc[:, i])\n", + " plt.title(j)\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## The Elbow Method\n", + "Our first method for estimating the optimal number of clusters is the Elbow Method. This involves calculating *inertia* which is the sum of the squared distances of observations from their closest centroid.\n", + "\n", + "If we plot inertia against number of clusters (k) we can see an *elbow* (i.e. the curve starts to flatten out) around the value of k=4. This is an indication that increasing the number of clusters is undesirable (when traded-off against increased complexity).\n", + "\n", + "Here we are using the KMeans algorithm from Sklearn, as it provides inertia as part of its output. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.preprocessing import StandardScaler\n", + "from sklearn.cluster import KMeans\n", + "import numpy as np" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "scaler = StandardScaler()\n", + "scaled_credit = scaler.fit_transform(numerical_credit)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "inertia = []\n", + "for k in range(1, 10):\n", + " kmeans = KMeans(n_clusters=k)\n", + " kmeans.fit(scaled_credit)\n", + " inertia.append(kmeans.inertia_)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plt.plot(range(1, 10), inertia, \"bx-\")\n", + "plt.xlabel(\"Number of Clusters (k)\")\n", + "plt.ylabel(\"Inertia\")\n", + "plt.title(\"The Elbow Method\")\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Silhouette Scores\n", + "Silhouette scores take a value between 1 and -1. A value of 1 indicates an observation that is close to the correct centroid and correctly classified. A value of -1 shows that the observation is not correctly clustered.\n", + "\n", + "The strength of the Silhouette Score is that it takes into account both the intra-cluster distance (how close observations are to their centroid) and the inter-cluster distance (how far apart the centroids are). The formula for Silhouette Score is as follows:\n", + "\\begin{equation}\n", + "Silhouette = \\frac{x - y}{max(x, y)}\n", + "\\end{equation}\n", + "where x is the mean inter-cluster distance between clusters, and y is the mean intra-cluster distance.\n", + "\n", + "In this case, we can see that the peak Silhouette score occurs when the number of clusters (k) is 2. This implies that it is not worth using more than 2 clusters." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from matplotlib import cm\n", + "import matplotlib.pyplot as plt\n", + "from sklearn.metrics import silhouette_score" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "silhouette_scores = []\n", + "for n_clusters in range(2, 10):\n", + " clusterer = KMeans(n_clusters=n_clusters)\n", + " preds = clusterer.fit_predict(scaled_credit)\n", + " centers = clusterer.cluster_centers_\n", + " silhouette_scores.append(silhouette_score(scaled_credit, preds))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plt.plot(range(2, 10), silhouette_scores, \"bx-\")\n", + "plt.xlabel(\"Number of Clusters (k)\")\n", + "plt.ylabel(\"Silhouette Score\")\n", + "plt.title(\"Silhouette Scores\")\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Gap Analysis\n", + "Our final approach is *gap analysis*. This is based on the work of Tibshirani et al. (2001) which proposes finding the optimal number of clusters\n", + "based on a reference distribution.\n", + "\n", + "Our data consists of $n$ independent observations of $p$ features. 
The Euclidean distance between observations $i$ and $i'$ is:\n", + "\\begin{equation}\n", + "d_{ii'} = \\sum_{j} (x_{ij} - x_{i'j})^2\n", + "\\end{equation}\n", + "\n", + "And the sum of all pairwise distances for points in cluster r is:\n", + "\\begin{equation}\n", + "D_{r} = \\sum_{i,i' \\in C_{r}} d_{ii'}\n", + "\\end{equation}\n", + "\n", + "Then the pooled, within-cluster sum of squares around the cluster mean is:\n", + "\\begin{equation}\n", + "W_{k} = \\sum_{r=1}^k \\frac{1}{2n_{r}}D_{r}\n", + "\\end{equation}\n", + "\n", + "The idea of this approach is to standardize the graph of $log(W_{k})$ by comparing it with its expectation under an appropriate null reference distribution of the data. The expectation of $W_{k}$ is approximately\n", + "\\begin{equation}\n", + "log(pn/12) - (2/p)log(k) + constant\n", + "\\end{equation}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install gap-stat" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Import the OptimalK module for calculating the gap statistic." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from gap_statistic.optimalK import OptimalK" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Calculate the gap statistic for various values of $k$ using parallelization." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "optimalK = OptimalK(n_jobs=8, parallel_backend=\"joblib\")\n", + "n_clusters = optimalK(scaled_credit, cluster_array=np.arange(1, 10))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "gap_result = optimalK.gap_df\n", + "gap_result.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If we plot the resulting gap values, we observe a sharp increase up to the point where the gap value reaches its peak. In this case this corresponds to 5 clusters. The analysis suggests that this is the optimal number for clustering." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plt.plot(gap_result.n_clusters, gap_result.gap_value)\n", + "min_ylim, max_ylim = plt.ylim()\n", + "plt.axhline(np.max(gap_result.gap_value), color=\"r\", linestyle=\"dashed\", linewidth=2)\n", + "plt.title(\"Gap Analysis\")\n", + "plt.xlabel(\"Number of Clusters\")\n", + "plt.ylabel(\"Gap Value\")\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## K-Means Clustering \n", + "Due to their differing approaches, the three analyses above all provide a different value for the optimal number of clusters. It will require some trial and error to determine which is indeed the optimal number of clusters.\n", + "\n", + "For example, let us proceed on the basis that the optimal number of clusters is two (as suggested by the *Silhouette Scores*).\n", + "\n", + "We perform K-Means clustering to separate our observations into two." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "kmeans = KMeans(n_clusters=2)\n", + "clusters = kmeans.fit_predict(scaled_credit)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Finally, we generate some plots to visualize the clusters in two dimensions. 
The plots show the observations (with color indicating the assigned cluster). Black crosses are used to show the position of the two centroids.\n", + "\n", + "The first plot shows the relationship between the *age* and *credit* features. Here we can see that *age* is the more dispersed feature, with the centroids located vertically inline.\n", + "\n", + "The second plot considers two continuous features: *credit* and *duration*. Here we observe two clearly separated clusters. This suggests that the *duration* feature is more volatile when compared with the *credit* feature.\n", + "\n", + "Finally, the third plot examines the relationship between *age* and *duration*. It turns out that there are many overlapping observations across these two features." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plt.figure(figsize=(10, 12))\n", + "plt.subplot(311)\n", + "plt.scatter(scaled_credit[:, 2], scaled_credit[:, 1], c=kmeans.labels_, cmap=\"viridis\")\n", + "plt.scatter(\n", + " kmeans.cluster_centers_[:, 2],\n", + " kmeans.cluster_centers_[:, 1],\n", + " s=80,\n", + " marker=\"x\",\n", + " color=\"k\",\n", + ")\n", + "plt.title(\"Age vs Credit\")\n", + "plt.subplot(312)\n", + "plt.scatter(scaled_credit[:, 1], scaled_credit[:, 0], c=kmeans.labels_, cmap=\"viridis\")\n", + "plt.scatter(\n", + " kmeans.cluster_centers_[:, 1],\n", + " kmeans.cluster_centers_[:, 0],\n", + " s=80,\n", + " marker=\"x\",\n", + " color=\"k\",\n", + ")\n", + "plt.title(\"Credit vs Duration\")\n", + "plt.subplot(313)\n", + "plt.scatter(scaled_credit[:, 2], scaled_credit[:, 0], c=kmeans.labels_, cmap=\"viridis\")\n", + "plt.scatter(\n", + " kmeans.cluster_centers_[:, 2],\n", + " kmeans.cluster_centers_[:, 0],\n", + " s=120,\n", + " marker=\"x\",\n", + " color=\"k\",\n", + ")\n", + "plt.title(\"Age vs Duration\")\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Conclusion \n", + "In this notebook we have discussed why risk bucketing is necessary, and considered three different approaches to estimating the optimal number of risk buckets.\n", + "\n", + "Having estimated the optimal number of risk buckets, we made use of K-Means clustering to split our observations between the target number of risk buckets.\n", + "\n", + "The follow-on activity is to create models for estimating default risk for each risk bucket, with each model trained separately using the data corresponding to the corresponding risk bucket." + ] + } + ], + "metadata": { + "instance_type": "ml.t3.medium", + "kernelspec": { + "display_name": "Python 3 (Data Science)", + "language": "python", + "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:eu-west-1:470317259841:image/datascience-1.0" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.10" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/use-cases/index.rst b/use-cases/index.rst index 9ecbaea207..a0732c2aa8 100644 --- a/use-cases/index.rst +++ b/use-cases/index.rst @@ -46,3 +46,12 @@ Pipelines with NLP for Product Rating Prediction :maxdepth: 1 product_ratings_with_pipelines/pipelines_product_ratings + + +Credit Risk +----------- + +.. 
toctree:: + :maxdepth: 1 + + credit_risk/risk_bucketing \ No newline at end of file From 2e8c261f972bdbc9f108571513fddbbd42f84e5f Mon Sep 17 00:00:00 2001 From: rsgrewal-aws <102243526+rsgrewal-aws@users.noreply.github.com> Date: Fri, 29 Jul 2022 11:35:41 -0700 Subject: [PATCH 2/5] Update to add the Docker build files (#3508) * Add CatBoost MME BYOC example * formatted * Resolving comment # 1 and 2 * Resolving comment # 1 and 2 * Resolving comment # 4 * Resolving clean up comment * Added comments about CatBoost and usage for MME * Reformatted the jupyter file * Added the container with the relevant py files * Added formatting using Black. Also fixed the comments from the Jupyter file * Added formatting using Black. Also fixed the comments from the Jupyter file * Added formatting using Black. Also fixed the comments from the Jupyter file Co-authored-by: marckarp Co-authored-by: atqy <95724753+atqy@users.noreply.github.com> --- .../multi_model_catboost/container/Dockerfile | 47 ++++++++ .../container/__init__.py | 0 .../container/dockerd-entrypoint.py | 33 ++++++ .../container/model_handler.py | 108 ++++++++++++++++++ .../multi_model_catboost.ipynb | 6 +- 5 files changed, 191 insertions(+), 3 deletions(-) create mode 100644 advanced_functionality/multi_model_catboost/container/Dockerfile create mode 100644 advanced_functionality/multi_model_catboost/container/__init__.py create mode 100644 advanced_functionality/multi_model_catboost/container/dockerd-entrypoint.py create mode 100644 advanced_functionality/multi_model_catboost/container/model_handler.py diff --git a/advanced_functionality/multi_model_catboost/container/Dockerfile b/advanced_functionality/multi_model_catboost/container/Dockerfile new file mode 100644 index 0000000000..089390df06 --- /dev/null +++ b/advanced_functionality/multi_model_catboost/container/Dockerfile @@ -0,0 +1,47 @@ +FROM ubuntu:18.04 + +# Set a docker label to advertise multi-model support on the container +LABEL com.amazonaws.sagemaker.capabilities.multi-models=true +# Set a docker label to enable container to use SAGEMAKER_BIND_TO_PORT environment variable if present +LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true + +# Install necessary dependencies for MMS and SageMaker Inference Toolkit +RUN apt-get update && \ + apt-get -y install --no-install-recommends \ + build-essential \ + ca-certificates \ + openjdk-8-jdk-headless \ + python3-dev \ + curl \ + python3 \ + vim \ + && rm -rf /var/lib/apt/lists/* \ + && curl -O https://bootstrap.pypa.io/pip/3.7/get-pip.py \ + && python3 get-pip.py + +RUN update-alternatives --install /usr/bin/python python /usr/bin/python3 1 +RUN update-alternatives --install /usr/local/bin/pip pip /usr/local/bin/pip3 1 + +# Install MXNet, MMS, and SageMaker Inference Toolkit to set up MMS +RUN pip3 --no-cache-dir install multi-model-server \ + sagemaker-inference \ + retrying \ + catboost \ + pandas + + +# Copy entrypoint script to the image +COPY dockerd-entrypoint.py /usr/local/bin/dockerd-entrypoint.py +RUN chmod +x /usr/local/bin/dockerd-entrypoint.py +RUN echo "vmargs=-XX:-UseContainerSupport" >> /usr/local/lib/python3.6/dist-packages/sagemaker_inference/etc/mme-mms.properties + +RUN mkdir -p /home/model-server/ + +# Copy the default custom service file to handle incoming data and inference requests +COPY model_handler.py /home/model-server/model_handler.py + +# Define an entrypoint script for the docker image +ENTRYPOINT ["python", "/usr/local/bin/dockerd-entrypoint.py"] + +# Define command to be passed to the 
entrypoint +CMD ["serve"] diff --git a/advanced_functionality/multi_model_catboost/container/__init__.py b/advanced_functionality/multi_model_catboost/container/__init__.py new file mode 100644 index 0000000000..e69de29bb2 diff --git a/advanced_functionality/multi_model_catboost/container/dockerd-entrypoint.py b/advanced_functionality/multi_model_catboost/container/dockerd-entrypoint.py new file mode 100644 index 0000000000..9082f92be9 --- /dev/null +++ b/advanced_functionality/multi_model_catboost/container/dockerd-entrypoint.py @@ -0,0 +1,33 @@ +import subprocess +import sys +import shlex +import os +from retrying import retry +from subprocess import CalledProcessError +from sagemaker_inference import model_server + + +def _retry_if_error(exception): + return isinstance(exception, CalledProcessError or OSError) + + +@retry(stop_max_delay=1000 * 50, retry_on_exception=_retry_if_error) +def _start_mms(): + # by default the number of workers per model is 1, but we can configure it through the + # environment variable below if desired. + os.environ["MMS_DEFAULT_WORKERS_PER_MODEL"] = "2" + os.environ["OMP_NUM_THREADS"] = "8" + model_server.start_model_server(handler_service="/home/model-server/model_handler.py:handle") + + +def main(): + if sys.argv[1] == "serve": + _start_mms() + else: + subprocess.check_call(shlex.split(" ".join(sys.argv[1:]))) + + # prevent docker exit + subprocess.call(["tail", "-f", "/dev/null"]) + + +main() diff --git a/advanced_functionality/multi_model_catboost/container/model_handler.py b/advanced_functionality/multi_model_catboost/container/model_handler.py new file mode 100644 index 0000000000..13e8429502 --- /dev/null +++ b/advanced_functionality/multi_model_catboost/container/model_handler.py @@ -0,0 +1,108 @@ +import os +import json +import sys +import logging +import time +import catboost +from catboost import CatBoostClassifier +import pandas as pd +import io + +logger = logging.getLogger(__name__) + +import os + + +class ModelHandler(object): + def __init__(self): + start = time.time() + self.initialized = False + print(f" perf __init__ {(time.time() - start) * 1000} ms") + + def initialize(self, ctx): + start = time.time() + self.device = "cpu" + + properties = ctx.system_properties + self.device = "cpu" + model_dir = properties.get("model_dir") + + print("model_dir {}".format(model_dir)) + print(os.system("ls {}".format(model_dir))) + + model_file = CatBoostClassifier() + + onlyfiles = [ + f + for f in os.listdir(model_dir) + if os.path.isfile(os.path.join(model_dir, f)) and f.endswith(".bin") + ] + print( + f"Modelhandler:model_file location::{model_dir}:: files:bin:={onlyfiles} :: going to load the first one::" + ) + self.model = model_file = model_file.load_model(onlyfiles[0]) + + self.initialized = True + print(f" perf initialize {(time.time() - start) * 1000} ms") + + def preprocess(self, input_data): + """ + Pre-process the request + """ + + start = time.time() + print(type(input_data)) + output = input_data + print(f" perf preprocess {(time.time() - start) * 1000} ms") + return output + + def inference(self, inputs): + """ + Make the inference request against the laoded model + """ + start = time.time() + + predictions = self.model.predict_proba(inputs) + print(f" perf inference {(time.time() - start) * 1000} ms") + return predictions + + def postprocess(self, inference_output): + """ + Post-process the request + """ + + start = time.time() + inference_output = dict(enumerate(inference_output.flatten(), 0)) + print(f" perf postprocess {(time.time() - 
start) * 1000} ms") + return [inference_output] + + def handle(self, data, context): + """ + Call pre-process, inference and post-process functions + :param data: input data + :param context: mms context + """ + start = time.time() + + input_data = data[0]["body"].decode() + df = pd.read_csv(io.StringIO(input_data)) + + model_input = self.preprocess(df) + model_output = self.inference(model_input) + print(f" perf handle in {(time.time() - start) * 1000} ms") + return self.postprocess(model_output) + + +_service = ModelHandler() + + +def handle(data, context): + start = time.time() + if not _service.initialized: + _service.initialize(context) + + if data is None: + return None + + print(f" perf handle_out {(time.time() - start) * 1000} ms") + return _service.handle(data, context) diff --git a/advanced_functionality/multi_model_catboost/multi_model_catboost.ipynb b/advanced_functionality/multi_model_catboost/multi_model_catboost.ipynb index 8582be29b1..cc9c7f91a5 100644 --- a/advanced_functionality/multi_model_catboost/multi_model_catboost.ipynb +++ b/advanced_functionality/multi_model_catboost/multi_model_catboost.ipynb @@ -9,7 +9,7 @@ "\n", "This example notebook showcases how to use a custom container to host multiple CatBoost models on a SageMaker Multi Model Endpoint. The model this notebook deploys is taken from this [CatBoost tutorial](https://github.com/catboost/tutorials/blob/master/python_tutorial_with_tasks.ipynb). \n", "\n", - "We are using Catboost model as an example to demostrate deployment and serving using MultiModel Endpoint and show case the capability. This notebook can be extended to any framework.\n", + "We are using this framework as an example to demonstrate deployment and serving using MultiModel Endpoint and showcase the capability. This notebook can be extended to any framework.\n", "\n", "Catboost is gaining in popularity and is not yet supported as a framework for SageMaker MultiModelEndpoint. Further this example serves to demostrate how to bring your own container to a MultiModelEndpoint\n", "\n", @@ -193,7 +193,7 @@ "```\n", "\n", "- `dockerd-entrypoint.py` is the entry point script that will start the multi model server.\n", - "- `Dockerfile` contains the container definition that will be used to assemble the image. This include the packages that need to be installed.\n", + "- `Dockerfile` contains the container definition that will be used to assemble the image. 
This includes the packages that need to be installed.\n",
     "- `model_handler.py` is the script that will contain the logic to load up the model and make inference.\n",
     "\n",
     "Take a look through the files to see if there is any customization that you would like to do.\n",
@@ -469,7 +469,7 @@
    "metadata": {},
    "source": [
     "### Invoke just one of models 1000 times \n",
-    "Since the moels will be in memory and loaded, these invocations will not have any latency \n"
+    "Since the models will be in memory and loaded, these invocations will not have any latency \n"
    ]
   },
   {

From b31be036a8eb0966ee7467e1e70158c8ceb81ac1 Mon Sep 17 00:00:00 2001
From: Vera Yu <88744664+verayu43@users.noreply.github.com>
Date: Thu, 4 Aug 2022 11:46:26 -0700
Subject: [PATCH 3/5] Add UpdateFeatureGroup related APIs in sample notebook (#3515)

---
 ...re_store_introduction_customer_updated.csv |   5 +
 .../feature_store_introduction.ipynb          | 147 ++++++++++++++++++
 2 files changed, 152 insertions(+)
 create mode 100644 sagemaker-featurestore/data/feature_store_introduction_customer_updated.csv

diff --git a/sagemaker-featurestore/data/feature_store_introduction_customer_updated.csv b/sagemaker-featurestore/data/feature_store_introduction_customer_updated.csv
new file mode 100644
index 0000000000..ab266bec28
--- /dev/null
+++ b/sagemaker-featurestore/data/feature_store_introduction_customer_updated.csv
@@ -0,0 +1,5 @@
+customer_id,city_code,state_code,country_code,email,name
+573291,1,49,2,john.lee@gmail.com,John Lee
+109382,2,40,2,olivequil@gmail.com,Olive Quil
+828400,3,31,2,liz.knee@gmail.com,Liz Knee
+124013,4,5,2,eileenbook@gmail.com,Eileen Book
\ No newline at end of file
diff --git a/sagemaker-featurestore/feature_store_introduction.ipynb b/sagemaker-featurestore/feature_store_introduction.ipynb
index 77db4bf8a2..1f0a2a8e1b 100644
--- a/sagemaker-featurestore/feature_store_introduction.ipynb
+++ b/sagemaker-featurestore/feature_store_introduction.ipynb
@@ -473,6 +473,152 @@
     "all_records"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Add features to a feature group\n",
+    "\n",
+    "If we want to update a FeatureGroup after its data has already been ingested, we can use the `UpdateFeatureGroup` API and then re-ingest data using the updated dataset."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from sagemaker.feature_store.feature_definition import StringFeatureDefinition\n",
+    "\n",
+    "customers_feature_group.update(\n",
+    "    feature_additions=[StringFeatureDefinition(\"email\"), StringFeatureDefinition(\"name\")]\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Verify whether the FeatureGroup has been updated successfully."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def check_last_update_status(feature_group):\n",
+    "    last_update_status = feature_group.describe().get(\"LastUpdateStatus\")[\"Status\"]\n",
+    "    while last_update_status == \"InProgress\":\n",
+    "        print(\"Waiting for FeatureGroup to be updated\")\n",
+    "        time.sleep(5)\n",
+    "        last_update_status = feature_group.describe().get(\"LastUpdateStatus\")[\"Status\"]\n",
+    "    if last_update_status == \"Successful\":\n",
+    "        print(f\"FeatureGroup {feature_group.name} successfully updated.\")\n",
+    "    else:\n",
+    "        print(\n",
+    "            f\"FeatureGroup {feature_group.name} update failed. 
The LastUpdateStatus is\"\n", + " + str(last_update_status)\n", + " )\n", + "\n", + "\n", + "check_last_update_status(customers_feature_group)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Inspect the new dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "customer_data_updated = pd.read_csv(\"data/feature_store_introduction_customer_updated.csv\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "customer_data_updated.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Append `EventTime` feature to your data frame again." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "customer_data_updated[\"EventTime\"] = pd.Series(\n", + " [current_time_sec] * len(customer_data), dtype=\"float64\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Ingest the new dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "customers_feature_group.ingest(data_frame=customer_data_updated, max_workers=3, wait=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Use `batch_get_record` again to check that all updated data has been ingested into `customers_feature_group` by providing customer IDs." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "updated_customers_records = sagemaker_session.boto_session.client(\n", + " \"sagemaker-featurestore-runtime\", region_name=region\n", + ").batch_get_record(\n", + " Identifiers=[\n", + " {\n", + " \"FeatureGroupName\": customers_feature_group_name,\n", + " \"RecordIdentifiersValueAsString\": [\"573291\", \"109382\", \"828400\", \"124013\"],\n", + " }\n", + " ]\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "updated_customers_records" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -530,6 +676,7 @@ "* `delete()`\n", "* `create()`\n", "* `load_feature_definitions()`\n", + "* `update()`\n", "* `update_feature_metadata()`\n", "* `describe_feature_metadata()`\n", "\n", From 554a20c7a5973f4de4f52c246061bc3fc2dc3484 Mon Sep 17 00:00:00 2001 From: Julia Kroll <75504951+jkroll-aws@users.noreply.github.com> Date: Thu, 4 Aug 2022 14:00:00 -0500 Subject: [PATCH 4/5] Switch from tox to black-nb in linting section of contributing guide (#3527) * Revise linter section to switch from tox to black-nb * Update PR template from tox to black-nb --- .github/PULL_REQUEST_TEMPLATE.md | 2 +- CONTRIBUTING.md | 11 ++++++----- 2 files changed, 7 insertions(+), 6 deletions(-) diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md index c7d1c2ece8..fbf4feee60 100644 --- a/.github/PULL_REQUEST_TEMPLATE.md +++ b/.github/PULL_REQUEST_TEMPLATE.md @@ -11,6 +11,6 @@ _Put an `x` in the boxes that apply. 
You can also fill these out after creating - [ ] I have read the [CONTRIBUTING](https://github.com/aws/amazon-sagemaker-examples/blob/master/CONTRIBUTING.md) doc and adhered to the example notebook best practices - [ ] I have updated any necessary documentation, including [READMEs](https://github.com/aws/amazon-sagemaker-examples/blob/master/README.md) - [ ] I have tested my notebook(s) and ensured it runs end-to-end -- [ ] I have linted my notebook(s) and code using `tox -e black-format,black-nb-format` +- [ ] I have linted my notebook(s) and code using `black-nb -l 100 {path}/{notebook-name}.ipynb` By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 9c096ab564..542a6ce35d 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -36,12 +36,13 @@ Before sending us a pull request, please ensure that: 1. Clone your fork of the repository: `git clone https://github.com//amazon-sagemaker-examples` where `` is your github username. -### Run the Linters +### Run the Linter -1. Install tox using `pip install tox` -1. cd into the amazon-sagemaker-examples folder: `cd amazon-sagemaker-examples` or `cd /environment/amazon-sagemaker-examples` -1. Run the following tox command and verify that all linters pass: `tox -e black-check,black-nb-check` -1. If the linters did not pass, run the following tox command to fix the issues: `tox -e black-format,black-nb-format` +Apply Python code formatting to Jupyter notebook files using [black-nb](https://pypi.org/project/black-nb/). + +1. Install black-nb using `pip install black-nb` +1. Run the following black-nb command on each of your ipynb notebook files and verify that the linter passes: `black-nb -l 100 {path}/{notebook-name}.ipynb` +1. Some notebook features such as `%` bash commands or `%%` cell magic cause black-nb to fail. As long as you run the above command to format as much as possible, that is sufficient, even if the check fails ### Test Your Notebook End-to-End From bba768b9ddc5b40851f37378d2884d5cae547ac2 Mon Sep 17 00:00:00 2001 From: atqy <95724753+atqy@users.noreply.github.com> Date: Fri, 5 Aug 2022 13:04:08 -0700 Subject: [PATCH 5/5] fix links and incorrectly used code blocks (#3528) --- .../hpo_huggingface_text_classification_20_newsgroups.ipynb | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/hyperparameter_tuning/huggingface_multiclass_text_classification_20_newsgroups/hpo_huggingface_text_classification_20_newsgroups.ipynb b/hyperparameter_tuning/huggingface_multiclass_text_classification_20_newsgroups/hpo_huggingface_text_classification_20_newsgroups.ipynb index 0029cfa3d0..db4657f5bb 100644 --- a/hyperparameter_tuning/huggingface_multiclass_text_classification_20_newsgroups/hpo_huggingface_text_classification_20_newsgroups.ipynb +++ b/hyperparameter_tuning/huggingface_multiclass_text_classification_20_newsgroups/hpo_huggingface_text_classification_20_newsgroups.ipynb @@ -14,7 +14,7 @@ "Text Classification can be used to solve various use-cases like sentiment analysis, spam detection, hashtag prediction etc. \n", "\n", "\n", - "This notebook demonstrates the use of the [HuggingFace `transformers` library](https://huggingface.co/transformers/) together with a custom Amazon sagemaker-sdk extension to fine-tune a pre-trained transformer on multi class text classification. 
In particular, the pre-trained model will be fine-tuned using the [`20 newsgroups dataset`](http://qwone.com/~jason/20Newsgroups/). To get started, we need to set up the environment with a few prerequisite steps, for permissions, configurations, and so on." + "This notebook demonstrates the use of the [HuggingFace Transformers library](https://huggingface.co/transformers/) together with a custom Amazon sagemaker-sdk extension to fine-tune a pre-trained transformer on multi class text classification. In particular, the pre-trained model will be fine-tuned using the [20 Newsgroups dataset](http://qwone.com/~jason/20Newsgroups/). To get started, we need to set up the environment with a few prerequisite steps, for permissions, configurations, and so on." ] }, { @@ -107,7 +107,7 @@ "\n", "Now we'll download a dataset from the web on which we want to train the text classification model.\n", "\n", - "In this example, let us train the text classification model on the [`20 newsgroups dataset`](http://qwone.com/~jason/20Newsgroups/). The `20 newsgroups dataset` consists of 20000 messages taken from 20 Usenet newsgroups." + "In this example, let us train the text classification model on the [20 Newsgroups dataset](http://qwone.com/~jason/20Newsgroups/). The 20 Newsgroups dataset consists of 20000 messages taken from 20 Usenet newsgroups." ] }, { @@ -1040,7 +1040,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Now, let's define the SageMaker `HuggingFace` estimator with resource configurations and hyperparameters to train Text Classification on `20 newsgroups` dataset, running on a `p3.2xlarge` instance." + "Now, let's define the SageMaker `HuggingFace` estimator with resource configurations and hyperparameters to train Text Classification on 20 Newsgroups dataset, running on a `p3.2xlarge` instance." ] }, {