diff --git a/introduction_to_applying_machine_learning/README.md b/introduction_to_applying_machine_learning/README.md index 9b40687dc6..45416e7ddb 100644 --- a/introduction_to_applying_machine_learning/README.md +++ b/introduction_to_applying_machine_learning/README.md @@ -5,6 +5,7 @@ These examples provide a gentle introduction to machine learning concepts as they are applied in practical use cases across a variety of sectors. - [Predicting Customer Churn](xgboost_customer_churn) uses customer interaction and service usage data to find those most likely to churn, and then walks through the cost/benefit trade-offs of providing retention incentives. This uses Amazon SageMaker's implementation of [XGBoost](https://github.com/dmlc/xgboost) to create a highly predictive model. +- [Predicting Customer Churn](lightgbm_catboost_tabtransformer_autogluon_churn) uses Amazon SageMaker's implementation of [LightGBM](https://lightgbm.readthedocs.io/en/latest/), [CatBoost](https://catboost.ai/), [TabTransformer](https://arxiv.org/abs/2012.06678), and [AutoGluon-Tabular](https://auto.gluon.ai/stable/index.html) with [SageMaker Automatic Model Tuning](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html) to create four predictive models on customer churn dataset, and evaluate their performance on the same test data. - [Cancer Prediction](breast_cancer_prediction) predicts Breast Cancer based on features derived from images, using SageMaker's Linear Learner. - [Ensembling](ensemble_modeling) predicts income using two Amazon SageMaker models to show the advantages in ensembling. - [Video Game Sales](video_game_sales) develops a binary prediction model for the success of video games based on review scores. diff --git a/introduction_to_applying_machine_learning/lightgbm_catboost_tabtransformer_autogluon_churn/churn-prediction-lightgbm-catboost-tabtransformer-autogluon.ipynb b/introduction_to_applying_machine_learning/lightgbm_catboost_tabtransformer_autogluon_churn/churn-prediction-lightgbm-catboost-tabtransformer-autogluon.ipynb new file mode 100644 index 0000000000..954455049b --- /dev/null +++ b/introduction_to_applying_machine_learning/lightgbm_catboost_tabtransformer_autogluon_churn/churn-prediction-lightgbm-catboost-tabtransformer-autogluon.ipynb @@ -0,0 +1,1857 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "7ef64164", + "metadata": {}, + "source": [ + "# Customer Churn Prediction using Amazon SageMaker LightGBM, CatBoost, TabTransformer, and AutoGluon-Tabular with SageMaker AMT (Automatic Model Tuning)" + ] + }, + { + "cell_type": "markdown", + "id": "0ca3e116", + "metadata": {}, + "source": [ + "---\n", + "Losing customers is costly for any business. Identifying unhappy customers early on gives you a chance to offer them incentives to stay. This notebook describes using machine learning (ML) for the automated identification of unhappy customers, also known as customer churn prediction. 
ML models rarely give perfect predictions though, so this notebook is also about how to incorporate the relative costs of prediction mistakes when determining the financial outcome of using ML.\n", + "\n", + "This notebook demonstrates the use of Amazon SageMaker’s implementation of the [LightGBM](https://lightgbm.readthedocs.io/en/latest/), [CatBoost](https://catboost.ai/en/docs/), [TabTransformer](https://arxiv.org/abs/2012.06678), and [AutoGluon-Tabular](https://auto.gluon.ai/stable/tutorials/tabular_prediction/index.html) algorithm to train and host a customer churn prediction model with [SageMaker AMT](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html)(Automatic Model tuning).\n", + "\n", + "In this notebook, we demonstrate two use cases for each algorithm:\n", + "\n", + "* How to train a tabular model on the customer churn dataset with AMT.\n", + "* How to use the trained tabular model to perform inference, i.e., classifying new samples.\n", + "\n", + "In the end, we compare the performance of four models trained with AMT on the same test data.\n", + "\n", + "Note: This notebook was tested in Amazon SageMaker Studio on ml.t3.medium instance with Python 3 (Data Science) kernel.\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "id": "5291f501", + "metadata": {}, + "source": [ + "1. [Set Up](#1.-Set-Up)\n", + "2. [Data Preparation and Visualization](#2.-Data-Preparation-and-Visualization)\n", + "3. [Train A LightGBM Model with AMT](#3.-Train-A-LightGBM-Model-with-AMT)\n", + " * [Retrieve Training Artifacts](#3.1.-Retrieve-Training-Artifacts)\n", + " * [Set Training Parameters](#3.2.-Set-Training-Parameters)\n", + " * [Train with Automatic Model Tuning](#3.3.-Train-with-Automatic-Model-Tuning) \n", + " * [Start Training](#3.4.-Start-Training)\n", + " * [Deploy and Run Inference on the Trained Tabular Model](#3.5.-Deploy-and-Run-Inference-on-the-Trained-Tabular-Model)\n", + " * [Evaluate the Prediction Results Returned from the Endpoint](#3.6.-Evaluate-the-Prediction-Results-Returned-from-the-Endpoint)\n", + "4. [Train A CatBoost model with AMT](#4.-Train-A-CatBoost-model-with-AMT)\n", + " * [Train with Automatic Model Tuning](#4.1.-Train-with-Automatic-Model-Tuning) \n", + " * [Deploy and Run Inference on the Trained Tabular Model](#4.2.-Deploy-and-Run-Inference-on-the-Trained-Tabular-Model)\n", + "5. [Train A TabTransformer model with AMT](#5.-Train-A-TabTransformer-model-with-AMT)\n", + " * [Train with Automatic Model Tuning](#5.1.-Train-with-Automatic-Model-Tuning) \n", + " * [Deploy and Run Inference on the Trained Tabular Model](#5.2.-Deploy-and-Run-Inference-on-the-Trained-Tabular-Model)\n", + "6. [Train An AutoGluon-Tabular model](#6.-Train-An-AutoGluon-Tabular-model)\n", + " * [Train with AutoGluon-Tabular model](#6.1.-Train-with-AutoGluon-Tabular-model) \n", + " * [Deploy and Run Inference on the Trained Tabular Model](#6.2.-Deploy-and-Run-Inference-on-the-Trained-Tabular-Model)\n", + "7. [Compare Prediction Results of Four Trained Models on the Same Test Data](#7.-Compare-Prediction-Results-of-Four-Trained-Models-on-the-Same-Test-Data)" + ] + }, + { + "cell_type": "markdown", + "id": "62af3c2e", + "metadata": {}, + "source": [ + "## 1. Set Up\n", + "\n", + "---\n", + "Before executing the notebook, there are some initial steps required for setup. 
This notebook requires latest version of sagemaker and ipywidgets.\n", + "\n", + "---" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "def1e09f", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install sagemaker ipywidgets --upgrade --quiet" + ] + }, + { + "cell_type": "markdown", + "id": "26a8ccde", + "metadata": {}, + "source": [ + "\n", + "---\n", + "To train and host on Amazon SageMaker, we need to setup and authenticate the use of AWS services. Here, we use the execution role associated with the current notebook instance as the AWS account role with SageMaker access. It has necessary permissions, including access to your data in S3.\n", + "\n", + "---" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7516a221", + "metadata": {}, + "outputs": [], + "source": [ + "import sagemaker, boto3, json\n", + "from sagemaker import get_execution_role\n", + "\n", + "aws_role = get_execution_role()\n", + "aws_region = boto3.Session().region_name\n", + "sess = sagemaker.Session()\n", + "\n", + "bucket = sess.default_bucket()\n", + "prefix = \"sagemaker/DEMO-churn\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6087cdb0", + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "import io\n", + "import os\n", + "import sys\n", + "import time\n", + "import json\n", + "from IPython.display import display\n", + "from time import strftime, gmtime\n", + "from sagemaker.inputs import TrainingInput\n", + "from sagemaker.serializers import CSVSerializer\n", + "from sklearn import preprocessing" + ] + }, + { + "cell_type": "markdown", + "id": "efe1573f", + "metadata": {}, + "source": [ + "## 2. Data Preparation and Visualization\n", + "\n", + "Mobile operators have historical records on which customers ultimately ended up churning and which continued using the service. We can use this historical information to construct an ML model of one mobile operator’s churn using a process called training. After training the model, we can pass the profile information of an arbitrary customer (the same profile information that we used to train the model) to the model, and have the model predict whether this customer is going to churn. Of course, we expect the model to make mistakes. After all, predicting the future is tricky business! But we’ll learn how to deal with prediction errors.\n", + "\n", + "The dataset we use is publicly available and was mentioned in the book [Discovering Knowledge in Data](https://www.amazon.com/dp/0470908742/) by Daniel T. Larose. It is attributed by the author to the University of California Irvine Repository of Machine Learning Datasets. 
Let’s download and read that dataset in now:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "985aeaf4", + "metadata": {}, + "outputs": [], + "source": [ + "s3 = boto3.client(\"s3\")\n", + "s3.download_file(f\"sagemaker-sample-files\", \"datasets/tabular/synthetic/churn.txt\", \"churn.txt\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "47abdc80", + "metadata": {}, + "outputs": [], + "source": [ + "churn = pd.read_csv(\"./churn.txt\")\n", + "pd.set_option(\"display.max_columns\", 500)\n", + "churn.head(5)" + ] + }, + { + "cell_type": "markdown", + "id": "f41f0f8a", + "metadata": {}, + "source": [ + "By modern standards, it’s a relatively small dataset, with only 5,000 records, where each record uses 21 attributes to describe the profile of a customer of an unknown US mobile operator. The attributes are:\n", + "\n", + "`State`: the US state in which the customer resides, indicated by a two-letter abbreviation; for example, OH or NJ\n", + "\n", + "`Account Length`: the number of days that this account has been active\n", + "\n", + "`Area Code`: the three-digit area code of the corresponding customer’s phone number\n", + "\n", + "`Phone`: the remaining seven-digit phone number\n", + "\n", + "`Int’l Plan`: whether the customer has an international calling plan: yes/no\n", + "\n", + "`VMail Plan`: whether the customer has a voice mail feature: yes/no\n", + "\n", + "`VMail Message`: the average number of voice mail messages per month\n", + "\n", + "`Day Mins`: the total number of calling minutes used during the day\n", + "\n", + "`Day Calls`: the total number of calls placed during the day\n", + "\n", + "`Day Charge`: the billed cost of daytime calls\n", + "\n", + "`Eve Mins`, `Eve Calls`, `Eve Charge`: the billed cost for calls placed during the evening\n", + "\n", + "`Night Mins`, `Night Calls`, `Night Charge`: the billed cost for calls placed during nighttime\n", + "\n", + "`Intl Mins`, `Intl Calls`, `Intl Charge`: the billed cost for international calls\n", + "\n", + "`CustServ Calls`: the number of calls placed to Customer Service\n", + "\n", + "`Churn?`: whether the customer left the service: true/false\n", + "\n", + "The last attribute, `Churn?`, is known as the target attribute: the attribute that we want the ML model to predict. Because the target attribute is binary, our model will be performing binary prediction, also known as binary classification.\n", + "\n", + "Let’s begin exploring the data:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ddb61970", + "metadata": {}, + "outputs": [], + "source": [ + "# Histograms for each numeric features\n", + "display(churn.describe())\n", + "%matplotlib inline\n", + "hist = churn.hist(bins=30, sharey=True, figsize=(10, 10))" + ] + }, + { + "cell_type": "markdown", + "id": "a2339e7e", + "metadata": {}, + "source": [ + "We can see immediately that: - `State` appears to be quite evenly distributed. - `Phone` takes on too many unique values to be of any practical use. It’s possible that parsing out the prefix could have some value, but without more context on how these are allocated, we should avoid using it. - Most of the numeric features are surprisingly nicely distributed, with many showing bell-like gaussianity. `VMail Message` is a notable exception." 
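+    ,
+    "\n",
+    "As a quick, optional check on these observations, the per-column cardinality and the raw churn rate can be inspected directly. This is a small sketch that only uses the `churn` dataframe loaded above:\n",
+    "\n",
+    "```python\n",
+    "# Number of unique values per column: `Phone` is close to unique per row,\n",
+    "# which is why it is dropped in the next cell.\n",
+    "print(churn.nunique().sort_values(ascending=False).head(10))\n",
+    "\n",
+    "# Class balance of the raw target attribute before it is converted to 0/1.\n",
+    "print(churn[\"Churn?\"].value_counts(normalize=True))\n",
+    "```"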
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cfb7d029", + "metadata": {}, + "outputs": [], + "source": [ + "churn = churn.drop(\"Phone\", axis=1)\n", + "churn[\"Area Code\"] = churn[\"Area Code\"].astype(object)" + ] + }, + { + "cell_type": "markdown", + "id": "7100fb95", + "metadata": {}, + "source": [ + "Next let’s look at the relationship between each of the features and our target variable." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c5f5b300", + "metadata": {}, + "outputs": [], + "source": [ + "for column in churn.select_dtypes(include=[\"object\"]).columns:\n", + " if column != \"Churn?\":\n", + " display(pd.crosstab(index=churn[column], columns=churn[\"Churn?\"], normalize=\"columns\"))\n", + "\n", + "for column in churn.select_dtypes(exclude=[\"object\"]).columns:\n", + " print(column)\n", + " hist = churn[[column, \"Churn?\"]].hist(by=\"Churn?\", bins=30)\n", + " plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "ead0fead", + "metadata": {}, + "source": [ + "We convert the target attribute to binary value and move it to the first column of the dataset to meet requirements of SageMaker built-in tabular algorithms (For an example, see [SageMaker LightGBM documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/lightgbm.html))." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "df47dff8", + "metadata": {}, + "outputs": [], + "source": [ + "churn[\"target\"] = churn[\"Churn?\"].map({\"True.\": 1, \"False.\": 0})\n", + "churn.drop([\"Churn?\"], axis=1, inplace=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9769380f", + "metadata": {}, + "outputs": [], + "source": [ + "churn = churn[[\"target\"] + churn.columns.tolist()[:-1]]" + ] + }, + { + "cell_type": "markdown", + "id": "076403fe", + "metadata": {}, + "source": [ + "We identify the column indexes of the categorical attribute, which is required by LightGBM, CatBoost, and TabTransformer algorithm (AutoGluon-Tabular has built-in feature engineering to identify the categorical attribute automatically, and thus does not require such input)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0421ab18", + "metadata": {}, + "outputs": [], + "source": [ + "cat_columns = [\n", + " \"State\",\n", + " \"Account Length\",\n", + " \"Area Code\",\n", + " \"Phone\",\n", + " \"Int'l Plan\",\n", + " \"VMail Plan\",\n", + " \"VMail Message\",\n", + " \"Day Calls\",\n", + " \"Eve Calls\",\n", + " \"Night Calls\",\n", + " \"Intl Calls\",\n", + " \"CustServ Calls\",\n", + "]\n", + "\n", + "cat_idx = []\n", + "for idx, col_name in enumerate(churn.columns.tolist()):\n", + " if col_name in cat_columns:\n", + " cat_idx.append(idx)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a865ba04", + "metadata": {}, + "outputs": [], + "source": [ + "with open(\"cat_idx.json\", \"w\") as outfile:\n", + " json.dump({\"cat_idx\": cat_idx}, outfile)" + ] + }, + { + "cell_type": "markdown", + "id": "4092e255", + "metadata": {}, + "source": [ + "[LightGBM official documentation](https://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html#categorical-feature-support) requires that all categorical features should be encoded as non-negative integers. We do it consistently for all the other algorithms." 
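+    ,
+    "\n",
+    "The next cell fits one `LabelEncoder` per categorical column in place. If you later need to encode fresh records outside this notebook (for example, before sending them to an endpoint), the fitted encoders have to be reused so that the integer codes stay consistent. A minimal sketch of a variant that also keeps and persists the encoders is shown below; run it instead of the next cell, not in addition to it. The `joblib` dependency, the `encoders` dict, and the `label_encoders.joblib` file name are illustrative assumptions:\n",
+    "\n",
+    "```python\n",
+    "import joblib\n",
+    "from sklearn import preprocessing\n",
+    "\n",
+    "# Same encoding as the next cell, but the fitted encoders are kept for reuse.\n",
+    "encoders = {}\n",
+    "for col_name in churn.columns:\n",
+    "    if col_name in cat_columns:\n",
+    "        le = preprocessing.LabelEncoder()\n",
+    "        churn[col_name] = le.fit_transform(churn[col_name])\n",
+    "        encoders[col_name] = le\n",
+    "\n",
+    "# Persist the encoders so the same mapping can be applied to new data later.\n",
+    "joblib.dump(encoders, \"label_encoders.joblib\")\n",
+    "```"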
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "740e6b02", + "metadata": {}, + "outputs": [], + "source": [ + "for idx, col_name in enumerate(churn.columns.tolist()):\n", + " if col_name in cat_columns:\n", + " le = preprocessing.LabelEncoder()\n", + " churn[col_name] = le.fit_transform(churn[col_name])" + ] + }, + { + "cell_type": "markdown", + "id": "0a11d76b", + "metadata": {}, + "source": [ + "We split the churn dataset into train, validation, and test set using stratified sampling. Validation set is used for early stopping and AMT. Test set is used for performance evaluations in the end. Next, we upload them into a S3 path for training.\n", + "\n", + "The structure of the S3 path for training should be structured as below. The `cat_idx.json` is categorical column indexes.\n", + "\n", + "-- `train`
\n", + "      -- `data.csv`
\n", + "-- `validation`
\n", + "      -- `data.csv` \n", + "-- `cat_idx.json`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fee4296f", + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.model_selection import train_test_split\n", + "\n", + "train, val_n_test = train_test_split(\n", + " churn, test_size=0.3, random_state=42, stratify=churn[\"target\"]\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "48080aca", + "metadata": {}, + "outputs": [], + "source": [ + "val, test = train_test_split(\n", + " val_n_test, test_size=0.3, random_state=42, stratify=val_n_test[\"target\"]\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1771b769", + "metadata": {}, + "outputs": [], + "source": [ + "train.to_csv(\"train.csv\", header=False, index=False)\n", + "val.to_csv(\"validation.csv\", header=False, index=False)\n", + "test.to_csv(\"test.csv\", header=False, index=False)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c26e7053", + "metadata": {}, + "outputs": [], + "source": [ + "boto3.Session().resource(\"s3\").Bucket(bucket).Object(\n", + " os.path.join(prefix, \"train/data.csv\")\n", + ").upload_file(\"train.csv\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c297dff2", + "metadata": {}, + "outputs": [], + "source": [ + "boto3.Session().resource(\"s3\").Bucket(bucket).Object(\n", + " os.path.join(prefix, \"validation/data.csv\")\n", + ").upload_file(\"validation.csv\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3cb55d7a", + "metadata": {}, + "outputs": [], + "source": [ + "boto3.Session().resource(\"s3\").Bucket(bucket).Object(\n", + " os.path.join(prefix, \"test/data.csv\")\n", + ").upload_file(\"test.csv\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "042b6f55", + "metadata": {}, + "outputs": [], + "source": [ + "boto3.Session().resource(\"s3\").Bucket(bucket).Object(\n", + " os.path.join(prefix, \"cat_idx.json\")\n", + ").upload_file(\"cat_idx.json\")" + ] + }, + { + "cell_type": "markdown", + "id": "b278de2a", + "metadata": {}, + "source": [ + "## 3. Train A LightGBM Model with AMT" + ] + }, + { + "cell_type": "markdown", + "id": "26d18ad9", + "metadata": {}, + "source": [ + "### 3.1. Retrieve Training Artifacts\n", + "\n", + "___\n", + "\n", + "Here, we retrieve the training docker container, the training algorithm source, and the tabular algorithm. Note that model_version=\"*\" fetches the latest model.\n", + "\n", + "For the training algorithm, we have four choices in this demonstration for classification task.\n", + "* [LightGBM](https://lightgbm.readthedocs.io/en/latest/): To use this algorithm, specify `train_model_id` as `lightgbm-classification-model` in the cell below.\n", + "* [CatBoost](https://catboost.ai/en/docs/): To use this algorithm, specify `train_model_id` as `catboost-classification-model` in the cell below.\n", + "* [TabTransformer](https://arxiv.org/abs/2012.06678): To use this algorithm, specify `train_model_id` as `pytorch-tabtransformerclassification-model` in the cell below.\n", + "* [AutoGluon Tabular](https://auto.gluon.ai/stable/tutorials/tabular_prediction/index.html): To use this algorithm, specify `train_model_id` as `autogluon-classification-ensemble` in the cell below.\n", + "\n", + "Note. 
[XGBoost](https://xgboost.readthedocs.io/en/latest/) (`train_model_id: xgboost-classification-model`) and [Linear Learner](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression) (`train_model_id: sklearn-classification-linear`) are the other choices in the tabular classification category. Since they have different input-format requirements, please check separate notebooks `xgboost_linear_learner_tabular/Amazon_Tabular_Classification_XGBoost_LinearLearner.ipynb`, `tabtransformer_tabular/Amazon_Tabular_Classification_TabTransformer.ipynb`, and `autogluon_tabular/Amazon_Tabular_Classification_AutoGluon.ipynb` for details.\n", + "\n", + "For regression task, you just need replace `classification` in the `train_model_id` with `regression`.\n", + "\n", + "\n", + "---" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0ad11b96", + "metadata": {}, + "outputs": [], + "source": [ + "from sagemaker import image_uris, model_uris, script_uris\n", + "\n", + "train_model_id, train_model_version, train_scope = \"lightgbm-classification-model\", \"*\", \"training\"\n", + "training_instance_type = \"ml.m5.4xlarge\"\n", + "\n", + "# Retrieve the docker image\n", + "train_image_uri = image_uris.retrieve(\n", + " region=None,\n", + " framework=None,\n", + " model_id=train_model_id,\n", + " model_version=train_model_version,\n", + " image_scope=train_scope,\n", + " instance_type=training_instance_type,\n", + ")\n", + "# Retrieve the training script\n", + "train_source_uri = script_uris.retrieve(\n", + " model_id=train_model_id, model_version=train_model_version, script_scope=train_scope\n", + ")\n", + "# Retrieve the pre-trained model tarball to further fine-tune\n", + "train_model_uri = model_uris.retrieve(\n", + " model_id=train_model_id, model_version=train_model_version, model_scope=train_scope\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "e8a4d3d3", + "metadata": {}, + "source": [ + "### 3.2. Set Training Parameters\n", + "\n", + "---\n", + "\n", + "Now that we are done with all the setup that is needed, we are ready to train our tabular algorithm. To begin, let us create a [``sageMaker.estimator.Estimator``](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html) object. This estimator will launch the training job. \n", + "\n", + "There are two kinds of parameters that need to be set for training. The first one are the parameters for the training job. These include: (i) Training data path. This is S3 folder in which the input data is stored, (ii) Output path: This the s3 folder in which the training output is stored. (iii) Training instance type: This indicates the type of machine on which to run the training.\n", + "\n", + "The second set of parameters are algorithm specific training hyper-parameters. \n", + "\n", + "---" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7a1f8559", + "metadata": {}, + "outputs": [], + "source": [ + "training_dataset_s3_path = f\"s3://{bucket}/{prefix}\"\n", + "\n", + "output_prefix = \"jumpstart-example-tabular-training\"\n", + "s3_output_location = f\"s3://{bucket}/{output_prefix}/output_lgb\"" + ] + }, + { + "cell_type": "markdown", + "id": "8828563c", + "metadata": {}, + "source": [ + "---\n", + "For algorithm specific hyper-parameters, we start by fetching python dictionary of the training hyper-parameters that the algorithm accepts with their default values. This can then be overridden to custom values. 
For the evaluation metric that is used by early stopping and automatic model tuning, we choose `auc` score. Note. LightGBM does not have built-in F1 score supported. See [LightGBM documentation](https://lightgbm.readthedocs.io/en/latest/Parameters.html#metric-parameters).\n", + "\n", + "---" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8cd5d2fd", + "metadata": {}, + "outputs": [], + "source": [ + "from sagemaker import hyperparameters\n", + "\n", + "# Retrieve the default hyper-parameters for fine-tuning the model\n", + "hyperparameters = hyperparameters.retrieve_default(\n", + " model_id=train_model_id, model_version=train_model_version\n", + ")\n", + "\n", + "# [Optional] Override default hyperparameters with custom values\n", + "hyperparameters[\n", + " \"num_boost_round\"\n", + "] = \"500\" # The same hyperparameter is named as \"iterations\" for CatBoost\n", + "\n", + "\n", + "hyperparameters[\"metric\"] = \"auc\"\n", + "print(hyperparameters)" + ] + }, + { + "cell_type": "markdown", + "id": "f43ec07c", + "metadata": {}, + "source": [ + "### 3.3. Train with Automatic Model Tuning \n", + "\n", + "\n", + "Amazon SageMaker automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by running many training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in a model that performs the best, as measured by a metric that you choose. We will use a HyperparameterTuner object to interact with Amazon SageMaker hyperparameter tuning APIs.\n", + "\n", + "* Note. In this notebook, we set AMT budget (total tuning jobs) as 10 for each of the tabular algorithm except AutoGluon-Tabular. For [AutoGluon-Tabular](https://arxiv.org/abs/2003.06505), it succeeds by ensembling multiple models and stacking them in multiple layers. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b136b897", + "metadata": {}, + "outputs": [], + "source": [ + "from sagemaker.tuner import ContinuousParameter, IntegerParameter, HyperparameterTuner\n", + "\n", + "use_amt = True\n", + "\n", + "hyperparameter_ranges_lgb = {\n", + " \"learning_rate\": ContinuousParameter(1e-4, 1, scaling_type=\"Logarithmic\"),\n", + " \"num_boost_round\": IntegerParameter(2, 30),\n", + " \"num_leaves\": IntegerParameter(10, 50),\n", + " \"feature_fraction\": ContinuousParameter(0, 1),\n", + " \"bagging_fraction\": ContinuousParameter(0, 1),\n", + " \"bagging_freq\": IntegerParameter(1, 10),\n", + " \"max_depth\": IntegerParameter(5, 30),\n", + " \"min_data_in_leaf\": IntegerParameter(5, 50),\n", + "}" + ] + }, + { + "cell_type": "markdown", + "id": "f209be30", + "metadata": {}, + "source": [ + "### 3.4. 
Start Training" + ] + }, + { + "cell_type": "markdown", + "id": "caf86ae9", + "metadata": {}, + "source": [ + "---\n", + "We start by creating the estimator object with all the required assets and then launch the training job.\n", + "\n", + "---" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6c6d9bab", + "metadata": {}, + "outputs": [], + "source": [ + "from sagemaker.estimator import Estimator\n", + "from sagemaker.utils import name_from_base\n", + "\n", + "training_job_name = name_from_base(f\"jumpstart-{train_model_id}-train\")\n", + "\n", + "# Create SageMaker Estimator instance\n", + "tabular_estimator = Estimator(\n", + " role=aws_role,\n", + " image_uri=train_image_uri,\n", + " source_dir=train_source_uri,\n", + " model_uri=train_model_uri,\n", + " entry_point=\"transfer_learning.py\",\n", + " instance_count=1,\n", + " instance_type=training_instance_type,\n", + " max_run=360000,\n", + " hyperparameters=hyperparameters,\n", + " output_path=s3_output_location,\n", + ")\n", + "\n", + "if use_amt:\n", + "\n", + " tuner = HyperparameterTuner(\n", + " tabular_estimator,\n", + " \"auc\",\n", + " hyperparameter_ranges_lgb,\n", + " [{\"Name\": \"auc\", \"Regex\": \"auc: ([0-9\\\\.]+)\"}],\n", + " max_jobs=10,\n", + " max_parallel_jobs=5,\n", + " objective_type=\"Maximize\",\n", + " base_tuning_job_name=training_job_name,\n", + " )\n", + "\n", + " tuner.fit({\"training\": training_dataset_s3_path}, logs=True)\n", + "else:\n", + " # Launch a SageMaker Training job by passing s3 path of the training data\n", + " tabular_estimator.fit(\n", + " {\"training\": training_dataset_s3_path}, logs=True, job_name=training_job_name\n", + " )" + ] + }, + { + "cell_type": "markdown", + "id": "1f1c8f37", + "metadata": {}, + "source": [ + "### 3.5. Deploy and Run Inference on the Trained Tabular Model\n", + "\n", + "---\n", + "\n", + "In this section, you learn how to query an existing endpoint and make predictions of the examples you input. For each example, the model will output the probability of the sample for each class in the model. 
\n", + "Next, the predicted class label is obtained by taking the class label with the maximum probability over others.\n", + "\n", + "\n", + "We start by retrieving the artifacts and deploy the `tabular_estimator` that we trained.\n", + "\n", + "---" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d0d18d65", + "metadata": {}, + "outputs": [], + "source": [ + "inference_instance_type = \"ml.m5.large\"\n", + "\n", + "# Retrieve the inference docker container uri\n", + "deploy_image_uri = image_uris.retrieve(\n", + " region=None,\n", + " framework=None,\n", + " image_scope=\"inference\",\n", + " model_id=train_model_id,\n", + " model_version=train_model_version,\n", + " instance_type=inference_instance_type,\n", + ")\n", + "# Retrieve the inference script uri\n", + "deploy_source_uri = script_uris.retrieve(\n", + " model_id=train_model_id, model_version=train_model_version, script_scope=\"inference\"\n", + ")\n", + "\n", + "endpoint_name = name_from_base(f\"jumpstart-lgb-churn-{train_model_id}-\")\n", + "\n", + "# Use the estimator from the previous step to deploy to a SageMaker endpoint\n", + "predictor = (tuner if use_amt else tabular_estimator).deploy(\n", + " initial_instance_count=1,\n", + " instance_type=inference_instance_type,\n", + " entry_point=\"inference.py\",\n", + " image_uri=deploy_image_uri,\n", + " source_dir=deploy_source_uri,\n", + " endpoint_name=endpoint_name,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "57a3c147", + "metadata": {}, + "source": [ + "---\n", + "Next, we read the customer churn test data into pandas data frame, prepare the ground truth target and predicting features to send into the endpoint. \n", + "\n", + "Below is the screenshot of the first 5 examples in the test set.\n", + "\n", + "---" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a0f8fdb7", + "metadata": {}, + "outputs": [], + "source": [ + "newline, bold, unbold = \"\\n\", \"\\033[1m\", \"\\033[0m\"\n", + "\n", + "import numpy as np\n", + "import pandas as pd\n", + "from sklearn.metrics import accuracy_score, f1_score, roc_auc_score\n", + "from sklearn.metrics import confusion_matrix\n", + "import matplotlib.pyplot as plt\n", + "\n", + "# read the data\n", + "test_data_file_name = \"test.csv\"\n", + "test_data = pd.read_csv(test_data_file_name, header=None)\n", + "test_data.columns = [\"Target\"] + [f\"Feature_{i}\" for i in range(1, test_data.shape[1])]\n", + "\n", + "num_examples, num_columns = test_data.shape\n", + "print(\n", + " f\"{bold}The test dataset contains {num_examples} examples and {num_columns} columns.{unbold}\\n\"\n", + ")\n", + "\n", + "# prepare the ground truth target and predicting features to send into the endpoint.\n", + "ground_truth_label, features = test_data.iloc[:, :1], test_data.iloc[:, 1:]\n", + "\n", + "print(f\"{bold}The first 5 observations of the data: {unbold} \\n\")\n", + "test_data.head(5)" + ] + }, + { + "cell_type": "markdown", + "id": "4f628562", + "metadata": {}, + "source": [ + "---\n", + "The following code queries the endpoint you have created to get the prediction for each test example. \n", + "The `query_endpoint()` function returns an array-like of shape (num_examples, num_classes), where each row indicates \n", + "the probability of the example for each class in the model. The num_classes is 2 in above test data. \n", + "Next, the predicted class label is obtained by taking the class label with the maximum probability over others for each example. 
\n", + "\n", + "---" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "da19a629", + "metadata": {}, + "outputs": [], + "source": [ + "content_type = \"text/csv\"\n", + "\n", + "\n", + "def query_endpoint(encoded_tabular_data, endpoint_name):\n", + " client = boto3.client(\"runtime.sagemaker\")\n", + " response = client.invoke_endpoint(\n", + " EndpointName=endpoint_name,\n", + " ContentType=content_type,\n", + " Body=encoded_tabular_data,\n", + " )\n", + " return response\n", + "\n", + "\n", + "def parse_response(query_response):\n", + " model_predictions = json.loads(query_response[\"Body\"].read())\n", + " predicted_probabilities = model_predictions[\"probabilities\"]\n", + " return np.array(predicted_probabilities)\n", + "\n", + "\n", + "# split the test data into smaller size of batches to query the endpoint if test data has large size.\n", + "batch_size = 1500\n", + "predict_prob = []\n", + "for i in np.arange(0, num_examples, step=batch_size):\n", + " query_response_batch = query_endpoint(\n", + " features.iloc[i : (i + batch_size), :].to_csv(header=False, index=False).encode(\"utf-8\"),\n", + " endpoint_name,\n", + " )\n", + " predict_prob_batch = parse_response(query_response_batch) # prediction probability per batch\n", + " predict_prob.append(predict_prob_batch)\n", + "\n", + "\n", + "predict_prob = np.concatenate(predict_prob, axis=0)\n", + "predict_label = np.argmax(predict_prob, axis=1)" + ] + }, + { + "cell_type": "markdown", + "id": "aabdee3e", + "metadata": {}, + "source": [ + "## 3.6. Evaluate the Prediction Results Returned from the Endpoint\n", + "\n", + "---\n", + "We evaluate the predictions results returned from the endpoint by following two ways.\n", + "\n", + "* Visualize the predictions results by plotting the confusion matrix.\n", + "\n", + "* Measure the prediction results quantitatively.\n", + "\n", + "---" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1f3610bc", + "metadata": {}, + "outputs": [], + "source": [ + "# Visualize the predictions results by plotting the confusion matrix.\n", + "conf_matrix = confusion_matrix(y_true=ground_truth_label.values, y_pred=predict_label)\n", + "fig, ax = plt.subplots(figsize=(7.5, 7.5))\n", + "ax.matshow(conf_matrix, cmap=plt.cm.Blues, alpha=0.3)\n", + "for i in range(conf_matrix.shape[0]):\n", + " for j in range(conf_matrix.shape[1]):\n", + " ax.text(x=j, y=i, s=conf_matrix[i, j], va=\"center\", ha=\"center\", size=\"xx-large\")\n", + "\n", + "plt.xlabel(\"Predictions\", fontsize=18)\n", + "plt.ylabel(\"Actuals\", fontsize=18)\n", + "plt.title(\"Confusion Matrix\", fontsize=18)\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a59c801e", + "metadata": {}, + "outputs": [], + "source": [ + "# Measure the prediction results quantitatively.\n", + "eval_accuracy = accuracy_score(ground_truth_label.values, predict_label)\n", + "eval_f1 = f1_score(ground_truth_label.values, predict_label)\n", + "eval_auc = roc_auc_score(ground_truth_label.values, predict_prob[:, 1])\n", + "\n", + "lgb_results = pd.DataFrame.from_dict(\n", + " {\n", + " \"Accuracy\": eval_accuracy,\n", + " \"F1\": eval_f1,\n", + " \"AUC\": eval_auc,\n", + " },\n", + " orient=\"index\",\n", + " columns=[\"LightGBM with AMT\"],\n", + ")\n", + "\n", + "lgb_results" + ] + }, + { + "cell_type": "markdown", + "id": "ab7d2f6d", + "metadata": {}, + "source": [ + "## 4. 
Train A CatBoost model with AMT\n" + ] + }, + { + "cell_type": "markdown", + "id": "c49487ca", + "metadata": {}, + "source": [ + "### 4.1. Train with Automatic Model Tuning\n" + ] + }, + { + "cell_type": "markdown", + "id": "0e3350a3", + "metadata": {}, + "source": [ + "Retrieve Training Artifacts" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a67cce3a", + "metadata": {}, + "outputs": [], + "source": [ + "from sagemaker import image_uris, model_uris, script_uris\n", + "\n", + "train_model_id, train_model_version, train_scope = \"catboost-classification-model\", \"*\", \"training\"\n", + "training_instance_type = \"ml.m5.4xlarge\"\n", + "\n", + "# Retrieve the docker image\n", + "train_image_uri = image_uris.retrieve(\n", + " region=None,\n", + " framework=None,\n", + " model_id=train_model_id,\n", + " model_version=train_model_version,\n", + " image_scope=train_scope,\n", + " instance_type=training_instance_type,\n", + ")\n", + "# Retrieve the training script\n", + "train_source_uri = script_uris.retrieve(\n", + " model_id=train_model_id, model_version=train_model_version, script_scope=train_scope\n", + ")\n", + "# Retrieve the pre-trained model tarball to further fine-tune\n", + "train_model_uri = model_uris.retrieve(\n", + " model_id=train_model_id, model_version=train_model_version, model_scope=train_scope\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "5798369b", + "metadata": {}, + "source": [ + "Set training parameters" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b7ada7ad", + "metadata": {}, + "outputs": [], + "source": [ + "from sagemaker import hyperparameters\n", + "\n", + "# Retrieve the default hyper-parameters for fine-tuning the model\n", + "hyperparameters = hyperparameters.retrieve_default(\n", + " model_id=train_model_id, model_version=train_model_version\n", + ")\n", + "\n", + "# [Optional] Override default hyperparameters with custom values\n", + "hyperparameters[\"iterations\"] = \"500\"\n", + "\n", + "\n", + "hyperparameters[\"eval_metric\"] = \"AUC\"\n", + "print(hyperparameters)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0c02b4aa", + "metadata": {}, + "outputs": [], + "source": [ + "s3_output_location_cat = f\"s3://{bucket}/{output_prefix}/output_cat\"" + ] + }, + { + "cell_type": "markdown", + "id": "c69d3f7a", + "metadata": {}, + "source": [ + "Train with Automatic Model Tuning" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0cd0ac41", + "metadata": {}, + "outputs": [], + "source": [ + "hyperparameter_ranges_cat = {\n", + " \"learning_rate\": ContinuousParameter(0.00001, 0.1, scaling_type=\"Logarithmic\"),\n", + " \"iterations\": IntegerParameter(50, 1000),\n", + " \"depth\": IntegerParameter(1, 10),\n", + " \"l2_leaf_reg\": IntegerParameter(1, 10),\n", + " \"random_strength\": ContinuousParameter(0.01, 10, scaling_type=\"Logarithmic\"),\n", + "}" + ] + }, + { + "cell_type": "markdown", + "id": "17053327", + "metadata": {}, + "source": [ + "Start training" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cb34ccbe", + "metadata": {}, + "outputs": [], + "source": [ + "from sagemaker.estimator import Estimator\n", + "from sagemaker.utils import name_from_base\n", + "\n", + "training_job_name = name_from_base(f\"jumpstart-{train_model_id}-training\")\n", + "\n", + "# Create SageMaker Estimator instance\n", + "tabular_estimator_cat = Estimator(\n", + " role=aws_role,\n", + " image_uri=train_image_uri,\n", + " 
source_dir=train_source_uri,\n", + " model_uri=train_model_uri,\n", + " entry_point=\"transfer_learning.py\",\n", + " instance_count=1,\n", + " instance_type=training_instance_type,\n", + " max_run=360000,\n", + " hyperparameters=hyperparameters,\n", + " output_path=s3_output_location_cat,\n", + ")\n", + "\n", + "if use_amt:\n", + "\n", + " tuner_cat = HyperparameterTuner(\n", + " tabular_estimator_cat,\n", + " \"AUC\",\n", + " hyperparameter_ranges_cat,\n", + " [{\"Name\": \"AUC\", \"Regex\": \"bestTest = ([0-9\\\\.]+)\"}],\n", + " max_jobs=10,\n", + " max_parallel_jobs=5,\n", + " objective_type=\"Maximize\",\n", + " base_tuning_job_name=training_job_name,\n", + " )\n", + "\n", + " tuner_cat.fit({\"training\": training_dataset_s3_path}, logs=True)\n", + "else:\n", + " # Launch a SageMaker Training job by passing s3 path of the training data\n", + " tabular_estimator_cat.fit(\n", + " {\"training\": training_dataset_s3_path}, logs=True, job_name=training_job_name\n", + " )" + ] + }, + { + "cell_type": "markdown", + "id": "33ad5e7a", + "metadata": {}, + "source": [ + "### 4.2. Deploy and Run Inference on the Trained Tabular Model" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2159fc95", + "metadata": {}, + "outputs": [], + "source": [ + "inference_instance_type = \"ml.m5.large\"\n", + "\n", + "# Retrieve the inference docker container uri\n", + "deploy_image_uri = image_uris.retrieve(\n", + " region=None,\n", + " framework=None,\n", + " image_scope=\"inference\",\n", + " model_id=train_model_id,\n", + " model_version=train_model_version,\n", + " instance_type=inference_instance_type,\n", + ")\n", + "# Retrieve the inference script uri\n", + "deploy_source_uri = script_uris.retrieve(\n", + " model_id=train_model_id, model_version=train_model_version, script_scope=\"inference\"\n", + ")\n", + "\n", + "endpoint_name_cat = name_from_base(f\"jumpstart-cat-churn-{train_model_id}-\")\n", + "\n", + "# Use the estimator from the previous step to deploy to a SageMaker endpoint\n", + "predictor_cat = (tuner_cat if use_amt else tabular_estimator_cat).deploy(\n", + " initial_instance_count=1,\n", + " instance_type=inference_instance_type,\n", + " entry_point=\"inference.py\",\n", + " image_uri=deploy_image_uri,\n", + " source_dir=deploy_source_uri,\n", + " endpoint_name=endpoint_name_cat,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "fd36650b", + "metadata": {}, + "source": [ + "Query the endpoint" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fa560463", + "metadata": {}, + "outputs": [], + "source": [ + "# split the test data into smaller size of batches to query the endpoint if the test data has large size.\n", + "batch_size = 1500\n", + "predict_prob_cat = []\n", + "for i in np.arange(0, num_examples, step=batch_size):\n", + " query_response_batch = query_endpoint(\n", + " features.iloc[i : (i + batch_size), :].to_csv(header=False, index=False).encode(\"utf-8\"),\n", + " endpoint_name_cat,\n", + " )\n", + " predict_prob_batch = parse_response(query_response_batch) # prediction probability per batch\n", + " predict_prob_cat.append(predict_prob_batch)\n", + "\n", + "\n", + "predict_prob_cat = np.concatenate(predict_prob_cat, axis=0)\n", + "predict_label_cat = np.argmax(predict_prob_cat, axis=1)" + ] + }, + { + "cell_type": "markdown", + "id": "3c62c458", + "metadata": {}, + "source": [ + "Evaluate the prediction results returned from the endpoint" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b012badc", + 
"metadata": {}, + "outputs": [], + "source": [ + "# Visualize the predictions results by plotting the confusion matrix.\n", + "conf_matrix = confusion_matrix(y_true=ground_truth_label.values, y_pred=predict_label_cat)\n", + "fig, ax = plt.subplots(figsize=(7.5, 7.5))\n", + "ax.matshow(conf_matrix, cmap=plt.cm.Blues, alpha=0.3)\n", + "for i in range(conf_matrix.shape[0]):\n", + " for j in range(conf_matrix.shape[1]):\n", + " ax.text(x=j, y=i, s=conf_matrix[i, j], va=\"center\", ha=\"center\", size=\"xx-large\")\n", + "\n", + "plt.xlabel(\"Predictions\", fontsize=18)\n", + "plt.ylabel(\"Actuals\", fontsize=18)\n", + "plt.title(\"Confusion Matrix\", fontsize=18)\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e1e6c3a0", + "metadata": {}, + "outputs": [], + "source": [ + "# Measure the prediction results quantitatively.\n", + "eval_accuracy_cat = accuracy_score(ground_truth_label.values, predict_label_cat)\n", + "eval_f1_cat = f1_score(ground_truth_label.values, predict_label_cat)\n", + "eval_auc_cat = roc_auc_score(ground_truth_label.values, predict_prob_cat[:, 1])\n", + "\n", + "cat_results = pd.DataFrame.from_dict(\n", + " {\n", + " \"Accuracy\": eval_accuracy_cat,\n", + " \"F1\": eval_f1_cat,\n", + " \"AUC\": eval_auc_cat,\n", + " },\n", + " orient=\"index\",\n", + " columns=[\"CatBoost with AMT\"],\n", + ")\n", + "\n", + "results_lab_cat = pd.concat([lgb_results, cat_results], axis=1)\n", + "results_lab_cat" + ] + }, + { + "cell_type": "markdown", + "id": "026fc463", + "metadata": {}, + "source": [ + "## 5. Train A TabTransformer model with AMT" + ] + }, + { + "cell_type": "markdown", + "id": "a618e4af", + "metadata": {}, + "source": [ + "### 5.1. Train with Automatic Model Tuning" + ] + }, + { + "cell_type": "markdown", + "id": "f20e80bc", + "metadata": {}, + "source": [ + "Retrieve Training Artifacts" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f420b4d1", + "metadata": {}, + "outputs": [], + "source": [ + "train_model_id, train_model_version, train_scope = (\n", + " \"pytorch-tabtransformerclassification-model\",\n", + " \"*\",\n", + " \"training\",\n", + ")\n", + "training_instance_type = \"ml.p3.2xlarge\"\n", + "\n", + "# Retrieve the docker image\n", + "train_image_uri = image_uris.retrieve(\n", + " region=None,\n", + " framework=None,\n", + " model_id=train_model_id,\n", + " model_version=train_model_version,\n", + " image_scope=train_scope,\n", + " instance_type=training_instance_type,\n", + ")\n", + "# Retrieve the training script\n", + "train_source_uri = script_uris.retrieve(\n", + " model_id=train_model_id, model_version=train_model_version, script_scope=train_scope\n", + ")\n", + "# Retrieve the pre-trained model tarball to further fine-tune\n", + "train_model_uri = model_uris.retrieve(\n", + " model_id=train_model_id, model_version=train_model_version, model_scope=train_scope\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "e133b1ed", + "metadata": {}, + "source": [ + "Set training parameters" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1e4348e5", + "metadata": {}, + "outputs": [], + "source": [ + "from sagemaker import hyperparameters\n", + "\n", + "# Retrieve the default hyper-parameters for fine-tuning the model\n", + "hyperparameters = hyperparameters.retrieve_default(\n", + " model_id=train_model_id, model_version=train_model_version\n", + ")\n", + "\n", + "# [Optional] Override default hyperparameters with custom values\n", + "hyperparameters[\"n_epochs\"] = 40 # 
The same hyperparameter is named as \"iterations\" for CatBoost\n", + "hyperparameters[\"patience\"] = 10\n", + "\n", + "print(hyperparameters)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0079c15c", + "metadata": {}, + "outputs": [], + "source": [ + "s3_output_location_tab = f\"s3://{bucket}/{output_prefix}/output_tab\"" + ] + }, + { + "cell_type": "markdown", + "id": "7bcce249", + "metadata": {}, + "source": [ + "Train with Automatic Model Tuning" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d9baa338", + "metadata": {}, + "outputs": [], + "source": [ + "from sagemaker.tuner import (\n", + " ContinuousParameter,\n", + " IntegerParameter,\n", + " HyperparameterTuner,\n", + " CategoricalParameter,\n", + ")\n", + "\n", + "hyperparameter_ranges_tab = {\n", + " \"learning_rate\": ContinuousParameter(0.001, 0.01, scaling_type=\"Auto\"),\n", + " \"batch_size\": CategoricalParameter([64, 128, 256, 512]),\n", + " \"attn_dropout\": ContinuousParameter(0.0, 0.8, scaling_type=\"Auto\"),\n", + " \"mlp_dropout\": ContinuousParameter(0.0, 0.8, scaling_type=\"Auto\"),\n", + " \"input_dim\": CategoricalParameter([\"16\", \"32\", \"64\", \"128\", \"256\"]),\n", + " \"frac_shared_embed\": ContinuousParameter(0.0, 0.5, scaling_type=\"Auto\"),\n", + "}" + ] + }, + { + "cell_type": "markdown", + "id": "edba0682", + "metadata": {}, + "source": [ + "Start training" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0c1b2be2", + "metadata": {}, + "outputs": [], + "source": [ + "training_job_name = name_from_base(f\"jumpstart-{train_model_id}-training\")\n", + "\n", + "# Create SageMaker Estimator instance\n", + "tabular_estimator_tab = Estimator(\n", + " role=aws_role,\n", + " image_uri=train_image_uri,\n", + " source_dir=train_source_uri,\n", + " model_uri=train_model_uri,\n", + " entry_point=\"transfer_learning.py\",\n", + " instance_count=1,\n", + " instance_type=training_instance_type,\n", + " max_run=360000,\n", + " hyperparameters=hyperparameters,\n", + " output_path=s3_output_location_tab,\n", + ")\n", + "\n", + "if use_amt:\n", + "\n", + " tuner_tab = HyperparameterTuner(\n", + " tabular_estimator_tab,\n", + " \"f1_score\", # Note, TabTransformer currently does not support AUC score, thus we use its default setting F1 score as an alternative evaluation metric.\n", + " hyperparameter_ranges_tab,\n", + " [{\"Name\": \"f1_score\", \"Regex\": \"metrics={'f1': (\\\\S+)}\"}],\n", + " max_jobs=10,\n", + " max_parallel_jobs=5, # reduce max_parallel_jobs number if the instance type is limited in your account\n", + " objective_type=\"Maximize\",\n", + " base_tuning_job_name=training_job_name,\n", + " )\n", + "\n", + " tuner_tab.fit({\"training\": training_dataset_s3_path}, logs=True)\n", + "else:\n", + " # Launch a SageMaker Training job by passing s3 path of the training data\n", + " tabular_estimator_tab.fit(\n", + " {\"training\": training_dataset_s3_path}, logs=True, job_name=training_job_name\n", + " )" + ] + }, + { + "cell_type": "markdown", + "id": "4f5a8b89", + "metadata": {}, + "source": [ + " \n", + "### 5.2. 
Deploy and Run Inference on the Trained Tabular Model" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5d1d6afb", + "metadata": {}, + "outputs": [], + "source": [ + "inference_instance_type = \"ml.m5.2xlarge\"\n", + "\n", + "# Retrieve the inference docker container uri\n", + "deploy_image_uri = image_uris.retrieve(\n", + " region=None,\n", + " framework=None,\n", + " image_scope=\"inference\",\n", + " model_id=train_model_id,\n", + " model_version=train_model_version,\n", + " instance_type=inference_instance_type,\n", + ")\n", + "# Retrieve the inference script uri\n", + "deploy_source_uri = script_uris.retrieve(\n", + " model_id=train_model_id, model_version=train_model_version, script_scope=\"inference\"\n", + ")\n", + "\n", + "endpoint_name_tab = name_from_base(f\"jumpstart-tabtransformer-churn-{train_model_id}-\")\n", + "\n", + "# Use the estimator from the previous step to deploy to a SageMaker endpoint\n", + "predictor_tab = (tuner_tab if use_amt else tabular_estimator_tab).deploy(\n", + " initial_instance_count=1,\n", + " instance_type=inference_instance_type,\n", + " entry_point=\"inference.py\",\n", + " image_uri=deploy_image_uri,\n", + " source_dir=deploy_source_uri,\n", + " endpoint_name=endpoint_name_tab,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5e6d70ad", + "metadata": {}, + "outputs": [], + "source": [ + "# split the test data into smaller size of batches to query the endpoint if the test data has large size.\n", + "batch_size = 1500\n", + "predict_prob_tab = []\n", + "for i in np.arange(0, num_examples, step=batch_size):\n", + " query_response_batch = query_endpoint(\n", + " features.iloc[i : (i + batch_size), :].to_csv(header=False, index=False).encode(\"utf-8\"),\n", + " endpoint_name_tab,\n", + " )\n", + " predict_prob_batch = parse_response(query_response_batch) # prediction probability per batch\n", + " predict_prob_tab.append(predict_prob_batch)\n", + "\n", + "\n", + "predict_prob_tab = np.concatenate(predict_prob_tab, axis=0)\n", + "predict_label_tab = np.argmax(predict_prob_tab, axis=1)" + ] + }, + { + "cell_type": "markdown", + "id": "c7533d36", + "metadata": {}, + "source": [ + "Evaluate the prediction results returned from the endpoint" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "641f8234", + "metadata": {}, + "outputs": [], + "source": [ + "# Visualize the predictions results by plotting the confusion matrix.\n", + "conf_matrix = confusion_matrix(y_true=ground_truth_label.values, y_pred=predict_label_tab)\n", + "fig, ax = plt.subplots(figsize=(7.5, 7.5))\n", + "ax.matshow(conf_matrix, cmap=plt.cm.Blues, alpha=0.3)\n", + "for i in range(conf_matrix.shape[0]):\n", + " for j in range(conf_matrix.shape[1]):\n", + " ax.text(x=j, y=i, s=conf_matrix[i, j], va=\"center\", ha=\"center\", size=\"xx-large\")\n", + "\n", + "plt.xlabel(\"Predictions\", fontsize=18)\n", + "plt.ylabel(\"Actuals\", fontsize=18)\n", + "plt.title(\"Confusion Matrix\", fontsize=18)\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "17e29efd", + "metadata": {}, + "outputs": [], + "source": [ + "# Measure the prediction results quantitatively.\n", + "eval_accuracy_tab = accuracy_score(ground_truth_label.values, predict_label_tab)\n", + "eval_f1_tab = f1_score(ground_truth_label.values, predict_label_tab)\n", + "eval_auc_tab = roc_auc_score(ground_truth_label.values, predict_prob_tab[:, 1])\n", + "\n", + "tab_results = pd.DataFrame.from_dict(\n", + " {\n", + " 
\"Accuracy\": eval_accuracy_tab,\n", + " \"F1\": eval_f1_tab,\n", + " \"AUC\": eval_auc_tab,\n", + " },\n", + " orient=\"index\",\n", + " columns=[\"TabTransformer with AMT\"],\n", + ")\n", + "\n", + "results_lab_cat_tab = pd.concat([results_lab_cat, tab_results], axis=1)\n", + "results_lab_cat_tab" + ] + }, + { + "cell_type": "markdown", + "id": "ea964d81", + "metadata": {}, + "source": [ + "## 6. Train An AutoGluon-Tabular model" + ] + }, + { + "cell_type": "markdown", + "id": "2c1fd4df", + "metadata": {}, + "source": [ + "### 6.1. Train with AutoGluon-Tabular model\n" + ] + }, + { + "cell_type": "markdown", + "id": "d9c6393b", + "metadata": {}, + "source": [ + "Retrieve Training Artifacts" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b247833f", + "metadata": {}, + "outputs": [], + "source": [ + "from sagemaker import image_uris, model_uris, script_uris\n", + "\n", + "# Currently, not all the object detection models in jumpstart support finetuning. Thus, we manually select a model\n", + "# which supports finetuning.\n", + "train_model_id, train_model_version, train_scope = (\n", + " \"autogluon-classification-ensemble\",\n", + " \"*\",\n", + " \"training\",\n", + ")\n", + "training_instance_type = \"ml.g4dn.2xlarge\" # set a different GPU type to avoid instance insufficiency for p3 instance that is used by TabTransformer\n", + "\n", + "# Retrieve the docker image\n", + "train_image_uri = image_uris.retrieve(\n", + " region=None,\n", + " framework=None,\n", + " model_id=train_model_id,\n", + " model_version=train_model_version,\n", + " image_scope=train_scope,\n", + " instance_type=training_instance_type,\n", + ")\n", + "# Retrieve the training script\n", + "train_source_uri = script_uris.retrieve(\n", + " model_id=train_model_id, model_version=train_model_version, script_scope=train_scope\n", + ")\n", + "# Retrieve the pre-trained model tarball to further fine-tune\n", + "train_model_uri = model_uris.retrieve(\n", + " model_id=train_model_id, model_version=train_model_version, model_scope=train_scope\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "c288f5a8", + "metadata": {}, + "source": [ + "Set training parameters" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "577586e1", + "metadata": {}, + "outputs": [], + "source": [ + "from sagemaker import hyperparameters\n", + "\n", + "# Retrieve the default hyper-parameters for fine-tuning the model\n", + "hyperparameters = hyperparameters.retrieve_default(\n", + " model_id=train_model_id, model_version=train_model_version\n", + ")\n", + "\n", + "hyperparameters[\"eval_metric\"] = \"roc_auc\"\n", + "print(hyperparameters)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "55b4b386", + "metadata": {}, + "outputs": [], + "source": [ + "s3_output_location_ag = f\"s3://{bucket}/{output_prefix}/output_ag\"" + ] + }, + { + "cell_type": "markdown", + "id": "278f7178", + "metadata": {}, + "source": [ + "Start training\n", + "\n", + "Note. We do not perform automatic model tuning as AutoGluon-Tabular do not focus on hyperparameter selections. Instead, it ensembles multiple models and stacks them in multiple layers. For details, see [paper](https://arxiv.org/abs/2003.06505)." 
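+    ,
+    "\n",
+    "Before launching the single AutoGluon-Tabular training job below, you can optionally review what AMT selected for the three tuned algorithms. A brief sketch, assuming the tuners from the earlier sections (`tuner`, `tuner_cat`, `tuner_tab`) have completed:\n",
+    "\n",
+    "```python\n",
+    "# Inspect the best training job and top trials found by AMT for each tuned algorithm.\n",
+    "for name, t in [(\"LightGBM\", tuner), (\"CatBoost\", tuner_cat), (\"TabTransformer\", tuner_tab)]:\n",
+    "    print(f\"{name}: best training job = {t.best_training_job()}\")\n",
+    "    trials = t.analytics().dataframe()  # one row per completed tuning trial\n",
+    "    print(trials.sort_values(\"FinalObjectiveValue\", ascending=False).head(3))\n",
+    "```"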
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c07b6103", + "metadata": {}, + "outputs": [], + "source": [ + "from sagemaker.estimator import Estimator\n", + "from sagemaker.utils import name_from_base\n", + "\n", + "training_job_name = name_from_base(f\"jumpstart-{train_model_id}-training\")\n", + "\n", + "# Create SageMaker Estimator instance\n", + "tabular_estimator_ag = Estimator(\n", + " role=aws_role,\n", + " image_uri=train_image_uri,\n", + " source_dir=train_source_uri,\n", + " model_uri=train_model_uri,\n", + " entry_point=\"transfer_learning.py\",\n", + " instance_count=1,\n", + " instance_type=training_instance_type,\n", + " max_run=360000,\n", + " hyperparameters=hyperparameters,\n", + " output_path=s3_output_location_ag,\n", + ")\n", + "\n", + "\n", + "# Launch a SageMaker Training job by passing s3 path of the training data\n", + "tabular_estimator_ag.fit(\n", + " {\"training\": training_dataset_s3_path}, logs=True, job_name=training_job_name\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "d6b8361d", + "metadata": {}, + "source": [ + "### 6.2. Deploy and Run Inference on the Trained Tabular Model" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f6dc44a3", + "metadata": {}, + "outputs": [], + "source": [ + "inference_instance_type = \"ml.m5.2xlarge\"\n", + "\n", + "# Retrieve the inference docker container uri\n", + "deploy_image_uri = image_uris.retrieve(\n", + " region=None,\n", + " framework=None,\n", + " image_scope=\"inference\",\n", + " model_id=train_model_id,\n", + " model_version=train_model_version,\n", + " instance_type=inference_instance_type,\n", + ")\n", + "# Retrieve the inference script uri\n", + "deploy_source_uri = script_uris.retrieve(\n", + " model_id=train_model_id, model_version=train_model_version, script_scope=\"inference\"\n", + ")\n", + "\n", + "endpoint_name_ag = name_from_base(f\"jumpstart-ag-churn-{train_model_id}-\")\n", + "\n", + "# Use the estimator from the previous step to deploy to a SageMaker endpoint\n", + "predictor_ag = tabular_estimator_ag.deploy(\n", + " initial_instance_count=1,\n", + " instance_type=inference_instance_type,\n", + " entry_point=\"inference.py\",\n", + " image_uri=deploy_image_uri,\n", + " source_dir=deploy_source_uri,\n", + " endpoint_name=endpoint_name_ag,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c5cf7b37", + "metadata": {}, + "outputs": [], + "source": [ + "# split the test data into smaller size of batches to query the endpoint if the test data has large size.\n", + "batch_size = 1500\n", + "predict_prob_ag = []\n", + "for i in np.arange(0, num_examples, step=batch_size):\n", + " query_response_batch = query_endpoint(\n", + " features.iloc[i : (i + batch_size), :].to_csv(header=False, index=False).encode(\"utf-8\"),\n", + " endpoint_name_ag,\n", + " )\n", + " predict_prob_batch = parse_response(query_response_batch) # prediction probability per batch\n", + " predict_prob_ag.append(predict_prob_batch)\n", + "\n", + "\n", + "predict_prob_ag = np.concatenate(predict_prob_ag, axis=0)\n", + "predict_label_ag = np.argmax(predict_prob_ag, axis=1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4d86ccb2", + "metadata": {}, + "outputs": [], + "source": [ + "# Visualize the predictions results by plotting the confusion matrix.\n", + "conf_matrix = confusion_matrix(y_true=ground_truth_label.values, y_pred=predict_label_ag)\n", + "fig, ax = plt.subplots(figsize=(7.5, 7.5))\n", + "ax.matshow(conf_matrix, 
cmap=plt.cm.Blues, alpha=0.3)\n", + "for i in range(conf_matrix.shape[0]):\n", + " for j in range(conf_matrix.shape[1]):\n", + " ax.text(x=j, y=i, s=conf_matrix[i, j], va=\"center\", ha=\"center\", size=\"xx-large\")\n", + "\n", + "plt.xlabel(\"Predictions\", fontsize=18)\n", + "plt.ylabel(\"Actuals\", fontsize=18)\n", + "plt.title(\"Confusion Matrix\", fontsize=18)\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2bfbab51", + "metadata": {}, + "outputs": [], + "source": [ + "# Measure the prediction results quantitatively.\n", + "eval_accuracy_ag = accuracy_score(ground_truth_label.values, predict_label_ag)\n", + "eval_f1_ag = f1_score(ground_truth_label.values, predict_label_ag)\n", + "eval_auc_ag = roc_auc_score(ground_truth_label.values, predict_prob_ag[:, 1])\n", + "\n", + "ag_results = pd.DataFrame.from_dict(\n", + " {\n", + " \"Accuracy\": eval_accuracy_ag,\n", + " \"F1\": eval_f1_ag,\n", + " \"AUC\": eval_auc_ag,\n", + " },\n", + " orient=\"index\",\n", + " columns=[\"AutoGluon-Tabular\"],\n", + ")\n", + "\n", + "results_lab_cat_tab_ag = pd.concat([results_lab_cat_tab, ag_results], axis=1)\n", + "results_lab_cat_tab_ag" + ] + }, + { + "cell_type": "markdown", + "id": "cbbfc102", + "metadata": {}, + "source": [ + "## 7. Compare Prediction Results of Four Trained Models on the Same Test Data\n", + "\n", + "For the three evaluation metrics accuracy, f1 score, and roc_auc, larger value indicates better results. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "25ebee1c", + "metadata": {}, + "outputs": [], + "source": [ + "results_lab_cat_tab_ag" + ] + }, + { + "cell_type": "markdown", + "id": "b7f3a1eb", + "metadata": {}, + "source": [ + "Now you can use this template to evaluate the performance of LightGBM, CatBoost, TabTransformer, and AutoGluon-Tabular on your own dataset." + ] + }, + { + "cell_type": "markdown", + "id": "95194916", + "metadata": {}, + "source": [ + "---\n", + "Next, we delete the endpoint corresponding to the trained model.\n", + "\n", + "---" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8491a547", + "metadata": {}, + "outputs": [], + "source": [ + "# Delete the SageMaker endpoint and the attached resources\n", + "predictor.delete_model()\n", + "predictor.delete_endpoint()\n", + "predictor_cat.delete_model()\n", + "predictor_cat.delete_endpoint()\n", + "predictor_tab.delete_model()\n", + "predictor_tab.delete_endpoint()\n", + "predictor_ag.delete_model()\n", + "predictor_ag.delete_endpoint()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "conda_python3", + "language": "python", + "name": "conda_python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}