diff --git a/prep_data/tabular_data/01_preprocessing_tabular_data.ipynb b/prep_data/tabular_data/01_preprocessing_tabular_data.ipynb deleted file mode 100644 index 529214a1fd..0000000000 --- a/prep_data/tabular_data/01_preprocessing_tabular_data.ipynb +++ /dev/null @@ -1,281 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Preprocessing Tabular Data\n", - "\n", - "The purpose of this notebook is to demonstrate how to preprocess tabular data for training a machine learning model via Amazon SageMaker. In this notebook we focus on preprocessing our tabular data. In a sequel notebook, [02_feature_selection_tabular_data.ipynb](02_feature_selection_tabular_data.ipynb) we use our preprocessed tabular data to select important features and prune unimportant ones out. In our final sequel notebook, [03_training_model_on_tabular_data.ipynb](03_training_model_on_tabular_data.ipynb) we use our selected features to train a machine learning model. We showcase how to preprocess 2 different tabular data sets. \n", - "\n", - "\n", - "#### Notes\n", - "In this notebook, we use the sklearn framework for data partitionining and `storemagic` to share dataframes in [02_feature_selection_tabular_data.ipynb](02_feature_selection_tabular_data.ipynb) and [03_training_model_on_tabular_data.ipynb](03_training_model_on_tabular_data.ipynb). While we load data into memory here we do note that is it possible to skip this and load your partitioned data directly to an S3 bucket.\n", - "\n", - "#### Tabular Data Sets\n", - "* [california house data](https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html)\n", - "* [diabetes data ](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html)\n", - "\n", - "\n", - "#### Library Dependencies:\n", - "* sagemaker>=2.15.0\n", - "* numpy \n", - "* pandas\n", - "* plotly\n", - "* sklearn \n", - "* matplotlib \n", - "* seaborn\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Setting up the notebook" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "import sys\n", - "import subprocess\n", - "import pkg_resources\n", - "\n", - "\n", - "def get_sagemaker_version():\n", - " \"Return the version of 'sagemaker' in your kernel or -1 if 'sagemaker' is not installed\"\n", - " for i in pkg_resources.working_set:\n", - " if i.key == \"sagemaker\":\n", - " return \"%s==%s\" % (i.key, i.version)\n", - " return -1\n", - "\n", - "\n", - "# Store original 'sagemaker' version\n", - "sagemaker_version = get_sagemaker_version()\n", - "\n", - "# Install any missing dependencies\n", - "!{sys.executable} -m pip install -qU 'plotly' 'sagemaker>=2.15.0'\n", - "\n", - "import numpy as np\n", - "import pandas as pd\n", - "import matplotlib.pyplot as plt\n", - "import seaborn as sns\n", - "from sklearn.datasets import *\n", - "import sklearn.model_selection\n", - "\n", - "# SageMaker dependencies\n", - "import sagemaker\n", - "from sagemaker import get_execution_role\n", - "from sagemaker.image_uris import retrieve\n", - "\n", - "# This instantiates a SageMaker session that we will be operating in.\n", - "session = sagemaker.Session()\n", - "\n", - "# This object represents the IAM role that we are assigned.\n", - "role = sagemaker.get_execution_role()\n", - "print(role)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Step 1: Select and Download Data\n", - "\n", - "Here you can select the tabular data set of your choice 
to preprocess." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "data_sets = {\"diabetes\": \"load_diabetes()\", \"california\": \"fetch_california_housing()\"}" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "To do select a particular dataset, assign **choosen_data_set** below to be one of 'diabetes', or 'california' where each name corresponds to the it's respective dataset.\n", - "\n", - "* 'california' : [california house data](https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html)\n", - "* 'diabetes' : [diabetes data ](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Change choosen_data_set variable to one of the data sets above.\n", - "choosen_data_set = \"california\"\n", - "assert choosen_data_set in data_sets.keys()\n", - "print(\"I selected the '{}' dataset!\".format(choosen_data_set))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Step 2: Describe Feature Information \n", - "\n", - "Here you can select the tabular data set of your choice to preprocess." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "data_set = eval(data_sets[choosen_data_set])\n", - "\n", - "X = pd.DataFrame(data_set.data, columns=data_set.feature_names)\n", - "Y = pd.DataFrame(data_set.target)\n", - "\n", - "print(\"Features:\", list(X.columns))\n", - "print(\"Dataset shape:\", X.shape)\n", - "print(\"Dataset Type:\", type(X))\n", - "print(\"Label set shape:\", Y.shape)\n", - "print(\"Label set Type:\", type(X))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### We describe both our training data inputs X and outputs Y by computing the count, mean, std, min, percentiles. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "display(X.describe())" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "display(Y.describe())" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Step 3: Plot on Feature Correlation\n", - "Here we show a heatmap and clustergrid across all our features. These visualizations help us analyze correlated features and are particularly important if we want to remove redundant features. The heatmap computes a similarity score across each feature and colors like features using this score. The clustergrid is similar, however it presents feature correlations hierarchically.\n", - "\n", - "**Note**: For the purposes of this notebook we do not remove any features but by gathering the findings from these plots one may choose to and can do so at this point. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "plt.figure(figsize=(14, 12))\n", - "cor = X.corr()\n", - "sns.heatmap(cor, annot=True, cmap=sns.diverging_palette(20, 220, n=200))\n", - "plt.show()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "cluster_map = sns.clustermap(cor, cmap=sns.diverging_palette(20, 220, n=200), linewidths=0.1)\n", - "plt.setp(cluster_map.ax_heatmap.yaxis.get_majorticklabels(), rotation=0)\n", - "cluster_map" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Step 4: Partition Dataset into Train, Test, Validation Splits\n", - "Here using the sklearn framework we partition our selected dataset into Train, Test and Validation splits. We choose a partition size of 1/3 and then further split the training set into 2/3 training and 1/3 validation set. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# We partition the dataset into 2/3 training and 1/3 test set.\n", - "X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X, Y, test_size=0.33)\n", - "\n", - "# We further split the training set into a validation set i.e., 2/3 training set, and 1/3 validation set\n", - "X_train, X_val, Y_train, Y_val = sklearn.model_selection.train_test_split(\n", - " X_train, Y_train, test_size=0.33\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Step 5: Store Variables using `storemagic`\n", - "We use storemagic to persist all relevant variables so they can be reused in our sequel notebooks, [02_feature_selection_tabular_data.ipynb ](02_feature_selection_tabular_data.ipynb ) and [03_training_model_on_tabular_data.ipynb](03_training_model_on_tabular_data.ipynb). \n", - "\n", - "Alternatively, it is possible to upload your partitioned data to an S3 bucket and point to it during the model training phase. We note that this is beyond the scope of this notebook hence why we omit it. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Using storemagic we persist the variables below so we can access them in the 02_feature_selection_tabular_data.ipynb and training_model_on_tabular_data.ipynb\n", - "%store X_train\n", - "%store X_test\n", - "%store X_val\n", - "%store Y_train\n", - "%store Y_test\n", - "%store Y_val\n", - "%store choosen_data_set\n", - "%store sagemaker_version" - ] - } - ], - "metadata": { - "instance_type": "ml.t3.medium", - "kernelspec": { - "display_name": "Python 3 (Data Science)", - "language": "python", - "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.7.6" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/prep_data/tabular_data/02_feature_selection_tabular_data.ipynb b/prep_data/tabular_data/02_feature_selection_tabular_data.ipynb deleted file mode 100644 index 5f78cb98d5..0000000000 --- a/prep_data/tabular_data/02_feature_selection_tabular_data.ipynb +++ /dev/null @@ -1,354 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Feature Selection for Tabular Data\n", - "\n", - "The purpose of this notebook is to demonstrate how to select important features and prune unimportant ones prior to training our machine learning model. This is an important step that yields better prediction performance. \n", - "\n", - "#### Prerequisite\n", - "This notebook is a sequel to the [01_preprocessing_tabular_data.ipynb](01_preprocessing_tabular_data.ipynb) notebook. Before running this notebook, run [01_preprocessing_tabular_data.ipynb](01_preprocessing_tabular_data.ipynb) to preprocess the data used in this notebook. \n", - "\n", - "#### Notes\n", - "In this notebook, we use the sklearn framework for data partitionining and `storemagic` to share dataframes in [03_training_model_on_tabular_data.ipynb](03_training_model_on_tabular_data.ipynb). 
While we load data into memory here we do note that is it possible to skip this and load your partitioned data directly to an S3 bucket.\n", - "\n", - "#### Tabular Data Sets\n", - "* [california house data](https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html)\n", - "* [diabetes data ](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html)\n", - "\n", - "\n", - "#### Library Dependencies:\n", - "* sagemaker >= 2.0.0\n", - "* numpy \n", - "* pandas\n", - "* plotly\n", - "* sklearn \n", - "* matplotlib \n", - "* seaborn\n", - "* xgboost" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Setting up the notebook" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "import sys\n", - "import plotly.express as px\n", - "import plotly.offline as pyo\n", - "import plotly.graph_objs as go\n", - "import numpy as np\n", - "import pandas as pd\n", - "import matplotlib.pyplot as plt\n", - "import seaborn as sns\n", - "import pickle\n", - "import ast\n", - "from matplotlib import pyplot\n", - "\n", - "## sklearn dependencies\n", - "from sklearn.datasets import make_regression\n", - "\n", - "import sklearn.model_selection\n", - "from sklearn.neighbors import KNeighborsRegressor\n", - "from sklearn.inspection import permutation_importance\n", - "\n", - "!{sys.executable} -m pip install -qU 'xgboost'\n", - "import xgboost\n", - "from xgboost import XGBRegressor\n", - "\n", - "## SageMaker dependencies\n", - "import sagemaker\n", - "from sagemaker import get_execution_role\n", - "from sagemaker.inputs import TrainingInput\n", - "from sagemaker.image_uris import retrieve\n", - "\n", - "## This instantiates a SageMaker session that we will be operating in.\n", - "session = sagemaker.Session()\n", - "\n", - "## This object represents the IAM role that we are assigned.\n", - "role = sagemaker.get_execution_role()\n", - "print(role)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Step 1: Load Relevant Variables from preprocessing_tabular_data.ipynb (Required for this notebook)\n", - "Here we load in our training, test, and validation data sets. We preprocessed this data in the [01_preprocessing_tabular_data.ipynb](01_preprocessing_tabular_data.ipynb) and persisted it using storemagic." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Load relevant dataframes and variables from preprocessing_tabular_data.ipynb required for this notebook\n", - "%store -r X_train\n", - "%store -r X_test\n", - "%store -r X_val" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Step 2: Computing Feature Importance Scores to Select Features\n", - "We show two approaches for computing feature importance scores for each feature. We can rank each feature by their corresponding feature importance score in an effort to prune unimportant features which will yield a better performing model. \n", - "\n", - "The first approach, uses XGBoost and the second uses permutation feature importance." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Step 2a: Ranking features by Feature Importance using XGBoost\n", - "Here we use gradient boosting to extract importance scores for each feature. 
The importance scores calculated for each feature inform us how useful the feature was for constructing the boosted decision tree and can be ranked and compared to one another for feature selection." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "X_data, y_label = make_regression(\n", - " n_samples=X_train.shape[0], n_features=X_train.shape[1], n_informative=10, random_state=1\n", - ")\n", - "xgboost_model = XGBRegressor()\n", - "xgboost_model.fit(X_data, y_label)\n", - "\n", - "feature_importances_xgboost = xgboost_model.feature_importances_\n", - "for index, importance_score in enumerate(feature_importances_xgboost):\n", - " print(\"Feature: {}, Score: {}\".format(X_train.columns[index], importance_score))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def create_bar_plot(feature_importances, X_train):\n", - " \"\"\"\n", - " Create a bar plot of features against their corresponding feature importance score.\n", - " \"\"\"\n", - " x_indices = [_ for _ in range(len(feature_importances))]\n", - " plt.figure(figsize=(15, 5))\n", - " plt.bar(x_indices, feature_importances, color=\"blue\")\n", - " plt.xticks(x_indices, X_train.columns)\n", - " plt.xlabel(\"Feature\", fontsize=18)\n", - " plt.ylabel(\"Importance Score\", fontsize=18)\n", - " plt.title(\"Feature Importance Scores\", fontsize=18)\n", - " plt.show()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "create_bar_plot(feature_importances_xgboost, X_train)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In the following cell, we rank each feature based on corresponding importance score." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def show_ranked_feature_importance_list(scores, data):\n", - " \"\"\"\n", - " Prints the features ranked by their corresponding importance score.\n", - " \"\"\"\n", - " lst = list(zip(data.columns, scores))\n", - " ranked_lst = sorted(lst, key=lambda t: t[1], reverse=True)\n", - " print(pd.DataFrame(ranked_lst, columns=[\"Feature\", \"Importance Score\"]))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "show_ranked_feature_importance_list(feature_importances_xgboost, X_train)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Step 2b: Ranking features by Permutation Feature Importance using the Scikit-learn k-NN Algorithm\n", - "This approach is commonly used for selecting features in tabular data. We first randomly shuffle a single feature value and train a model. In this example we use the k-nearest-neighbours algorithm to train our model. The permutation feature importance score is the decrease in models score when this single feature value is shuffled. The decrease in the model score is representative of how dependant the model is on the feature. This technique can be computed many times with altering permutations per feature. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "X_data, y_label = make_regression(\n", - " n_samples=X_train.shape[0], n_features=X_train.shape[1], n_informative=10, random_state=1\n", - ")\n", - "k_nn_model = KNeighborsRegressor()\n", - "k_nn_model.fit(X_data, y_label)\n", - "feature_importances_permutations = permutation_importance(\n", - " k_nn_model, X_data, y_label, scoring=\"neg_mean_squared_error\"\n", - ").importances_mean\n", - "\n", - "for index, importance_score in enumerate(feature_importances_permutations):\n", - " print(\"Feature: {}, Score: {}\".format(X_train.columns[index], importance_score))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "create_bar_plot(feature_importances_permutations, X_train)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "show_ranked_feature_importance_list(feature_importances_permutations, X_train)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Step 3: Prune Unimportant Features\n", - "Thus far, we have discussed two common approaches for obtaining a ranked list of feature importance scores for each feature. From these lists we can infer unimportant features based on their importance scores and can eliminate them from our training, validation and test sets. For example, if feature A has a higher importance score then feature B's importance score, then this implies that feature A is more important then feature B and vice versa. We mention that both approaches constrain the removal of features to the dataset itself which is independent of the problem domain." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "After selecting your desired approach, move onto the next cell to prune features that have the importance score less than or equal to a threshold value. Depending on the approach of your choice and the distribution of scores, the `threshold` value may vary.\n", - "\n", - "In this example, we select the first approach with XGBoost and set the threshold value to 0.01." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "threshold = 0.01" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def remove_features(lst, data, threshold):\n", - " \"\"\"\n", - " Remove features found in lst from data iff its importance score is below threshold.\n", - " \"\"\"\n", - " features_to_remove = []\n", - " for index, pair in enumerate(list(zip(data.columns, lst))):\n", - " if pair[1] <= threshold:\n", - " features_to_remove.append(pair[0])\n", - "\n", - " if features_to_remove:\n", - " data.drop(features_to_remove, axis=1)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Assign `lst` to be `feature_importances_permutations` or `feature_importances_xgboost` if want to use the ranked list from that uses XGBoost or permutation feature importance respectively.\n", - "\n", - "We remove all features that are below `threshold` from our training data, `X_train`, validation data, `X_val` and testing data `X_test` respectively. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "remove_features(lst=feature_importances_xgboost, data=X_train, threshold=threshold)\n", - "remove_features(lst=feature_importances_xgboost, data=X_val, threshold=threshold)\n", - "remove_features(lst=feature_importances_xgboost, data=X_test, threshold=threshold)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Step 4: Store Variables using `storemagic`\n", - "After pruning the unimportant features, use `storemagic` to persist all relevant variables so that they can be reused in our next sequel notebook, [03_training_model_on_tabular_data.ipynb](03_training_model_on_tabular_data.ipynb), where we focus on model training. \n", - "\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Using storemagic we persist the variables below so we can access them in 03_training_model_on_tabular_data.ipynb\n", - "%store X_train\n", - "%store X_test\n", - "%store X_val" - ] - } - ], - "metadata": { - "instance_type": "ml.t3.medium", - "kernelspec": { - "display_name": "Python 3 (Data Science)", - "language": "python", - "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.7.6" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/prep_data/tabular_data/03_training_model_on_tabular_data.ipynb b/prep_data/tabular_data/03_training_model_on_tabular_data.ipynb deleted file mode 100644 index 603d54bef1..0000000000 --- a/prep_data/tabular_data/03_training_model_on_tabular_data.ipynb +++ /dev/null @@ -1,297 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Training a Model on Tabular Data using Amazon SageMaker\n", - "\n", - "The purpose of this notebook is to demonstrate how to train a machine learning model via Amazon SageMaker using tabular data. In this notebook you can train either an XGBoost or Linear Learner (regression) model on tabular data in Amazon SageMaker. \n", - "\n", - "#### Prerequisite\n", - "This notebook is a sequel to the [01_preprocessing_tabular_data.ipynb](01_preprocessing_tabular_data.ipynb) and [02_feature_selection_tabular_data.ipynb](02_feature_selection_tabular_data.ipynb) notebooks. 
Before running this notebook, run [01_preprocessing_tabular_data.ipynb](01_preprocessing_tabular_data.ipynb) to preprocess the data used in this notebook and then [02_feature_selection_tabular_data.ipynb](02_feature_selection_tabular_data.ipynb) that selects important features.\n", - "\n", - "#### Tabular Data Sets\n", - "* [california house data](https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html)\n", - "* [diabetes data ](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html)\n", - "\n", - "\n", - "#### Library Dependencies:\n", - "* sagemaker >= 2.0.0\n", - "* numpy \n", - "* pandas\n", - "* plotly\n", - "* sklearn \n", - "* matplotlib \n", - "* seaborn" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Setting up the notebook" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "import sys\n", - "import plotly.express as px\n", - "import plotly.offline as pyo\n", - "import plotly.graph_objs as go\n", - "import numpy as np\n", - "import pandas as pd\n", - "import matplotlib.pyplot as plt\n", - "import seaborn as sns\n", - "import pickle\n", - "\n", - "from sklearn.datasets import *\n", - "import ast\n", - "import sklearn.model_selection\n", - "\n", - "## SageMaker dependencies\n", - "import sagemaker\n", - "from sagemaker import get_execution_role\n", - "from sagemaker.inputs import TrainingInput\n", - "from sagemaker.image_uris import retrieve\n", - "\n", - "## This instantiates a SageMaker session that we will be operating in.\n", - "session = sagemaker.Session()\n", - "\n", - "## This object represents the IAM role that we are assigned.\n", - "role = sagemaker.get_execution_role()\n", - "print(role)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Step 1: Load Relevant Variables from preprocessing_tabular_data.ipynb and feature_engineering_tabular_data.ipynb (Required for this notebook)\n", - "Here we load in our training, test, and validation data sets. We preprocessed this data in the [01_preprocessing_tabular_data.ipynb](01_preprocessing_tabular_data.ipynb) and did feature selection [02_feature_selection_tabular_data.ipynb](02_feature_selection_tabular_data.ipynb) in persisted it using storemagic." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Load relevant dataframes and variables from 01_preprocessing_tabular_data.ipynb required for this notebook\n", - "%store -r X_train\n", - "%store -r X_test\n", - "%store -r X_val\n", - "%store -r Y_train\n", - "%store -r Y_test\n", - "%store -r Y_val\n", - "%store -r choosen_data_set\n", - "%store -r sagemaker_version" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Step 2: Uploading the data to S3\n", - "Here we upload our training and validation data to an S3 bucket. This is a critical step because we will be specifying this S3 bucket's location during the training step. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "data_dir = \"../data/\" + choosen_data_set\n", - "if not os.path.exists(data_dir):\n", - " os.makedirs(data_dir)\n", - "\n", - "prefix = choosen_data_set + \"-deploy-hl\"\n", - "pd.concat([Y_train, X_train], axis=1).to_csv(\n", - " os.path.join(data_dir, \"train.csv\"), header=False, index=False\n", - ")\n", - "pd.concat([Y_val, X_val], axis=1).to_csv(\n", - " os.path.join(data_dir, \"validation.csv\"), header=False, index=False\n", - ")\n", - "\n", - "val_location = session.upload_data(os.path.join(data_dir, \"validation.csv\"), key_prefix=prefix)\n", - "train_location = session.upload_data(os.path.join(data_dir, \"train.csv\"), key_prefix=prefix)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Here we have a pointer to our training and validation data sets stored in an S3 bucket. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "s3_input_train = TrainingInput(s3_data=train_location, content_type=\"text/csv\")\n", - "s3_input_validation = TrainingInput(s3_data=val_location, content_type=\"text/csv\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Step 3: Select and Train the Model\n", - "Select between the XGBoost or Linear Learner algorithm by assigning model_selected to either 'xgboost' or 'linear-learner'." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Select between xgboost or linear-learner (regression)\n", - "models = [\"xgboost\", \"linear-learner\"]\n", - "model_selected = \"xgboost\"\n", - "assert model_selected in models\n", - "print(\"Selected model:\", model_selected)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Here we retrieve our container and instantiate our model object using the Estimator class." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "container = retrieve(framework=model_selected, region=session.boto_region_name, version=\"latest\")\n", - "\n", - "model = sagemaker.estimator.Estimator(\n", - " container,\n", - " role,\n", - " instance_count=1,\n", - " instance_type=\"ml.m4.xlarge\",\n", - " output_path=\"s3://{}/{}/output\".format(session.default_bucket(), prefix),\n", - " sagemaker_session=session,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Step 4: Set hyperparameters\n", - "Thus far, we have instantiated our model with our container and uploaded our preprocessed data to our S3 bucket. \n", - "Next, we set our hyperparameters for our choosen model. We note that both [XGBoost](https://docs.aws.amazon.com/en_us/sagemaker/latest/dg/xgboost_hyperparameters.html) and [linear learner](https://docs.aws.amazon.com/en_us/sagemaker/latest/dg/ll_hyperparameters.html) have different hyperparameters that can be set." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "if model_selected == \"xgboost\":\n", - " model.set_hyperparameters(\n", - " max_depth=5,\n", - " eta=0.2,\n", - " gamma=4,\n", - " min_child_weight=6,\n", - " subsample=0.8,\n", - " objective=\"reg:linear\",\n", - " early_stopping_rounds=10,\n", - " num_round=1,\n", - " )\n", - "\n", - "if model_selected == \"linear-learner\":\n", - " model.set_hyperparameters(\n", - " feature_dim=X_train.shape[1], predictor_type=\"regressor\", mini_batch_size=100\n", - " )" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Our estimator object is instantiated with hyperparameter settings, now it is time to train! To do this we specify our S3 bucket's location that is storing our training data and validation data and pass it via a dictionary to the fit method. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "model.fit({\"train\": s3_input_train, \"validation\": s3_input_validation}, wait=True)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Step 6: Save Trained Model\n", - "The model has been trained. Below we select and download the model we just trained above. \n", - "\n", - "To download the last trained model we assign the `s3_uri` parameter to be `model.model_data`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "sagemaker.s3.S3Downloader.download(s3_uri=model.model_data, local_path=\"./\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Below we safe guard your kernel environment by installing your original sagemaker version." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "if sagemaker_version != None:\n", - " !{sys.executable} -m pip install -qU sagemaker_version" - ] - } - ], - "metadata": { - "instance_type": "ml.t3.medium", - "kernelspec": { - "display_name": "Python 3 (Data Science)", - "language": "python", - "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.7.6" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/prep_data/tabular_data/train_featurize_train_tabular_data.ipynb b/prep_data/tabular_data/train_featurize_train_tabular_data.ipynb new file mode 100644 index 0000000000..d0922573cf --- /dev/null +++ b/prep_data/tabular_data/train_featurize_train_tabular_data.ipynb @@ -0,0 +1,661 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Preprocessing Tabular Data\n", + "\n", + "The purpose of this notebook is to demonstrate how to preprocess tabular data for training a machine learning model via Amazon SageMaker. In this notebook we focus on preprocessing our tabular data. In a sequel notebook, [02_feature_selection_tabular_data.ipynb](02_feature_selection_tabular_data.ipynb) we use our preprocessed tabular data to select important features and prune unimportant ones out. In our final sequel notebook, [03_training_model_on_tabular_data.ipynb](03_training_model_on_tabular_data.ipynb) we use our selected features to train a machine learning model. 
We showcase how to preprocess two different tabular data sets. \n", + "\n", + "\n", + "#### Notes\n", + "In this notebook, we use the sklearn framework for data partitioning and `storemagic` to share dataframes in [02_feature_selection_tabular_data.ipynb](02_feature_selection_tabular_data.ipynb) and [03_training_model_on_tabular_data.ipynb](03_training_model_on_tabular_data.ipynb). While we load data into memory here, we note that it is possible to skip this step and load your partitioned data directly to an S3 bucket.\n", + "\n", + "#### Tabular Data Sets\n", + "* [california house data](https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html)\n", + "* [diabetes data ](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html)\n", + "\n", + "\n", + "#### Library Dependencies:\n", + "* sagemaker>=2.15.0\n", + "* numpy \n", + "* pandas\n", + "* plotly\n", + "* sklearn \n", + "* matplotlib \n", + "* seaborn\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Setting up the notebook" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import sys\n", + "import subprocess\n", + "import pkg_resources\n", + "\n", + "\n", + "def get_sagemaker_version():\n", + " \"Return the version of 'sagemaker' in your kernel or -1 if 'sagemaker' is not installed\"\n", + " for i in pkg_resources.working_set:\n", + " if i.key == \"sagemaker\":\n", + " return \"%s==%s\" % (i.key, i.version)\n", + " return -1\n", + "\n", + "\n", + "# Store original 'sagemaker' version\n", + "sagemaker_version = get_sagemaker_version()\n", + "\n", + "# Install any missing dependencies\n", + "!{sys.executable} -m pip install -qU 'plotly' 'sagemaker>=2.15.0'\n", + "\n", + "import numpy as np\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "from matplotlib import pyplot\n", + "import seaborn as sns\n", + "import pickle\n", + "import ast\n", + "import plotly.express as px\n", + "import plotly.offline as pyo\n", + "import plotly.graph_objs as go\n", + "\n", + "from sklearn.datasets import *\n", + "import sklearn.model_selection\n", + "from sklearn.datasets import make_regression\n", + "from sklearn.neighbors import KNeighborsRegressor\n", + "from sklearn.inspection import permutation_importance\n", + "\n", + "!{sys.executable} -m pip install -qU 'xgboost'\n", + "import xgboost\n", + "from xgboost import XGBRegressor\n", + "\n", + "# SageMaker dependencies\n", + "import sagemaker\n", + "from sagemaker import get_execution_role\n", + "from sagemaker.inputs import TrainingInput\n", + "from sagemaker.image_uris import retrieve\n", + "\n", + "# This instantiates a SageMaker session that we will be operating in.\n", + "session = sagemaker.Session()\n", + "\n", + "# This object represents the IAM role that we are assigned.\n", + "role = sagemaker.get_execution_role()\n", + "print(role)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 1: Select and Download Data\n", + "\n", + "Here you can select the tabular data set of your choice to preprocess."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data_sets = {\"diabetes\": \"load_diabetes()\", \"california\": \"fetch_california_housing()\"}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To select a particular dataset, assign **choosen_data_set** below to be either 'diabetes' or 'california', where each name corresponds to its respective dataset.\n", + "\n", + "* 'california' : [california house data](https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html)\n", + "* 'diabetes' : [diabetes data ](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Change choosen_data_set variable to one of the data sets above.\n", + "choosen_data_set = \"california\"\n", + "assert choosen_data_set in data_sets.keys()\n", + "print(\"I selected the '{}' dataset!\".format(choosen_data_set))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 2: Describe Feature Information \n", + "\n", + "Here we load the selected data set into dataframes and describe its features and labels." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data_set = eval(data_sets[choosen_data_set])\n", + "\n", + "X = pd.DataFrame(data_set.data, columns=data_set.feature_names)\n", + "Y = pd.DataFrame(data_set.target)\n", + "\n", + "print(\"Features:\", list(X.columns))\n", + "print(\"Dataset shape:\", X.shape)\n", + "print(\"Dataset Type:\", type(X))\n", + "print(\"Label set shape:\", Y.shape)\n", + "print(\"Label set Type:\", type(Y))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### We describe both our training data inputs X and outputs Y by computing the count, mean, std, min, and percentiles. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "display(X.describe())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "display(Y.describe())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 3: Plot Feature Correlation\n", + "Here we show a heatmap and clustergrid across all our features. These visualizations help us analyze correlated features and are particularly important if we want to remove redundant features. The heatmap computes a correlation score for each pair of features and colors each cell by this score. The clustergrid is similar; however, it presents feature correlations hierarchically.\n", + "\n", + "**Note**: For the purposes of this notebook we do not remove any features, but based on the findings from these plots you may choose to remove highly correlated features, and this is the point at which to do so. 
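If you did want to act on the correlation plots at this stage, a minimal, purely illustrative sketch of dropping one column from each highly correlated pair might look like the following; the 0.9 cutoff and the choice to keep the first feature of each pair are assumptions, not part of this notebook:

```python
# Illustrative sketch only: drop one column from each pair of features whose
# absolute correlation exceeds 0.9 (the cutoff is an arbitrary assumption).
corr_abs = X.corr().abs()
# Keep only the upper triangle so each pair is considered once.
upper = corr_abs.where(np.triu(np.ones(corr_abs.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X = X.drop(columns=to_drop)
print("Dropped features:", to_drop)
```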
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plt.figure(figsize=(14, 12))\n", + "cor = X.corr()\n", + "sns.heatmap(cor, annot=True, cmap=sns.diverging_palette(20, 220, n=200))\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "cluster_map = sns.clustermap(cor, cmap=sns.diverging_palette(20, 220, n=200), linewidths=0.1)\n", + "plt.setp(cluster_map.ax_heatmap.yaxis.get_majorticklabels(), rotation=0)\n", + "cluster_map" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 4: Partition Dataset into Train, Test, Validation Splits\n", + "Here using the sklearn framework we partition our selected dataset into Train, Test and Validation splits. We choose a partition size of 1/3 and then further split the training set into 2/3 training and 1/3 validation set. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# We partition the dataset into 2/3 training and 1/3 test set.\n", + "X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X, Y, test_size=0.33)\n", + "\n", + "# We further split the training set into a validation set i.e., 2/3 training set, and 1/3 validation set\n", + "X_train, X_val, Y_train, Y_val = sklearn.model_selection.train_test_split(\n", + " X_train, Y_train, test_size=0.33\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Feature Selection for Tabular Data\n", + "\n", + "The purpose of this notebook is to demonstrate how to select important features and prune unimportant ones prior to training our machine learning model. This is an important step that yields better prediction performance. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 1: Computing Feature Importance Scores to Select Features\n", + "We show two approaches for computing feature importance scores for each feature. We can rank each feature by their corresponding feature importance score in an effort to prune unimportant features which will yield a better performing model. \n", + "\n", + "The first approach, uses XGBoost and the second uses permutation feature importance." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 1a: Ranking features by Feature Importance using XGBoost\n", + "Here we use gradient boosting to extract importance scores for each feature. The importance scores calculated for each feature inform us how useful the feature was for constructing the boosted decision tree and can be ranked and compared to one another for feature selection." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "X_data, y_label = make_regression(\n", + " n_samples=X_train.shape[0], n_features=X_train.shape[1], n_informative=10, random_state=1\n", + ")\n", + "xgboost_model = XGBRegressor()\n", + "xgboost_model.fit(X_data, y_label)\n", + "\n", + "feature_importances_xgboost = xgboost_model.feature_importances_\n", + "for index, importance_score in enumerate(feature_importances_xgboost):\n", + " print(\"Feature: {}, Score: {}\".format(X_train.columns[index], importance_score))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def create_bar_plot(feature_importances, X_train):\n", + " \"\"\"\n", + " Create a bar plot of features against their corresponding feature importance score.\n", + " \"\"\"\n", + " x_indices = [_ for _ in range(len(feature_importances))]\n", + " plt.figure(figsize=(15, 5))\n", + " plt.bar(x_indices, feature_importances, color=\"blue\")\n", + " plt.xticks(x_indices, X_train.columns)\n", + " plt.xlabel(\"Feature\", fontsize=18)\n", + " plt.ylabel(\"Importance Score\", fontsize=18)\n", + " plt.title(\"Feature Importance Scores\", fontsize=18)\n", + " plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "create_bar_plot(feature_importances_xgboost, X_train)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the following cell, we rank each feature based on its corresponding importance score." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def show_ranked_feature_importance_list(scores, data):\n", + " \"\"\"\n", + " Prints the features ranked by their corresponding importance score.\n", + " \"\"\"\n", + " lst = list(zip(data.columns, scores))\n", + " ranked_lst = sorted(lst, key=lambda t: t[1], reverse=True)\n", + " print(pd.DataFrame(ranked_lst, columns=[\"Feature\", \"Importance Score\"]))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "show_ranked_feature_importance_list(feature_importances_xgboost, X_train)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 1b: Ranking features by Permutation Feature Importance using the Scikit-learn k-NN Algorithm\n", + "This approach is commonly used for selecting features in tabular data. We first train a model, in this example using the k-nearest-neighbours algorithm, and then randomly shuffle the values of a single feature. The permutation feature importance score is the decrease in the model's score when that single feature's values are shuffled, and it reflects how dependent the model is on the feature. This technique can be repeated many times with different permutations of each feature. 
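scikit-learn's `permutation_importance` controls how many shuffles are averaged per feature through its `n_repeats` argument (the default is 5). A minimal sketch, assuming the `k_nn_model`, `X_data`, and `y_label` created in the next cell, and an arbitrary choice of 10 repeats:

```python
# Sketch: average the importance over several independent shuffles per feature.
# n_repeats=10 and random_state=1 are illustrative choices.
result = permutation_importance(
    k_nn_model, X_data, y_label, scoring="neg_mean_squared_error",
    n_repeats=10, random_state=1
)
print("Mean importances:", result.importances_mean)
print("Std of importances:", result.importances_std)
```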
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "X_data, y_label = make_regression(\n", + " n_samples=X_train.shape[0], n_features=X_train.shape[1], n_informative=10, random_state=1\n", + ")\n", + "k_nn_model = KNeighborsRegressor()\n", + "k_nn_model.fit(X_data, y_label)\n", + "feature_importances_permutations = permutation_importance(\n", + " k_nn_model, X_data, y_label, scoring=\"neg_mean_squared_error\"\n", + ").importances_mean\n", + "\n", + "for index, importance_score in enumerate(feature_importances_permutations):\n", + " print(\"Feature: {}, Score: {}\".format(X_train.columns[index], importance_score))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "create_bar_plot(feature_importances_permutations, X_train)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "show_ranked_feature_importance_list(feature_importances_permutations, X_train)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 2: Prune Unimportant Features\n", + "Thus far, we have discussed two common approaches for obtaining a ranked list of feature importance scores for each feature. From these lists we can infer unimportant features based on their importance scores and can eliminate them from our training, validation and test sets. For example, if feature A has a higher importance score then feature B's importance score, then this implies that feature A is more important then feature B and vice versa. We mention that both approaches constrain the removal of features to the dataset itself which is independent of the problem domain." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "After selecting your desired approach, move onto the next cell to prune features that have the importance score less than or equal to a threshold value. Depending on the approach of your choice and the distribution of scores, the `threshold` value may vary.\n", + "\n", + "In this example, we select the first approach with XGBoost and set the threshold value to 0.01." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "threshold = 0.01" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def remove_features(lst, data, threshold):\n", + " \"\"\"\n", + " Remove features found in lst from data iff its importance score is below threshold.\n", + " \"\"\"\n", + " features_to_remove = []\n", + " for index, pair in enumerate(list(zip(data.columns, lst))):\n", + " if pair[1] <= threshold:\n", + " features_to_remove.append(pair[0])\n", + "\n", + " if features_to_remove:\n", + " data.drop(features_to_remove, axis=1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Assign `lst` to be `feature_importances_permutations` or `feature_importances_xgboost` if want to use the ranked list from that uses XGBoost or permutation feature importance respectively.\n", + "\n", + "We remove all features that are below `threshold` from our training data, `X_train`, validation data, `X_val` and testing data `X_test` respectively. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "remove_features(lst=feature_importances_xgboost, data=X_train, threshold=threshold)\n", + "remove_features(lst=feature_importances_xgboost, data=X_val, threshold=threshold)\n", + "remove_features(lst=feature_importances_xgboost, data=X_test, threshold=threshold)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Training a Model on Tabular Data using Amazon SageMaker\n", + "\n", + "The purpose of this notebook is to demonstrate how to train a machine learning model via Amazon SageMaker using tabular data. In this notebook you can train either an XGBoost or Linear Learner (regression) model on tabular data in Amazon SageMaker. \n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 1: Uploading the data to S3\n", + "Here we upload our training and validation data to an S3 bucket. This is a critical step because we will be specifying this S3 bucket's location during the training step. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data_dir = \"../data/\" + choosen_data_set\n", + "if not os.path.exists(data_dir):\n", + " os.makedirs(data_dir)\n", + "\n", + "prefix = choosen_data_set + \"-deploy-hl\"\n", + "pd.concat([Y_train, X_train], axis=1).to_csv(\n", + " os.path.join(data_dir, \"train.csv\"), header=False, index=False\n", + ")\n", + "pd.concat([Y_val, X_val], axis=1).to_csv(\n", + " os.path.join(data_dir, \"validation.csv\"), header=False, index=False\n", + ")\n", + "\n", + "val_location = session.upload_data(os.path.join(data_dir, \"validation.csv\"), key_prefix=prefix)\n", + "train_location = session.upload_data(os.path.join(data_dir, \"train.csv\"), key_prefix=prefix)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here we have a pointer to our training and validation data sets stored in an S3 bucket. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "s3_input_train = TrainingInput(s3_data=train_location, content_type=\"text/csv\")\n", + "s3_input_validation = TrainingInput(s3_data=val_location, content_type=\"text/csv\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 2: Select and Train the Model\n", + "Select between the XGBoost or Linear Learner algorithm by assigning model_selected to either 'xgboost' or 'linear-learner'." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Select between xgboost or linear-learner (regression)\n", + "models = [\"xgboost\", \"linear-learner\"]\n", + "model_selected = \"xgboost\"\n", + "assert model_selected in models\n", + "print(\"Selected model:\", model_selected)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here we retrieve our container and instantiate our model object using the Estimator class." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "container = retrieve(framework=model_selected, region=session.boto_region_name, version=\"latest\")\n", + "\n", + "model = sagemaker.estimator.Estimator(\n", + " container,\n", + " role,\n", + " instance_count=1,\n", + " instance_type=\"ml.m4.xlarge\",\n", + " output_path=\"s3://{}/{}/output\".format(session.default_bucket(), prefix),\n", + " sagemaker_session=session,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 3: Set hyperparameters\n", + "Thus far, we have instantiated our model with our container and uploaded our preprocessed data to our S3 bucket. \n", + "Next, we set our hyperparameters for our choosen model. We note that both [XGBoost](https://docs.aws.amazon.com/en_us/sagemaker/latest/dg/xgboost_hyperparameters.html) and [linear learner](https://docs.aws.amazon.com/en_us/sagemaker/latest/dg/ll_hyperparameters.html) have different hyperparameters that can be set." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "if model_selected == \"xgboost\":\n", + " model.set_hyperparameters(\n", + " max_depth=5,\n", + " eta=0.2,\n", + " gamma=4,\n", + " min_child_weight=6,\n", + " subsample=0.8,\n", + " objective=\"reg:linear\",\n", + " early_stopping_rounds=10,\n", + " num_round=1,\n", + " )\n", + "\n", + "if model_selected == \"linear-learner\":\n", + " model.set_hyperparameters(\n", + " feature_dim=X_train.shape[1], predictor_type=\"regressor\", mini_batch_size=100\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Our estimator object is instantiated with hyperparameter settings, now it is time to train! To do this we specify our S3 bucket's location that is storing our training data and validation data and pass it via a dictionary to the fit method. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "model.fit({\"train\": s3_input_train, \"validation\": s3_input_validation}, wait=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 4: Save Trained Model\n", + "The model has been trained. Below we select and download the model we just trained above. \n", + "\n", + "To download the last trained model we assign the `s3_uri` parameter to be `model.model_data`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sagemaker.s3.S3Downloader.download(s3_uri=model.model_data, local_path=\"./\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "instance_type": "ml.t3.medium", + "kernelspec": { + "display_name": "conda_python3", + "language": "python", + "name": "conda_python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.13" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}