diff --git a/prep_data/tabular_data/train_featurize_train_tabular_data.ipynb b/prep_data/tabular_data/train_featurize_train_tabular_data.ipynb
index d0922573cf..78139639dd 100644
--- a/prep_data/tabular_data/train_featurize_train_tabular_data.ipynb
+++ b/prep_data/tabular_data/train_featurize_train_tabular_data.ipynb
@@ -4,20 +4,23 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Preprocessing Tabular Data\n",
+    "# Preprocessing Tabular Data\n",
     "\n",
-    "The purpose of this notebook is to demonstrate how to preprocess tabular data for training a machine learning model via Amazon SageMaker. In this notebook we focus on preprocessing our tabular data. In a sequel notebook, [02_feature_selection_tabular_data.ipynb](02_feature_selection_tabular_data.ipynb) we use our preprocessed tabular data to select important features and prune unimportant ones out. In our final sequel notebook, [03_training_model_on_tabular_data.ipynb](03_training_model_on_tabular_data.ipynb) we use our selected features to train a machine learning model. We showcase how to preprocess 2 different tabular data sets. \n",
+    "In this notebook, we focus on preprocessing tabular data. Then, we use our preprocessed tabular data to select important features and prune unimportant ones out. Finally, we use our selected features to train a machine learning model. We showcase how to preprocess 2 different tabular data sets. \n",
     "\n",
+    "## Contents\n",
+    "1. [Part 1: Download and Process the Dataset](#Part-1:-Download-and-Process-the-Dataset)\n",
+    "1. [Part 2: Feature Selection for Tabular Data](#Part-2:-Feature-Selection-for-Tabular-Data)\n",
+    "1. [Part 3: Training a Model on Tabular Data using Amazon SageMaker](#Part-3:-Training-a-Model-on-Tabular-Data-using-Amazon-SageMaker)\n",
     "\n",
-    "#### Notes\n",
-    "In this notebook, we use the sklearn framework for data partitionining and `storemagic` to share dataframes in [02_feature_selection_tabular_data.ipynb](02_feature_selection_tabular_data.ipynb) and [03_training_model_on_tabular_data.ipynb](03_training_model_on_tabular_data.ipynb). While we load data into memory here we do note that is it possible to skip this and load your partitioned data directly to an S3 bucket.\n",
+    "## Dataset and Package Dependencies\n",
     "\n",
-    "#### Tabular Data Sets\n",
-    "* [california house data](https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html)\n",
-    "* [diabetes data ](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html)\n",
+    "### Tabular Data Sets\n",
+    "* [California House Dataset](https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html)\n",
+    "* [Diabetes Dataset](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html)\n",
     "\n",
     "\n",
-    "#### Library Dependencies:\n",
+    "### Library Dependencies:\n",
     "* sagemaker>=2.15.0\n",
     "* numpy \n",
     "* pandas\n",
@@ -31,7 +34,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### Setting up the notebook"
+    "## Setting up the notebook"
    ]
   },
   {
@@ -96,6 +99,15 @@
     "print(role)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Part 1: Download and Process the Dataset\n",
+    "\n",
+    "This section demonstrates how to preprocess tabular data for training a machine learning model via Amazon SageMaker"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -248,9 +260,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Feature Selection for Tabular Data\n",
+    "## Part 2: Feature Selection for Tabular Data\n",
     "\n",
-    "The purpose of this notebook is to demonstrate how to select important features and prune unimportant ones prior to training our machine learning model. This is an important step that yields better prediction performance. "
+    "This section demonstrates how to select important features and prune unimportant ones prior to training our machine learning model. This is an important step that yields better prediction performance. "
    ]
   },
   {
@@ -462,9 +474,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Training a Model on Tabular Data using Amazon SageMaker\n",
+    "## Part 3: Training a Model on Tabular Data using Amazon SageMaker\n",
     "\n",
-    "The purpose of this notebook is to demonstrate how to train a machine learning model via Amazon SageMaker using tabular data. In this notebook you can train either an XGBoost or Linear Learner (regression) model on tabular data in Amazon SageMaker. \n"
+    "This section demonstrates how to train a machine learning model via Amazon SageMaker using tabular data. You can train either an XGBoost or Linear Learner (regression) model on tabular data in Amazon SageMaker. \n"
    ]
   },
   {