Integrate SageMaker Automatic Model Tuning (HPO) with one XGBoost Abalone notebook. (aws#3623)

* Integrate SageMaker Automatic Model Tuning (HPO) with one XGBoost Abalone notebook.

* Addressed comments for HPO integration.

Co-authored-by: Aaron Markham <[email protected]>
2 people authored and atqy committed Oct 28, 2022
1 parent 8aedf58 commit 9afa94a
Showing 1 changed file with 175 additions and 17 deletions.
192 changes: 175 additions & 17 deletions introduction_to_amazon_algorithms/xgboost_abalone/xgboost_abalone.ipynb
@@ -2,7 +2,9 @@
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"tags": []
},
"source": [
"# Regression with Amazon SageMaker XGBoost algorithm\n",
"_**Single machine training for regression with Amazon SageMaker XGBoost algorithm**_\n",
@@ -102,7 +104,13 @@
"source": [
"## Training the XGBoost model\n",
"\n",
"After setting training parameters, we kick off training, and poll for status until training is completed, which in this example, takes between 5 and 6 minutes."
"After setting training parameters, we kick off training, and poll for status until training is completed, which in this example, takes between 5 and 6 minutes.\n",
"\n",
"Training can be done by either calling SageMaker Training with a set of hyperparameters values to train with, or by leveraging SageMaker Automatic Model Tuning ([AMT](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html)). AMT, also known as hyperparameter tuning (HPO), finds the best version of a model by running many training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in a model that performs the best, as measured by a metric that you choose.\n",
"\n",
"In this notebook, both methods are used for demonstration purposes, but the model that the HPO job creates is the one that is eventually hosted. You can instead choose to deploy the model created by the standalone training job by changing the below variable `deploy_amt_model` to False.\n",
"\n",
"### Initiliazing common variables "
]
},
{
@@ -111,7 +119,16 @@
"metadata": {},
"outputs": [],
"source": [
"container = sagemaker.image_uris.retrieve(\"xgboost\", region, \"1.5-1\")"
"container = sagemaker.image_uris.retrieve(\"xgboost\", region, \"1.5-1\")\n",
"client = boto3.client(\"sagemaker\", region_name=region)\n",
"deploy_amt_model = True"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Training with SageMaker Training"
]
},
{
@@ -123,9 +140,9 @@
"%%time\n",
"import boto3\n",
"from time import gmtime, strftime\n",
"import time\n",
"\n",
"job_name = f\"DEMO-xgboost-regression-{strftime('%Y-%m-%d-%H-%M-%S', gmtime())}\"\n",
"print(\"Training job\", job_name)\n",
"training_job_name = f\"DEMO-xgboost-regression-{strftime('%Y-%m-%d-%H-%M-%S', gmtime())}\"\n",
"\n",
"# Ensure that the training and validation data folders generated above are reflected in the \"InputDataConfig\" parameter below.\n",
"\n",
@@ -134,7 +151,7 @@
" \"RoleArn\": role,\n",
" \"OutputDataConfig\": {\"S3OutputPath\": f\"{output_bucket_path}/{output_prefix}/single-xgboost\"},\n",
" \"ResourceConfig\": {\"InstanceCount\": 1, \"InstanceType\": \"ml.m5.2xlarge\", \"VolumeSizeInGB\": 5},\n",
" \"TrainingJobName\": job_name,\n",
" \"TrainingJobName\": training_job_name,\n",
" \"HyperParameters\": {\n",
" \"max_depth\": \"5\",\n",
" \"eta\": \"0.2\",\n",
@@ -174,17 +191,13 @@
" ],\n",
"}\n",
"\n",
"\n",
"client = boto3.client(\"sagemaker\", region_name=region)\n",
"print(f\"Creating a training job with name: {training_job_name}. It will take between 5 and 6 minutes to complete.\")\n",
"client.create_training_job(**create_training_params)\n",
"\n",
"import time\n",
"\n",
"status = client.describe_training_job(TrainingJobName=job_name)[\"TrainingJobStatus\"]\n",
"status = client.describe_training_job(TrainingJobName=training_job_name)[\"TrainingJobStatus\"]\n",
"print(status)\n",
"while status != \"Completed\" and status != \"Failed\":\n",
" time.sleep(60)\n",
" status = client.describe_training_job(TrainingJobName=job_name)[\"TrainingJobStatus\"]\n",
" status = client.describe_training_job(TrainingJobName=training_job_name)[\"TrainingJobStatus\"]\n",
" print(status)"
]
},
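As an aside, boto3 also provides a built-in waiter that can replace this manual polling loop; a minimal sketch, reusing the `client` and `training_job_name` defined above:

```python
# A sketch: boto3's built-in waiter polls DescribeTrainingJob until the job
# reaches a terminal state (Completed, Stopped, or Failed), replacing the
# manual sleep loop above.
waiter = client.get_waiter("training_job_completed_or_stopped")
waiter.wait(TrainingJobName=training_job_name)
```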
@@ -195,6 +208,146 @@
"Note that the \"validation\" channel has been initialized too. The SageMaker XGBoost algorithm actually calculates RMSE and writes it to the CloudWatch logs on the data passed to the \"validation\" channel."
]
},
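The final value of that metric is also returned by `DescribeTrainingJob`, so it can be checked without opening CloudWatch; a minimal sketch, again reusing `client` and `training_job_name`:

```python
# A sketch: the final metrics SageMaker emitted for the training job are
# returned in FinalMetricDataList by DescribeTrainingJob.
metrics = client.describe_training_job(TrainingJobName=training_job_name)[
    "FinalMetricDataList"
]
for metric in metrics:
    if metric["MetricName"] == "validation:rmse":
        print(f"Final validation RMSE: {metric['Value']}")
```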
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Tuning with SageMaker Automatic Model Tuning\n",
"\n",
"To create a tuning job using the AWS SageMaker Automatic Model Tuning API, you need to define 3 attributes. \n",
"\n",
"1. the tuning job name (string)\n",
"2. the tuning job config (to specify settings for the hyperparameter tuning job - JSON object)\n",
"3. training job definition (to configure the training jobs that the tuning job launches - JSON object).\n",
"\n",
"To learn more about that, refer to the [Configure and Launch a Hyperparameter Tuning Job](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-ex-tuning-job.html) documentation.\n",
"\n",
"Note that the tuning job will 12-17 minutes to complete."
]
},
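For comparison, the same tuning job can also be expressed with the higher-level SageMaker Python SDK; a rough sketch, assuming the `container`, `role`, `output_bucket_path`, and `output_prefix` variables defined earlier (the notebook itself uses the low-level boto3 API shown in the next cell):

```python
# A rough sketch of the same tuning job via the SageMaker Python SDK; the
# notebook itself uses the low-level boto3 API shown in the next cell.
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    output_path=f"{output_bucket_path}/{output_prefix}/single-xgboost",
    sagemaker_session=sagemaker.Session(),
)
# Static hyperparameters are fixed across all training jobs the tuner launches.
estimator.set_hyperparameters(objective="reg:linear", verbosity="2")

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:rmse",
    objective_type="Minimize",
    # A subset of the ranges used in the boto3 config below.
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.1, 0.5),
        "max_depth": IntegerParameter(0, 10),
        "num_round": IntegerParameter(1, 4000),
    },
    max_jobs=6,
    max_parallel_jobs=2,  # strategy defaults to Bayesian
)

tuner.fit(
    {
        "train": TrainingInput(
            f"{output_bucket_path}/{output_prefix}/train", content_type="libsvm"
        ),
        "validation": TrainingInput(
            f"{output_bucket_path}/{output_prefix}/validation", content_type="libsvm"
        ),
    }
)
```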
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from time import gmtime, strftime, sleep\n",
"\n",
"tuning_job_name = \"DEMO-xgboost-reg-\" + strftime(\"%d-%H-%M-%S\", gmtime())\n",
"\n",
"tuning_job_config = {\n",
" \"ParameterRanges\": {\n",
" \"CategoricalParameterRanges\": [],\n",
" \"ContinuousParameterRanges\": [\n",
" {\n",
" \"MaxValue\": \"0.5\",\n",
" \"MinValue\": \"0.1\",\n",
" \"Name\": \"eta\",\n",
" },\n",
" {\n",
" \"MaxValue\": \"5\",\n",
" \"MinValue\": \"0\",\n",
" \"Name\": \"gamma\",\n",
" },\n",
" {\n",
" \"MaxValue\": \"120\",\n",
" \"MinValue\": \"0\",\n",
" \"Name\": \"min_child_weight\",\n",
" },\n",
" {\n",
" \"MaxValue\": \"1\",\n",
" \"MinValue\": \"0.5\",\n",
" \"Name\": \"subsample\",\n",
" },\n",
" {\n",
" \"MaxValue\": \"2\",\n",
" \"MinValue\": \"0\",\n",
" \"Name\": \"alpha\",\n",
" },\n",
" ],\n",
" \"IntegerParameterRanges\": [\n",
" {\n",
" \"MaxValue\": \"10\",\n",
" \"MinValue\": \"0\",\n",
" \"Name\": \"max_depth\",\n",
" },\n",
" {\n",
" \"MaxValue\": \"4000\",\n",
" \"MinValue\": \"1\",\n",
" \"Name\": \"num_round\",\n",
" }\n",
" ],\n",
" },\n",
" # SageMaker sets the following default limits for resources used by automatic model tuning:\n",
" # https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-limits.html\n",
" \"ResourceLimits\": {\n",
" # Increase the max number of training jobs for increased accuracy (and training time).\n",
" \"MaxNumberOfTrainingJobs\": 6, \n",
" # Change parallel training jobs run by AMT to reduce total training time. Constrained by your account limits.\n",
" # if max_jobs=max_parallel_jobs then Bayesian search turns to Random.\n",
" \"MaxParallelTrainingJobs\": 2\n",
" },\n",
" \"Strategy\": \"Bayesian\",\n",
" \"HyperParameterTuningJobObjective\": {\"MetricName\": \"validation:rmse\", \"Type\": \"Minimize\"},\n",
"}\n",
"\n",
"training_job_definition = {\n",
" \"AlgorithmSpecification\": {\"TrainingImage\": container, \"TrainingInputMode\": \"File\"},\n",
" \"InputDataConfig\": [\n",
" {\n",
" \"ChannelName\": \"train\",\n",
" \"DataSource\": {\n",
" \"S3DataSource\": {\n",
" \"S3DataType\": \"S3Prefix\",\n",
" \"S3Uri\": f\"{output_bucket_path}/{output_prefix}/train\",\n",
" \"S3DataDistributionType\": \"FullyReplicated\",\n",
" }\n",
" },\n",
" \"ContentType\": \"libsvm\",\n",
" \"CompressionType\": \"None\",\n",
" },\n",
" {\n",
" \"ChannelName\": \"validation\",\n",
" \"DataSource\": {\n",
" \"S3DataSource\": {\n",
" \"S3DataType\": \"S3Prefix\",\n",
" \"S3Uri\": f\"{output_bucket_path}/{output_prefix}/validation\",\n",
" \"S3DataDistributionType\": \"FullyReplicated\",\n",
" }\n",
" },\n",
" \"ContentType\": \"libsvm\",\n",
" \"CompressionType\": \"None\",\n",
" },\n",
" ],\n",
" \"OutputDataConfig\": {\"S3OutputPath\": f\"{output_bucket_path}/{output_prefix}/single-xgboost\"},\n",
" \"ResourceConfig\": {\"InstanceCount\": 1, \"InstanceType\": \"ml.m5.2xlarge\", \"VolumeSizeInGB\": 5},\n",
" \"RoleArn\": role,\n",
" \"StaticHyperParameters\": {\n",
" \"objective\": \"reg:linear\",\n",
" \"verbosity\": \"2\",\n",
" },\n",
" \"StoppingCondition\": {\"MaxRuntimeInSeconds\": 43200},\n",
"}\n",
"\n",
"print(f\"Creating a tuning job with name: {tuning_job_name}. It will take between 12 and 17 minutes to complete.\")\n",
"client.create_hyper_parameter_tuning_job(\n",
" HyperParameterTuningJobName=tuning_job_name,\n",
" HyperParameterTuningJobConfig=tuning_job_config,\n",
" TrainingJobDefinition=training_job_definition,\n",
")\n",
"\n",
"status = client.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName=tuning_job_name)[\n",
" \"HyperParameterTuningJobStatus\"\n",
"]\n",
"print(status)\n",
"while status != \"Completed\" and status != \"Failed\":\n",
" time.sleep(60)\n",
" status = client.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName=tuning_job_name)[\n",
" \"HyperParameterTuningJobStatus\"\n",
" ]\n",
" print(status)\n"
]
},
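Once the tuning job finishes, its per-job results can be pulled into a pandas DataFrame for inspection; a minimal sketch, assuming the SageMaker Python SDK is available:

```python
# A sketch: pull per-training-job results of the finished tuning job into a
# pandas DataFrame, sorted so the best (lowest) validation:rmse comes first.
import sagemaker

tuner_analytics = sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name)
results_df = tuner_analytics.dataframe()
print(results_df.sort_values("FinalObjectiveValue").head())
```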
{
"cell_type": "markdown",
"metadata": {},
Expand All @@ -217,10 +370,15 @@
"import boto3\n",
"from time import gmtime, strftime\n",
"\n",
"model_name = f\"{job_name}-model\"\n",
"if deploy_amt_model == True:\n",
" training_of_model_to_be_hosted = client.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName=tuning_job_name)[\"BestTrainingJob\"][\"TrainingJobName\"]\n",
"else:\n",
" training_of_model_to_be_hosted = training_job_name\n",
" \n",
"model_name = f\"{training_of_model_to_be_hosted}-model\"\n",
"print(model_name)\n",
"\n",
"info = client.describe_training_job(TrainingJobName=job_name)\n",
"info = client.describe_training_job(TrainingJobName=training_of_model_to_be_hosted)\n",
"model_data = info[\"ModelArtifacts\"][\"S3ModelArtifacts\"]\n",
"print(model_data)\n",
"\n",
@@ -251,7 +409,7 @@
"from time import gmtime, strftime\n",
"\n",
"endpoint_config_name = f\"DEMO-XGBoostEndpointConfig-{strftime('%Y-%m-%d-%H-%M-%S', gmtime())}\"\n",
"print(endpoint_config_name)\n",
"print(f\"Creating endpoint config with name: {endpoint_config_name}.\")\n",
"create_endpoint_config_response = client.create_endpoint_config(\n",
" EndpointConfigName=endpoint_config_name,\n",
" ProductionVariants=[\n",
@@ -288,7 +446,7 @@
"import time\n",
"\n",
"endpoint_name = f'DEMO-XGBoostEndpoint-{strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())}'\n",
"print(endpoint_name)\n",
"print(f\"Creating endpoint with name: {endpoint_name}. This will take between 9 and 11 minutes to complete.\")\n",
"create_endpoint_response = client.create_endpoint(\n",
" EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name\n",
")\n",
