Integrate SageMaker Automatic Model Tuning (HPO) with one XGBoost Abalone notebook. #3623

Merged · 3 commits · Oct 20, 2022
Changes from 2 commits
192 changes: 175 additions & 17 deletions introduction_to_amazon_algorithms/xgboost_abalone/xgboost_abalone.ipynb
@@ -2,7 +2,9 @@
"cells": [
Contributor @GiannisMitr · Oct 14, 2022

nit: in the first line, "leverage" could be "leveraging".



Contributor Author

True, fixed.

Contributor

Line #58.    training_of_model_to_be_hosted = training_job_name

I believe this is redundant, since you set this variable in the hosting cell.



Contributor Author

Good catch, removing. This is a leftover from an earlier refactoring I did while cleaning up my initial implementation.

Contributor @GiannisMitr · Oct 14, 2022

Line #62.        status = client.describe_training_job(TrainingJobName=job_name)["TrainingJobStatus"]

Shouldn't the param passed in the describe_training_job() call be training_job_name? Did you have any issues during notebook execution?



Contributor Author

You're right - I don't recall having any issues, but I do when I try again now. The last thing I did before creating the PR was to rename a couple of variables (to split job_name into tuning_job_name and training_job_name), and it seems I forgot to re-run the entire notebook after that.
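For reference, a minimal sketch of the corrected polling cell, assuming client and training_job_name are defined as in the notebook (it mirrors the fix applied in the later commit):

import time

# Poll the standalone training job by its own name (not the tuning job's) until it finishes.
status = client.describe_training_job(TrainingJobName=training_job_name)["TrainingJobStatus"]
print(status)
while status not in ("Completed", "Failed"):
    time.sleep(60)
    status = client.describe_training_job(TrainingJobName=training_job_name)["TrainingJobStatus"]
    print(status)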

Contributor @GiannisMitr · Oct 14, 2022

Line #61.        "InputDataConfig": [

To reduce verbosity and line count, can you extract the fields from training_job_definition that are shared between training and tuning (such as InputDataConfig, OutputDataConfig, ResourceConfig, RoleArn) into a common object and re-use it?
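A minimal sketch of that suggestion, for illustration only (shared_job_config and input_data_config are hypothetical names, the hyperparameter values are placeholders, and the PR ultimately keeps the two configs separate):

# Hypothetical refactor: collect the fields common to the standalone training job and the
# tuning job's training definition, then merge the job-specific settings on top.
shared_job_config = {
    "AlgorithmSpecification": {"TrainingImage": container, "TrainingInputMode": "File"},
    "InputDataConfig": input_data_config,  # the same train/validation channels used by both jobs
    "OutputDataConfig": {"S3OutputPath": f"{output_bucket_path}/{output_prefix}/single-xgboost"},
    "ResourceConfig": {"InstanceCount": 1, "InstanceType": "ml.m5.2xlarge", "VolumeSizeInGB": 5},
    "RoleArn": role,
}

create_training_params = {
    **shared_job_config,
    "TrainingJobName": training_job_name,
    "HyperParameters": {"max_depth": "5", "eta": "0.2"},  # plus the remaining values from the training cell
    "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
}

training_job_definition = {
    **shared_job_config,
    "StaticHyperParameters": {"objective": "reg:linear", "verbosity": "2"},
    "StoppingCondition": {"MaxRuntimeInSeconds": 43200},
}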



Contributor Author

I thought about that, but I decided that a bit of redundancy is acceptable if it helps the customer. In my mind, it's easier for the customer to have all training- or tuning-related config in a single section that they can quickly copy-paste, instead of having to piece things together from different cells.

Contributor

I understand your point, but I'm also thinking that re-using fields makes it more evident to the user that a lot of the TuningJob configuration is actually shared with the TrainingJob, so they can focus more on the "actual" tuning configs. But I trust your instinct on that.


Line #63.        print(status)  

nit: extra trailing space, if you're fixing other things.



Contributor Author

Fixed.


Line #5.    tuning_job_config = {

Since this is meant to be an introductory notebook, it might be beneficial to add some basic exposition around how the tuning config is defined here, or even some links to the HPO documentation to learn more outside of the notebook. Someone with no experience might not even know where gamma, eta, etc. come from.
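For illustration, exposition along those lines could be as simple as annotating the ranges (the names are XGBoost hyperparameters, the ranges are the ones in the diff, and the variable name continuous_parameter_ranges is hypothetical):

# eta, gamma, min_child_weight, etc. are hyperparameters of the XGBoost algorithm itself;
# AMT searches each one within the range given here.
continuous_parameter_ranges = [
    # eta: step-size shrinkage (learning rate) applied at each boosting round
    {"Name": "eta", "MinValue": "0.1", "MaxValue": "0.5"},
    # gamma: minimum loss reduction required to make a further split on a leaf node
    {"Name": "gamma", "MinValue": "0", "MaxValue": "5"},
    # min_child_weight: minimum sum of instance weight needed in a child node
    {"Name": "min_child_weight", "MinValue": "0", "MaxValue": "120"},
]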



Contributor Author

Good point. I'll add a comment about that at the top of the section and link this page, which explains all the points you raised: https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-ex-tuning-job.html


Line #51.            # Change parallel training jobs run by AMT to reduce total training time, constrained by your account limits.

Do we know what the default is? We don't have to note it here necessarily, but maybe it's worth calling out in case users are tempted to try a ridiculous number.



Contributor Author

Yes, they're explained here https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-limits.html and I will add this link right before the ResourceLimits key.


Line #52.            # if max_jobs=max_parallel_jobs then Bayesian search turns to Random.

If there's a concise way to explain the practical effect this might have on this model, that might be helpful and/or interesting to readers - not a big deal, though.



Contributor Author

The explanation is the following:

For Random Search there's no cost. It's just better to use a higher MaxParallelTrainingJobs (up to the limits) for maximum speed.

For Bayesian search, the cost is that the higher MaxParallelTrainingJobs is, the more similar Bayesian search becomes to Random Search. The reason is that Bayesian won't have all the sequential information it needs to make the best decision on which hyperparameters to pick next. This means it may need more training jobs overall to find the optimum when the parallel factor is high.

It's a bit hard to summarise the above concisely, but part of it is in a way already mentioned.
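A small sketch of that trade-off for this notebook's ResourceLimits (illustrative only; the diff uses MaxNumberOfTrainingJobs=6 with MaxParallelTrainingJobs=2):

# Fully parallel: all jobs launch at once, so Bayesian search has no completed results to
# learn from and effectively behaves like random search (fastest wall-clock time).
resource_limits_parallel = {"MaxNumberOfTrainingJobs": 6, "MaxParallelTrainingJobs": 6}

# Partially sequential: later jobs can exploit the metrics of earlier ones, so the tuner
# typically needs fewer total jobs to reach a good optimum, at the cost of wall-clock time.
resource_limits_sequential = {"MaxNumberOfTrainingJobs": 6, "MaxParallelTrainingJobs": 2}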


Line #108.    print(status)

If this is going to run significantly longer than the regular training job it might be good just to note that in a comment here so users know what to expect. See below where it's noted "This takes 9-11 minutes to complete." for endpoint deployment.



Contributor Author

Added.


Is there an opportunity to compare HPO tuned models and regularly trained models here in terms of prediction outputs?



Contributor Author

That's a good point, and it would help customers quantify the impact of using HPO vs. plain Training, but it would require creating a separate endpoint for the model trained with Training (not a problem) and, in general, would increase the scope of the notebook. We could instead have a dedicated "HPO vs Training" notebook whose purpose is to show that better performance can be achieved with a tuned model. I'd refrain, though, from adding this comparison to all the notebooks I'm modifying (50+). What do you think?
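If such a comparison notebook is created, a rough sketch of the evaluation cell might look like the following (hypothetical: it assumes both models are deployed behind their own endpoints, and that region, the endpoint names, and the libsvm test payloads/labels are prepared as elsewhere in this notebook):

import math

import boto3

runtime_client = boto3.client("sagemaker-runtime", region_name=region)


def endpoint_rmse(endpoint_name, payloads, labels):
    """Score one endpoint on held-out libsvm rows (features only) against their true labels."""
    squared_errors = []
    for payload, label in zip(payloads, labels):
        response = runtime_client.invoke_endpoint(
            EndpointName=endpoint_name, ContentType="text/x-libsvm", Body=payload
        )
        prediction = float(response["Body"].read().decode("utf-8"))
        squared_errors.append((prediction - label) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical endpoint names for the two variants.
print("RMSE, standalone training:", endpoint_rmse(training_endpoint_name, payloads, labels))
print("RMSE, AMT best model:     ", endpoint_rmse(amt_endpoint_name, payloads, labels))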


What you said makes sense. Maybe after we've created this comparison notebook we can come back and add a reference to it in a comment or something.


To create a tuning job using the AWS SageMaker Automatic Model Tuning API, you need to define 3 attributes.

"three parameters" instead "3 attributes."

(to specify settings...)

"(which specifies settings..."

(to configure the...

"(which configures the..."

To learn more about that,

"these" instead of "that"

These are nits, feel free to ignore.



{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"tags": []
},
"source": [
"# Regression with Amazon SageMaker XGBoost algorithm\n",
"_**Single machine training for regression with Amazon SageMaker XGBoost algorithm**_\n",
@@ -102,7 +104,13 @@
"source": [
"## Training the XGBoost model\n",
"\n",
"After setting training parameters, we kick off training, and poll for status until training is completed, which in this example, takes between 5 and 6 minutes."
"After setting training parameters, we kick off training, and poll for status until training is completed, which in this example, takes between 5 and 6 minutes.\n",
"\n",
"Training can be done by either calling SageMaker Training with a set of hyperparameters values to train with, or by leveraging SageMaker Automatic Model Tuning ([AMT](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html)). AMT, also known as hyperparameter tuning (HPO), finds the best version of a model by running many training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in a model that performs the best, as measured by a metric that you choose.\n",
"\n",
"In this notebook, both methods are used for demonstration purposes, but the model that the HPO job creates is the one that is eventually hosted. You can instead choose to deploy the model created by the standalone training job by changing the below variable `deploy_amt_model` to False.\n",
"\n",
"### Initiliazing common variables "
]
},
{
@@ -111,7 +119,16 @@
"metadata": {},
"outputs": [],
"source": [
"container = sagemaker.image_uris.retrieve(\"xgboost\", region, \"1.5-1\")"
"container = sagemaker.image_uris.retrieve(\"xgboost\", region, \"1.5-1\")\n",
"client = boto3.client(\"sagemaker\", region_name=region)\n",
"deploy_amt_model = True"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Training with SageMaker Training"
]
},
{
@@ -123,9 +140,9 @@
"%%time\n",
"import boto3\n",
"from time import gmtime, strftime\n",
"import time\n",
"\n",
"job_name = f\"DEMO-xgboost-regression-{strftime('%Y-%m-%d-%H-%M-%S', gmtime())}\"\n",
"print(\"Training job\", job_name)\n",
"training_job_name = f\"DEMO-xgboost-regression-{strftime('%Y-%m-%d-%H-%M-%S', gmtime())}\"\n",
"\n",
"# Ensure that the training and validation data folders generated above are reflected in the \"InputDataConfig\" parameter below.\n",
"\n",
@@ -134,7 +151,7 @@
" \"RoleArn\": role,\n",
" \"OutputDataConfig\": {\"S3OutputPath\": f\"{output_bucket_path}/{output_prefix}/single-xgboost\"},\n",
" \"ResourceConfig\": {\"InstanceCount\": 1, \"InstanceType\": \"ml.m5.2xlarge\", \"VolumeSizeInGB\": 5},\n",
" \"TrainingJobName\": job_name,\n",
" \"TrainingJobName\": training_job_name,\n",
" \"HyperParameters\": {\n",
" \"max_depth\": \"5\",\n",
" \"eta\": \"0.2\",\n",
@@ -174,17 +191,13 @@
" ],\n",
"}\n",
"\n",
"\n",
"client = boto3.client(\"sagemaker\", region_name=region)\n",
"print(f\"Creating a training job with name: {training_job_name}. It will take between 5 and 6 minutes to complete.\")\n",
"client.create_training_job(**create_training_params)\n",
"\n",
"import time\n",
"\n",
"status = client.describe_training_job(TrainingJobName=job_name)[\"TrainingJobStatus\"]\n",
"status = client.describe_training_job(TrainingJobName=training_job_name)[\"TrainingJobStatus\"]\n",
"print(status)\n",
"while status != \"Completed\" and status != \"Failed\":\n",
" time.sleep(60)\n",
" status = client.describe_training_job(TrainingJobName=job_name)[\"TrainingJobStatus\"]\n",
" status = client.describe_training_job(TrainingJobName=training_job_name)[\"TrainingJobStatus\"]\n",
" print(status)"
]
},
@@ -195,6 +208,146 @@
"Note that the \"validation\" channel has been initialized too. The SageMaker XGBoost algorithm actually calculates RMSE and writes it to the CloudWatch logs on the data passed to the \"validation\" channel."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Tuning with SageMaker Automatic Model Tuning\n",
"\n",
"To create a tuning job using the AWS SageMaker Automatic Model Tuning API, you need to define 3 attributes. \n",
"\n",
"1. the tuning job name (string)\n",
"2. the tuning job config (to specify settings for the hyperparameter tuning job - JSON object)\n",
"3. training job definition (to configure the training jobs that the tuning job launches - JSON object).\n",
"\n",
"To learn more about that, refer to the [Configure and Launch a Hyperparameter Tuning Job](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-ex-tuning-job.html) documentation.\n",
"\n",
"Note that the tuning job will 12-17 minutes to complete."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from time import gmtime, strftime, sleep\n",
"\n",
"tuning_job_name = \"DEMO-xgboost-reg-\" + strftime(\"%d-%H-%M-%S\", gmtime())\n",
"\n",
"tuning_job_config = {\n",
" \"ParameterRanges\": {\n",
" \"CategoricalParameterRanges\": [],\n",
" \"ContinuousParameterRanges\": [\n",
" {\n",
" \"MaxValue\": \"0.5\",\n",
" \"MinValue\": \"0.1\",\n",
" \"Name\": \"eta\",\n",
" },\n",
" {\n",
" \"MaxValue\": \"5\",\n",
" \"MinValue\": \"0\",\n",
" \"Name\": \"gamma\",\n",
" },\n",
" {\n",
" \"MaxValue\": \"120\",\n",
" \"MinValue\": \"0\",\n",
" \"Name\": \"min_child_weight\",\n",
" },\n",
" {\n",
" \"MaxValue\": \"1\",\n",
" \"MinValue\": \"0.5\",\n",
" \"Name\": \"subsample\",\n",
" },\n",
" {\n",
" \"MaxValue\": \"2\",\n",
" \"MinValue\": \"0\",\n",
" \"Name\": \"alpha\",\n",
" },\n",
" ],\n",
" \"IntegerParameterRanges\": [\n",
" {\n",
" \"MaxValue\": \"10\",\n",
" \"MinValue\": \"0\",\n",
" \"Name\": \"max_depth\",\n",
" },\n",
" {\n",
" \"MaxValue\": \"4000\",\n",
" \"MinValue\": \"1\",\n",
" \"Name\": \"num_round\",\n",
" }\n",
" ],\n",
" },\n",
" # SageMaker sets the following default limits for resources used by automatic model tuning:\n",
" # https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-limits.html\n",
" \"ResourceLimits\": {\n",
" # Increase the max number of training jobs for increased accuracy (and training time).\n",
" \"MaxNumberOfTrainingJobs\": 6, \n",
" # Change parallel training jobs run by AMT to reduce total training time. Constrained by your account limits.\n",
" # if max_jobs=max_parallel_jobs then Bayesian search turns to Random.\n",
" \"MaxParallelTrainingJobs\": 2\n",
" },\n",
" \"Strategy\": \"Bayesian\",\n",
" \"HyperParameterTuningJobObjective\": {\"MetricName\": \"validation:rmse\", \"Type\": \"Minimize\"},\n",
"}\n",
"\n",
"training_job_definition = {\n",
" \"AlgorithmSpecification\": {\"TrainingImage\": container, \"TrainingInputMode\": \"File\"},\n",
" \"InputDataConfig\": [\n",
" {\n",
" \"ChannelName\": \"train\",\n",
" \"DataSource\": {\n",
" \"S3DataSource\": {\n",
" \"S3DataType\": \"S3Prefix\",\n",
" \"S3Uri\": f\"{output_bucket_path}/{output_prefix}/train\",\n",
" \"S3DataDistributionType\": \"FullyReplicated\",\n",
" }\n",
" },\n",
" \"ContentType\": \"libsvm\",\n",
" \"CompressionType\": \"None\",\n",
" },\n",
" {\n",
" \"ChannelName\": \"validation\",\n",
" \"DataSource\": {\n",
" \"S3DataSource\": {\n",
" \"S3DataType\": \"S3Prefix\",\n",
" \"S3Uri\": f\"{output_bucket_path}/{output_prefix}/validation\",\n",
" \"S3DataDistributionType\": \"FullyReplicated\",\n",
" }\n",
" },\n",
" \"ContentType\": \"libsvm\",\n",
" \"CompressionType\": \"None\",\n",
" },\n",
" ],\n",
" \"OutputDataConfig\": {\"S3OutputPath\": f\"{output_bucket_path}/{output_prefix}/single-xgboost\"},\n",
" \"ResourceConfig\": {\"InstanceCount\": 1, \"InstanceType\": \"ml.m5.2xlarge\", \"VolumeSizeInGB\": 5},\n",
" \"RoleArn\": role,\n",
" \"StaticHyperParameters\": {\n",
" \"objective\": \"reg:linear\",\n",
" \"verbosity\": \"2\",\n",
" },\n",
" \"StoppingCondition\": {\"MaxRuntimeInSeconds\": 43200},\n",
"}\n",
"\n",
"print(f\"Creating a tuning job with name: {tuning_job_name}. It will take between 12 and 17 minutes to complete.\")\n",
"client.create_hyper_parameter_tuning_job(\n",
" HyperParameterTuningJobName=tuning_job_name,\n",
" HyperParameterTuningJobConfig=tuning_job_config,\n",
" TrainingJobDefinition=training_job_definition,\n",
")\n",
"\n",
"status = client.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName=tuning_job_name)[\n",
" \"HyperParameterTuningJobStatus\"\n",
"]\n",
"print(status)\n",
"while status != \"Completed\" and status != \"Failed\":\n",
" time.sleep(60)\n",
" status = client.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName=tuning_job_name)[\n",
" \"HyperParameterTuningJobStatus\"\n",
" ]\n",
" print(status)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
Expand All @@ -217,10 +370,15 @@
"import boto3\n",
"from time import gmtime, strftime\n",
"\n",
"model_name = f\"{job_name}-model\"\n",
"if deploy_amt_model == True:\n",
" training_of_model_to_be_hosted = client.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName=tuning_job_name)[\"BestTrainingJob\"][\"TrainingJobName\"]\n",
"else:\n",
" training_of_model_to_be_hosted = training_job_name\n",
" \n",
"model_name = f\"{training_of_model_to_be_hosted}-model\"\n",
"print(model_name)\n",
"\n",
"info = client.describe_training_job(TrainingJobName=job_name)\n",
"info = client.describe_training_job(TrainingJobName=training_of_model_to_be_hosted)\n",
"model_data = info[\"ModelArtifacts\"][\"S3ModelArtifacts\"]\n",
"print(model_data)\n",
"\n",
@@ -251,7 +409,7 @@
"from time import gmtime, strftime\n",
"\n",
"endpoint_config_name = f\"DEMO-XGBoostEndpointConfig-{strftime('%Y-%m-%d-%H-%M-%S', gmtime())}\"\n",
"print(endpoint_config_name)\n",
"print(f\"Creating endpoint config with name: {endpoint_config_name}.\")\n",
"create_endpoint_config_response = client.create_endpoint_config(\n",
" EndpointConfigName=endpoint_config_name,\n",
" ProductionVariants=[\n",
@@ -288,7 +446,7 @@
"import time\n",
"\n",
"endpoint_name = f'DEMO-XGBoostEndpoint-{strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())}'\n",
"print(endpoint_name)\n",
"print(f\"Creating endpoint with name: {endpoint_name}. This will take between 9 and 11 minutes to complete.\")\n",
"create_endpoint_response = client.create_endpoint(\n",
" EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name\n",
")\n",