diff --git a/sagemaker-training-compiler/tensorflow/single_gpu_single_node/hyper-parameter-tuning.ipynb b/sagemaker-training-compiler/tensorflow/single_gpu_single_node/hyper-parameter-tuning.ipynb new file mode 100644 index 0000000000..7e23b24994 --- /dev/null +++ b/sagemaker-training-compiler/tensorflow/single_gpu_single_node/hyper-parameter-tuning.ipynb @@ -0,0 +1,718 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "48d008c6", + "metadata": {}, + "source": [ + "# Compile and Tune a Vision Transformer Model using HyperParameter Tuner on a Single Node " + ] + }, + { + "cell_type": "markdown", + "id": "e3a3be53", + "metadata": {}, + "source": [ + "1. [SageMaker Training Compiler Overview](#SageMaker-Training-Compiler-Overview)\n", + " 1. [Introduction](#Introduction)\n", + "2. [Development Environment](#Development-Environment)\n", + " 1. [Installation](#Installation)\n", + " 2. [SageMaker environment](#SageMaker-environment)\n", + "3. [Working with the Caltech-256 dataset](#Working-with-the-Caltech-256-dataset)\n", + "4. [How effective is SageMaker Training Compiler?](#How-effective-is-SageMaker-Training-Compiler?)\n", + " 1. [SageMaker Training Job](#SageMaker-Training-Job)\n", + " 2. [Training Setup](#Training-Setup)\n", + " 3. [Experimenting with Native TensorFlow](#Experimenting-with-Native-TensorFlow)\n", + " 4. [SageMaker Hyperparameter Tuning Job](#SageMaker-Hyperparameter-Tuning-Job)\n", + " 5. [Experimenting with Optimized TensorFlow](#Experimenting-with-Optimized-TensorFlow)\n", + " 6. [Wait for tuning jobs to complete](#Wait-for-tuning-jobs-to-complete)\n", + " 7. [Fastest Training Job](#Fastest-Training-Job)\n", + "5. [Continue tuning with SageMaker Training Compiler](#Continue-tuning-with-SageMaker-Training-Compiler)\n", + " 1. [Wait for tuning jobs to complete](#Wait-for-tuning-jobs-to-complete)\n", + " 2. [Fastest Convergence](#Fastest-Convergence)\n", + "6. [Conclusion](#Conclusion)\n", + "7. [Clean up](#Clean-up)\n" + ] + }, + { + "cell_type": "markdown", + "id": "420ad24e", + "metadata": {}, + "source": [ + "## SageMaker Training Compiler Overview\n", + "\n", + "SageMaker Training Compiler is a capability of SageMaker that makes hard-to-implement optimizations to reduce training time on GPU instances. The compiler optimizes DL models to accelerate training by more efficiently using SageMaker machine learning (ML) GPU instances. SageMaker Training Compiler is available at no additional charge within SageMaker and can help reduce total billable time as it accelerates training. \n", + "\n", + "SageMaker Training Compiler is integrated into the AWS Deep Learning Containers (DLCs). Using the SageMaker Training Compiler-enabled AWS DLCs, you can compile and optimize training jobs on GPU instances with minimal changes to your code. Bring your deep learning models to SageMaker and enable SageMaker Training Compiler to accelerate the speed of your training job on SageMaker ML instances for accelerated computing. \n", + "\n", + "For more information, see [SageMaker Training Compiler](https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler.html) in the *Amazon SageMaker Developer Guide*.\n", + "\n", + "### Introduction\n", + "\n", + "In this demo, you'll use SageMaker Training Compiler and SageMaker Hyperparameter Tuner to speed up training the `Vision Transformer` model on the `Caltech-256` dataset. To get started, we need to set up the environment with a few prerequisite steps, for permissions, configurations, and so on. \n", + "\n",
+ "**NOTE:** You can run this demo in SageMaker Studio, SageMaker notebook instances, or your local machine with the AWS CLI set up. If using SageMaker Studio or SageMaker notebook instances, make sure you choose one of the TensorFlow-based kernels, `Python 3 (TensorFlow x.y Python 3.x CPU Optimized)` or `conda_tensorflow_p39`, respectively.\n", + "\n", + "**NOTE:** This notebook uses 20 `ml.p3.2xlarge` instances, each with a single GPU. However, it can easily be extended to multiple GPUs on a single node. If you don't have enough quota, see [Request a service quota increase for SageMaker resources](https://docs.aws.amazon.com/sagemaker/latest/dg/regions-quotas.html#service-limit-increase-request-procedure). " + ] + }, + { + "cell_type": "markdown", + "id": "e86b8d5b", + "metadata": {}, + "source": [ + "## Development Environment \n" + ] + }, + { + "cell_type": "markdown", + "id": "8d6666d6", + "metadata": {}, + "source": [ + "### Installation\n", + "\n", + "This example notebook requires **SageMaker Python SDK v2.95.0 or later**.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8f1a4aca", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install pip --upgrade\n", + "!pip install \"sagemaker>=2.95.0\" botocore boto3 awscli --upgrade" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1a93e2a4", + "metadata": {}, + "outputs": [], + "source": [ + "import botocore\n", + "import boto3\n", + "import sagemaker\n", + "\n", + "print(f\"botocore: {botocore.__version__}\")\n", + "print(f\"boto3: {boto3.__version__}\")\n", + "print(f\"sagemaker: {sagemaker.__version__}\")" + ] + }, + { + "cell_type": "markdown", + "id": "2b30e484", + "metadata": {}, + "source": [ + "### SageMaker environment " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "946fe532", + "metadata": {}, + "outputs": [], + "source": [ + "import sagemaker\n", + "\n", + "sess = sagemaker.Session()\n", + "\n", + "# SageMaker session bucket -> used for uploading data, models and logs\n", + "# SageMaker will automatically create this bucket if it does not exist\n", + "sagemaker_session_bucket = None\n", + "if sagemaker_session_bucket is None and sess is not None:\n", + " # set to default bucket if a bucket name is not given\n", + " sagemaker_session_bucket = sess.default_bucket()\n", + "\n", + "role = sagemaker.get_execution_role()\n", + "sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)\n", + "\n", + "print(f\"sagemaker role arn: {role}\")\n", + "print(f\"sagemaker bucket: {sagemaker_session_bucket}\")\n", + "print(f\"sagemaker session region: {sess.boto_region_name}\")" + ] + }, + { + "cell_type": "markdown", + "id": "6774ae6f", + "metadata": {}, + "source": [ + "## Working with the Caltech-256 dataset\n", + "\n", + "We have hosted the [Caltech-256](https://authors.library.caltech.edu/7694/) dataset in S3 in us-west-2. We will transfer this dataset to your account and region for use with SageMaker Training.\n", + "\n", + "The dataset consists of JPEG images organized into directories, with each directory representing an object category."
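, + "\n", + "\n", + "For example, once the sync in the next cell completes, you can spot-check the local copy with a snippet like the following (a minimal sketch; it assumes the `caltech-256` folder created through the `local` variable below):\n", + "\n", + "```python\n", + "import os\n", + "\n", + "# Count the category directories and peek at a few names\n", + "categories = sorted(os.listdir(\"caltech-256\"))\n", + "print(f\"{len(categories)} categories, e.g. {categories[:3]}\")\n", + "```"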
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "24aea98e", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "\n", + "source = \"s3://sagemaker-sample-files/datasets/image/caltech-256/256_ObjectCategories\"\n", + "destn = f\"s3://{sagemaker_session_bucket}/caltech-256\"\n", + "local = \"caltech-256\"\n", + "\n", + "os.system(f\"aws s3 sync {source} {local}\")\n", + "os.system(f\"aws s3 sync {local} {destn}\")" + ] + }, + { + "cell_type": "markdown", + "id": "0b4decc2", + "metadata": {}, + "source": [ + "## How effective is SageMaker Training Compiler?\n", + "\n", + "The effectiveness of SageMaker Training Compiler depends on the model architecture, model size, input shape, and the training loop. Please refer to our [Best Practices](https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler-tips-pitfalls.html) documentation to understand how to get the most out of your training job using SageMaker Training Compiler. In this section, we will compare and contrast a training job with and without SageMaker Training Compiler.\n", + "\n", + "\n", + "### SageMaker Training Job\n", + "\n", + "To create a SageMaker training job, we use a `TensorFlow` estimator. Using the estimator, you can define which training script SageMaker should use through `entry_point`, which `instance_type` to use for training, which `hyperparameters` to pass, and so on.\n", + "\n", + "When a SageMaker training job starts, SageMaker takes care of starting and managing all the required machine learning instances, picks up the `TensorFlow` Deep Learning Container, uploads your training script, and downloads the data from `sagemaker_session_bucket` into the container at `/opt/ml/input/data`.\n", + "\n", + "In the following section, you learn how to set up two versions of the SageMaker `TensorFlow` estimator, a native one without the compiler and an optimized one with the compiler." + ] + }, + { + "cell_type": "markdown", + "id": "cf4f1bba", + "metadata": {}, + "source": [ + "### Training Setup\n", + "\n", + "In this section, we set our hyperparameters to a naive first guess. Notice the low value for `EPOCHS` - this is because we are just experimenting with our hyperparameters to find the best setting that will lead to the fastest training. The effectiveness of SageMaker Training Compiler is often apparent within the first few steps. In the example below, we will inspect the speed of the training job after every epoch." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "80ed60f0", + "metadata": {}, + "outputs": [], + "source": [ + "EPOCHS = 3\n", + "LEARNING_RATE = 1e-3\n", + "WEIGHT_DECAY = 1e-4" + ] + }, + { + "cell_type": "markdown", + "id": "973c2c67", + "metadata": {}, + "source": [ + "### Experimenting with Native TensorFlow\n", + "\n", + "We attempt to find the largest `BATCH_SIZE` that can fit into the memory of an `ml.p3.2xlarge` instance. A larger batch size typically gives us the fastest training speed."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4db276b2", + "metadata": {}, + "outputs": [], + "source": [ + "from sagemaker.tensorflow import TensorFlow\n", + "\n", + "estimator_args = dict(\n", + " source_dir=\"scripts\",\n", + " entry_point=\"vit.py\",\n", + " model_dir=False,\n", + " instance_type=\"ml.p3.2xlarge\",\n", + " instance_count=1,\n", + " framework_version=\"2.9.1\",\n", + " py_version=\"py39\",\n", + " debugger_hook_config=None,\n", + " disable_profiler=True,\n", + " max_run=60 * 20, # 20 minutes\n", + " role=role,\n", + ")\n", + "\n", + "# Configure the training job\n", + "native_estimator = TensorFlow(\n", + " hyperparameters={\n", + " \"EPOCHS\": EPOCHS,\n", + " \"LEARNING_RATE\": LEARNING_RATE,\n", + " \"WEIGHT_DECAY\": WEIGHT_DECAY,\n", + " },\n", + " **estimator_args,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "8d0d62c2", + "metadata": {}, + "source": [ + "### SageMaker Hyperparameter Tuning Job\n", + "\n", + "We use the `sagemaker.tuner.HyperparameterTuner` object to define a Hyperparameter Tuning Job. It imports the training job configuration specified in the `estimator`. We additionally specify some `metric_definitions` to extract training metrics from the training logs. From these `metric_definitions` we select a single metric as the `objective_metric_name` and configure the tuning job to `Minimize` or `Maximize` it. We further provide a constrained search space through the `hyperparameter_ranges` argument.\n", + "\n", + "We can limit the number of training jobs spawned concurrently with the `max_parallel_jobs` argument and limit the total number of training jobs spawned with the `max_jobs` argument.\n", + "\n", + "For more information regarding the SageMaker Hyperparameter Tuner, refer to [Perform Automatic Model Tuning with SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html).\n", + "\n", + "In the example below, we are trying to find the best batch size between 32 and 80 that will result in the smallest possible epoch latency by launching 40 training jobs, 10 at a time. The range for batch sizes is our best guess. You can always reuse and restart a tuning job with an extended range, as explained in [Run a Warm Start Hyperparameter Tuning Job](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-warm-start.html).\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "32ebb26d", + "metadata": {}, + "outputs": [], + "source": [ + "from sagemaker.tuner import HyperparameterTuner, IntegerParameter\n", + "\n", + "tuner_args = dict(\n", + " objective_metric_name=\"training_latency_per_epoch\",\n", + " objective_type=\"Minimize\",\n", + " metric_definitions=[\n", + " {\"Name\": \"training_loss\", \"Regex\": \"loss: ([0-9.]*?) \"},\n",
\"},\n", + " {\"Name\": \"training_latency_per_epoch\", \"Regex\": \"- ([0-9.]*?)s/epoch\"},\n", + " {\"Name\": \"training_avg_latency_per_step\", \"Regex\": \"- ([0-9.]*?)ms/step\"},\n", + " {\"Name\": \"training_avg_latency_per_step\", \"Regex\": \"- ([0-9.]*?)s/step\"},\n", + " ],\n", + " max_jobs=40,\n", + " max_parallel_jobs=10,\n", + " early_stopping_type=\"Auto\",\n", + ")\n", + "\n", + "\n", + "# Define a Hyperparameter Tuning Job\n", + "native_tuner = HyperparameterTuner(\n", + " estimator=native_estimator,\n", + " hyperparameter_ranges={\n", + " \"BATCH_SIZE\": IntegerParameter(32, 80, \"Linear\"),\n", + " },\n", + " base_tuning_job_name=\"native-tf29-vit\",\n", + " **tuner_args,\n", + ")\n", + "\n", + "# Start the tuning job with the specified input data\n", + "native_tuner.fit(inputs=destn, wait=False)\n", + "\n", + "# Save the name of the tuning job\n", + "native_tuner.latest_tuning_job.name" + ] + }, + { + "cell_type": "markdown", + "id": "2aada06a", + "metadata": {}, + "source": [ + "**Tip**: You can reduce the cost of tuning by restricting the batch size to be multiple of 8. Refer to Nvidia's article on the [significance of the number 8](https://developer.nvidia.com/blog/optimizing-gpu-performance-tensor-cores/) when training with Automatic Mixed Precision.\n", + "\n", + "```python\n", + "from sagemaker.tuner import CategoricalParameter\n", + "hyperparameter_ranges={\n", + " 'BATCH_SIZE': CategoricalParameter(list(range(32, 80, 8))),\n", + " }\n", + "```\n", + "\n", + "This can restrict the search space to just 6 training jobs as opposed to 40!" + ] + }, + { + "cell_type": "markdown", + "id": "0a49b0ac", + "metadata": {}, + "source": [ + "### Experimenting with Optimized TensorFlow\n", + "\n", + "Compilation through SageMaker Training Compiler changes the memory footprint of the model. Most commonly, this manifests as a reduction in memory utilization and a consequent increase in the largest batch size that can fit on the GPU. In the example below we will find the new batch size with SageMaker Training Compiler enabled and the resultant latency per epoch.\n", + "\n", + "**Note:** We recommend you to turn the SageMaker Debugger's profiling and debugging tools off when you use compilation to avoid additional overheads." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "90d7a995", + "metadata": {}, + "outputs": [], + "source": [ + "from sagemaker.tensorflow import TensorFlow, TrainingCompilerConfig\n", + "\n", + "# Configure the training job\n", + "optimized_estimator = TensorFlow(\n", + " hyperparameters={\n", + " \"EPOCHS\": EPOCHS,\n", + " \"LEARNING_RATE\": LEARNING_RATE,\n", + " \"WEIGHT_DECAY\": WEIGHT_DECAY,\n", + " },\n", + " compiler_config=TrainingCompilerConfig(),\n", + " **estimator_args,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c92785c6", + "metadata": {}, + "outputs": [], + "source": [ + "from sagemaker.tuner import HyperparameterTuner, IntegerParameter\n", + "\n", + "# Define the tuning job\n", + "optimized_tuner = HyperparameterTuner(\n", + " estimator=optimized_estimator,\n", + " hyperparameter_ranges={\"BATCH_SIZE\": IntegerParameter(20, 60, \"Linear\")},\n", + " base_tuning_job_name=\"optimized-tf29-vit\",\n", + " **tuner_args,\n", + ")\n", + "\n", + "# Start the tuning job with the specified input data\n", + "optimized_tuner.fit(inputs=destn, wait=False)\n", + "\n", + "# Save the name of the tuning job\n", + "optimized_tuner.latest_tuning_job.name" + ] + }, + { + "cell_type": "markdown", + "id": "d382f1dc", + "metadata": {}, + "source": [ + "\n", + "\n", + "\n", + "### Wait for tuning jobs to complete\n", + "\n", + "The tuning jobs described above typically take around 50 minutes to complete." + ] + }, + { + "cell_type": "markdown", + "id": "406462ea", + "metadata": {}, + "source": [ + "**Note:** If the tuner object is no longer available due to a kernel break or refresh, you need to use the tuning job name directly and manually attach the tuning job to a new tuner. For example:\n", + "\n", + "```python\n", + "native_tuner = HyperparameterTuner.attach(\"<your_tuning_job_name>\")\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9633151d", + "metadata": { + "scrolled": false + }, + "outputs": [], + "source": [ + "native_tuner.wait()\n", + "optimized_tuner.wait()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "50b80394", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "from sagemaker.tuner import HyperparameterTuner\n", + "\n", + "native_tuner = HyperparameterTuner.attach(native_tuner.latest_tuning_job.name)\n", + "optimized_tuner = HyperparameterTuner.attach(optimized_tuner.latest_tuning_job.name)\n", + "\n", + "native_tuner.wait()\n", + "optimized_tuner.wait()" + ] + }, + { + "cell_type": "markdown", + "id": "606bef36", + "metadata": {}, + "source": [ + "### Fastest Training Job\n", + "\n", + "Let us collate and analyze the results from the tuning jobs. Each tuner provides its results as a pandas DataFrame.\n", + "We combine the results from both tuners, sort them by epoch latency, and display the top 3 results from each."
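, + "\n", + "\n", + "You can also ask each tuner directly for the name of its best training job, which is a handy quick check before digging into the full DataFrame:\n", + "\n", + "```python\n", + "# Name of the training job with the lowest epoch latency in each tuning job\n", + "print(native_tuner.best_training_job())\n", + "print(optimized_tuner.best_training_job())\n", + "```"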
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f1a9e8b4", + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "pd.set_option(\"display.max_rows\", None, \"display.max_columns\", None)\n", + "\n", + "# Collect the results from the tuners\n", + "native_results = native_tuner.analytics().dataframe()\n", + "optimized_results = optimized_tuner.analytics().dataframe()\n", + "\n", + "# Sort results according to Epoch Latency\n", + "native_results = native_results.sort_values(\n", + " [\"FinalObjectiveValue\", \"BATCH_SIZE\"], ascending=[True, False]\n", + ")\n", + "optimized_results = optimized_results.sort_values(\n", + " [\"FinalObjectiveValue\", \"BATCH_SIZE\"], ascending=[True, False]\n", + ")\n", + "\n", + "# Combine the top N results for viewing\n", + "N = 3\n", + "results = pd.concat([native_results.head(N), optimized_results.head(N)])\n", + "results" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4bf10f41", + "metadata": {}, + "outputs": [], + "source": [ + "# Calculate the potential percentage savings from SageMaker Training Compiler\n", + "# by comparing the best (lowest) epoch latency from each tuning job\n", + "difference = (\n", + " native_results.iloc[0][\"FinalObjectiveValue\"] - optimized_results.iloc[0][\"FinalObjectiveValue\"]\n", + ")\n", + "percentage = difference * 100 / native_results.iloc[0][\"FinalObjectiveValue\"]\n", + "\n", + "f\"With the SageMaker Training Compiler the epoch latency is {percentage:.1f}% lower, \" f\"meaning the training job could be up to {percentage:.1f}% faster!\"" + ] + }, + { + "cell_type": "markdown", + "id": "e1561758", + "metadata": {}, + "source": [ + "## Continue tuning with SageMaker Training Compiler\n", + "\n", + "Now that we have the fastest batch size and compiler configuration, we need to tune the associated hyperparameters to get the fastest convergence.\n", + "\n", + "**Remember:** Total_Training_Time ~= Latency_per_epoch * Number_of_epochs\n", + "\n", + "First, we tuned to reduce the Latency_per_epoch. Now we will tune to reduce the number of epochs required for convergence. Since hyperparameters that directly affect convergence (like learning rate, weight decay, and the learning schedule) depend on batch size, we decouple the two steps as described. \n", + "\n", + "We now train for a higher number of epochs since we are testing the speed of convergence. Ideally, you should tune learning rate and weight decay to minimize validation loss, but for the sake of this example, let's minimize the training loss."
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1bdd22ad", + "metadata": {}, + "outputs": [], + "source": [ + "EPOCHS = 10\n", + "BATCH_SIZE = 56\n", + "\n", + "estimator_args[\"max_run\"] = 60 * 60 # 60 minutes\n", + "\n", + "tuner_args[\"objective_metric_name\"] = \"training_loss\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "64e65bdb", + "metadata": {}, + "outputs": [], + "source": [ + "from sagemaker.tensorflow import TensorFlow, TrainingCompilerConfig\n", + "\n", + "# Configure the training job\n", + "convergence_estimator = TensorFlow(\n", + " hyperparameters={\n", + " \"EPOCHS\": EPOCHS,\n", + " \"BATCH_SIZE\": BATCH_SIZE,\n", + " },\n", + " compiler_config=TrainingCompilerConfig(),\n", + " **estimator_args,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6370e536", + "metadata": {}, + "outputs": [], + "source": [ + "from sagemaker.tuner import HyperparameterTuner, ContinuousParameter\n", + "\n", + "# Define the tunung job\n", + "convergence_tuner = HyperparameterTuner(\n", + " estimator=convergence_estimator,\n", + " hyperparameter_ranges={\n", + " \"LEARNING_RATE\": ContinuousParameter(1e-6, 1e-3, \"Logarithmic\"),\n", + " \"WEIGHT_DECAY\": ContinuousParameter(1e-6, 1e-3, \"Logarithmic\"),\n", + " },\n", + " base_tuning_job_name=\"optimized-tf29-vit\",\n", + " **tuner_args,\n", + ")\n", + "\n", + "# Start the tuning job with the specified input data\n", + "convergence_tuner.fit(inputs=destn, wait=False)\n", + "\n", + "# Save the name of the tuning job\n", + "convergence_tuner.latest_tuning_job.name" + ] + }, + { + "cell_type": "markdown", + "id": "f2457bd4", + "metadata": {}, + "source": [ + "### Wait for tuning jobs to complete\n", + "\n", + "The tuning jobs described above typically take around 2 hours to complete" + ] + }, + { + "cell_type": "markdown", + "id": "9d883251", + "metadata": {}, + "source": [ + "**Note:** If the tuner object is no longer available due to a kernel break or refresh, you need to directly use the training job name and manually attach the tuning job to a new tuner. For example:\n", + "\n", + "```python\n", + "tuner = HyperparameterTuner.attach(\"\")\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "70daa8a7", + "metadata": { + "scrolled": false + }, + "outputs": [], + "source": [ + "convergence_tuner.wait()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6022a1c6", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "from sagemaker.tuner import HyperparameterTuner\n", + "\n", + "convergence_tuner = HyperparameterTuner.attach(convergence_tuner.latest_tuning_job.name)\n", + "\n", + "convergence_tuner.wait()" + ] + }, + { + "cell_type": "markdown", + "id": "6df94c59", + "metadata": {}, + "source": [ + "### Fastest Convergence\n", + "\n", + "Let us analyze the results from the tuning jobs. The tuner provides the results as a Pandas dataframe.\n", + "We sort by training loss and display the top 5 results." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ef7e9c5c", + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "pd.set_option(\"display.max_rows\", None, \"display.max_columns\", None)\n", + "\n", + "# Gather results from the tuner\n", + "results = convergence_tuner.analytics().dataframe()\n", + "\n", + "# Sort according to Training Loss\n", + "results = results.sort_values([\"FinalObjectiveValue\"], ascending=[True])\n", + "\n", + "# Display the top 5 results\n", + "results.head(5)" + ] + }, + { + "cell_type": "markdown", + "id": "ee9d8fe8", + "metadata": {}, + "source": [ + "Having obtained the best configuration for your training job, you can now train to completion. Please consider [checkpointing](https://docs.aws.amazon.com/sagemaker/latest/dg/model-checkpoints.html) so that you can resume training from the best-performing job indicated by the tuner.\n", + "\n", + "## Conclusion\n", + "\n", + "We first arrived at the batch size and compiler configuration that leads to the highest training throughput. Then, we tuned the associated hyperparameters to arrive at the configuration that leads to the fastest convergence. The resulting combination leads to maximum savings!\n" + ] + }, + { + "cell_type": "markdown", + "id": "bc219a22", + "metadata": {}, + "source": [ + "## Clean up\n", + "\n", + "Stop all launched tuning jobs if they are still running." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a6f55df5", + "metadata": {}, + "outputs": [], + "source": [ + "native_tuner.stop_tuning_job()\n", + "optimized_tuner.stop_tuning_job()\n", + "convergence_tuner.stop_tuning_job()" + ] + }, + { + "cell_type": "markdown", + "id": "685e6c99", + "metadata": {}, + "source": [ + "Also, to find instructions on cleaning up resources, see [Clean Up](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-cleanup.html) in the *Amazon SageMaker Developer Guide*." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "conda_tensorflow2_p38", + "language": "python", + "name": "conda_tensorflow2_p38" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}