From f6cefa857cfc852995610826cabdd739fde3a58e Mon Sep 17 00:00:00 2001 From: vivekmadan2 <53404938+vivekmadan2@users.noreply.github.com> Date: Tue, 6 Sep 2022 13:42:03 -0400 Subject: [PATCH] built-in algorithm - tensorflow image classification: Pull Cloudwatch logs (#3590) Co-authored-by: Vivek Madan --- ...azon_TensorFlow_Image_Classification.ipynb | 224 +++++++++++++----- 1 file changed, 161 insertions(+), 63 deletions(-) diff --git a/introduction_to_amazon_algorithms/image_classification_tensorflow/Amazon_TensorFlow_Image_Classification.ipynb b/introduction_to_amazon_algorithms/image_classification_tensorflow/Amazon_TensorFlow_Image_Classification.ipynb index 40f68adf54..acb43d9117 100644 --- a/introduction_to_amazon_algorithms/image_classification_tensorflow/Amazon_TensorFlow_Image_Classification.ipynb +++ b/introduction_to_amazon_algorithms/image_classification_tensorflow/Amazon_TensorFlow_Image_Classification.ipynb @@ -2,7 +2,7 @@ "cells": [ { "cell_type": "markdown", - "id": "c5b50b55", + "id": "491b904e", "metadata": {}, "source": [ "# Introduction to SageMaker TensorFlow - Image Classification" @@ -10,7 +10,7 @@ }, { "cell_type": "markdown", - "id": "e718cb54", + "id": "e8df953a", "metadata": {}, "source": [ "---\n", @@ -28,7 +28,7 @@ }, { "cell_type": "markdown", - "id": "7d1939f5", + "id": "3acf3836", "metadata": {}, "source": [ "1. 
[Set Up](#1.-Set-Up)\n", @@ -43,13 +43,14 @@ " * [Set Training parameters](#4.2.-Set-Training-parameters)\n", " * [Train with Automatic Model Tuning (HPO)](#AMT)\n", " * [Start Training](#4.4.-Start-Training)\n", - " * [Deploy & run Inference on the fine-tuned model](#4.5.-Deploy-&-run-Inference-on-the-fine-tuned-model)\n", - " * [Incrementally train the fine-tuned model](#4.6.-Incrementally-train-the-fine-tuned-model)" + " * [Extract Training performance metrics](#4.5.-Extract-Training-performance-metrics)\n", + " * [Deploy & run Inference on the fine-tuned model](#4.6.-Deploy-&-run-Inference-on-the-fine-tuned-model)\n", + " * [Incrementally train the fine-tuned model](#4.7.-Incrementally-train-the-fine-tuned-model)" ] }, { "cell_type": "markdown", - "id": "f9f252e3", + "id": "99e04731", "metadata": {}, "source": [ "## 1. Set Up\n", @@ -61,7 +62,7 @@ { "cell_type": "code", "execution_count": null, - "id": "e45065a1", + "id": "a536a5dd", "metadata": {}, "outputs": [], "source": [ @@ -70,7 +71,7 @@ }, { "cell_type": "markdown", - "id": "fe18f520", + "id": "951e8b8a", "metadata": {}, "source": [ "---\n", @@ -83,7 +84,7 @@ { "cell_type": "code", "execution_count": null, - "id": "343deffb", + "id": "0ab99140", "metadata": {}, "outputs": [], "source": [ @@ -98,7 +99,7 @@ }, { "cell_type": "markdown", - "id": "1a88d949", + "id": "634dd01d", "metadata": {}, "source": [ "## 2. Select a pre-trained model\n", @@ -110,7 +111,7 @@ { "cell_type": "code", "execution_count": null, - "id": "41357d17", + "id": "e3f1d777", "metadata": { "jumpStartAlterations": [ "modelIdVersion" @@ -123,7 +124,7 @@ }, { "cell_type": "markdown", - "id": "ec35e3e5", + "id": "772154b7", "metadata": {}, "source": [ "***\n", @@ -134,7 +135,7 @@ { "cell_type": "code", "execution_count": null, - "id": "cb0807c3", + "id": "d7cb33f6", "metadata": {}, "outputs": [], "source": [ @@ -161,7 +162,7 @@ }, { "cell_type": "markdown", - "id": "39939c07", + "id": "c50ca21f", "metadata": {}, "source": [ "## 3. 
Run inference on the pre-trained model\n", @@ -172,7 +173,7 @@ }, { "cell_type": "markdown", - "id": "429361b2", + "id": "25d49542", "metadata": {}, "source": [ "### 3.1. Retrieve Artifacts & Deploy an Endpoint\n", @@ -184,7 +185,7 @@ { "cell_type": "code", "execution_count": null, - "id": "3175cd6c", + "id": "6e0b50d5", "metadata": {}, "outputs": [], "source": [ @@ -238,7 +239,7 @@ }, { "cell_type": "markdown", - "id": "bb880a6d", + "id": "2ea5496e", "metadata": {}, "source": [ "### 3.2. Download example images for inference\n", @@ -250,7 +251,7 @@ { "cell_type": "code", "execution_count": null, - "id": "bc773407", + "id": "60b98034", "metadata": {}, "outputs": [], "source": [ @@ -269,7 +270,7 @@ }, { "cell_type": "markdown", - "id": "7ff3fb64", + "id": "a435058a", "metadata": {}, "source": [ "### 3.3. Query endpoint and parse response\n", @@ -281,7 +282,7 @@ { "cell_type": "code", "execution_count": null, - "id": "e6627767", + "id": "2306b489", "metadata": {}, "outputs": [], "source": [ @@ -315,7 +316,7 @@ }, { "cell_type": "markdown", - "id": "08b7fa6f", + "id": "797169e7", "metadata": {}, "source": [ "### 3.4. Clean up the endpoint" @@ -324,7 +325,7 @@ { "cell_type": "code", "execution_count": null, - "id": "36c93e25", + "id": "835e888b", "metadata": {}, "outputs": [], "source": [ @@ -335,7 +336,7 @@ }, { "cell_type": "markdown", - "id": "cf1c3ed4", + "id": "504466ea", "metadata": {}, "source": [ "## 4. Fine-tune the pre-trained model on a custom dataset\n", @@ -365,7 +366,7 @@ }, { "cell_type": "markdown", - "id": "d38e20c6", + "id": "bbe2c89a", "metadata": {}, "source": [ "### 4.1. Retrieve Training artifacts\n", @@ -377,7 +378,7 @@ { "cell_type": "code", "execution_count": null, - "id": "e7ef93bb", + "id": "d47fb1e8", "metadata": {}, "outputs": [], "source": [ @@ -407,7 +408,7 @@ }, { "cell_type": "markdown", - "id": "522d8fa6", + "id": "483cbb5b", "metadata": {}, "source": [ "### 4.2. 
Set Training parameters\n", @@ -425,7 +426,7 @@ { "cell_type": "code", "execution_count": null, - "id": "d2b1f26a", + "id": "4a6897f2", "metadata": {}, "outputs": [], "source": [ @@ -443,7 +444,7 @@ }, { "cell_type": "markdown", - "id": "abf366a1", + "id": "410123e7", "metadata": {}, "source": [ "***\n", @@ -454,7 +455,7 @@ { "cell_type": "code", "execution_count": null, - "id": "2ce3e271", + "id": "d4265a8f", "metadata": {}, "outputs": [], "source": [ @@ -470,7 +471,7 @@ }, { "cell_type": "markdown", - "id": "0ccb5352", + "id": "0095df25", "metadata": {}, "source": [ "### 4.3. Train with Automatic Model Tuning ([HPO](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html)) \n", @@ -482,7 +483,7 @@ { "cell_type": "code", "execution_count": null, - "id": "812e2197", + "id": "147d2dc8", "metadata": {}, "outputs": [], "source": [ @@ -492,15 +493,9 @@ "use_amt = False\n", "\n", "# Define objective metric per framework, based on which the best model will be selected.\n", - "metric_definitions_per_model = {\n", - " \"tensorflow\": {\n", - " \"metrics\": [{\"Name\": \"val_accuracy\", \"Regex\": \"val_accuracy: ([0-9\\\\.]+)\"}],\n", - " \"type\": \"Maximize\",\n", - " },\n", - " \"pytorch\": {\n", - " \"metrics\": [{\"Name\": \"val_accuracy\", \"Regex\": \"val Acc: ([0-9\\\\.]+)\"}],\n", - " \"type\": \"Maximize\",\n", - " },\n", + "amt_metric_definitions = {\n", + " \"metrics\": [{\"Name\": \"val_accuracy\", \"Regex\": \"val_accuracy: ([0-9\\\\.]+)\"}],\n", + " \"type\": \"Maximize\",\n", "}\n", "\n", "# You can select from the hyperparameters supported by the model, and configure ranges of values to be searched for training the optimal model.(https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-define-ranges.html)\n", @@ -517,7 +512,7 @@ }, { "cell_type": "markdown", - "id": "59a61921", + "id": "c3011dd2", "metadata": {}, "source": [ "### 4.4. 
Start Training\n", "***\n", "We start by creating the estimator object with all the required assets and then launch the training job.\n", "***" ] }, { "cell_type": "code", "execution_count": null, - "id": "f3b68607", + "id": "5068463b", "metadata": {}, "outputs": [], "source": [ @@ -539,6 +534,13 @@ "\n", "training_job_name = name_from_base(f\"jumpstart-example-{model_id}-transfer-learning\")\n", "\n", + "training_metric_definitions = [\n", + " {\"Name\": \"val_accuracy\", \"Regex\": \"val_accuracy: ([0-9\\\\.]+)\"},\n", + " {\"Name\": \"val_loss\", \"Regex\": \"val_loss: ([0-9\\\\.]+)\"},\n", + " {\"Name\": \"train_accuracy\", \"Regex\": \"- accuracy: ([0-9\\\\.]+)\"},\n", + " {\"Name\": \"train_loss\", \"Regex\": \"- loss: ([0-9\\\\.]+)\"},\n", + "]\n", + "\n", "# Create SageMaker Estimator instance\n", "ic_estimator = Estimator(\n", " role=aws_role,\n", @@ -552,21 +554,19 @@ " hyperparameters=hyperparameters,\n", " output_path=s3_output_location,\n", " base_job_name=training_job_name,\n", + " metric_definitions=training_metric_definitions,\n", ")\n", "\n", "if use_amt:\n", - " metric_definitions = next(\n", - " value for key, value in metric_definitions_per_model.items() if model_id.startswith(key)\n", - " )\n", "\n", " hp_tuner = HyperparameterTuner(\n", " ic_estimator,\n", - " metric_definitions[\"metrics\"][0][\"Name\"],\n", + " amt_metric_definitions[\"metrics\"][0][\"Name\"],\n", " hyperparameter_ranges,\n", - " metric_definitions[\"metrics\"],\n", + " amt_metric_definitions[\"metrics\"],\n", " max_jobs=max_jobs,\n", " max_parallel_jobs=max_parallel_jobs,\n", - " objective_type=metric_definitions[\"type\"],\n", + " objective_type=amt_metric_definitions[\"type\"],\n", " base_tuning_job_name=training_job_name,\n", " )\n", "\n", @@ -579,10 +579,107 @@ }, { "cell_type": "markdown", - "id": "90304e0c", + "id": "6e75e44c", "metadata": {}, "source": [ + "### 4.5. Extract Training performance metrics\n", + "***\n", + "Performance metrics such as training accuracy/loss and validation accuracy/loss can be accessed through Amazon CloudWatch during training. 
The code below provides a link to the CloudWatch console page where these metrics can be found. \n", + "\n", + "Note that the default resolution in Amazon CloudWatch is one minute, i.e., it averages the metrics logged within each one-minute interval. Amazon CloudWatch also supports [high-resolution custom metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/publishingMetrics.html), and its finest resolution is 1 second. However, the finer the resolution, the shorter the lifespan of the CloudWatch metrics. For the 1-second frequency resolution, the CloudWatch metrics are available for 3 hours. For more information about the resolution and the lifespan of the CloudWatch metrics, see [GetMetricStatistics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricStatistics.html) in the Amazon CloudWatch API Reference.\n", + "***" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6120c260", + "metadata": {}, + "outputs": [], + "source": [ + "if use_amt:\n", + " training_job_name = hp_tuner.best_training_job()\n", + "else:\n", + " training_job_name = ic_estimator.latest_training_job.job_name" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "422ac8fc", + "metadata": {}, + "outputs": [], + "source": [ + "import sagemaker\n", + "from IPython.display import Markdown, display\n", + "\n", + "sagemaker_session = sagemaker.Session()\n", + "\n", + "link = (\n", + " \"https://console.aws.amazon.com/cloudwatch/home?region=\"\n", + " + sagemaker_session.boto_region_name\n", + " + \"#metricsV2:query=%7B/aws/sagemaker/TrainingJobs,TrainingJobName%7D%20\"\n", + " + training_job_name\n", + ")\n", + "display(Markdown(\"CloudWatch metrics: [link](\" + link + \")\"))" + ] + }, + { + "cell_type": "markdown", + "id": "cd15c4bb", + "metadata": {}, + "source": [ + "***\n", + "Alternatively, we can fetch these metrics and analyze them within the notebook.\n", + "***" + ] + }, + { + "cell_type": "code", + "execution_count": 
null, + "id": "d915b42b", + "metadata": {}, + "outputs": [], + "source": [ + "from sagemaker import TrainingJobAnalytics\n", + "\n", + "df = TrainingJobAnalytics(training_job_name=training_job_name).dataframe()\n", + "\n", + "df.head(10)" + ] + }, + { + "cell_type": "markdown", + "id": "dd0ab950", + "metadata": {}, + "source": [ + "***\n", + "We can also filter the fetched metrics by name.\n", + "***" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "df44f7f0", + "metadata": {}, + "outputs": [], + "source": [ + "metric_names = [metric[\"Name\"] for metric in training_metric_definitions]\n", + "\n", + "metrics_df = {\n", + " metric_name: df.query(f\"metric_name == '{metric_name}'\") for metric_name in metric_names\n", + "}\n", + "\n", + "metrics_df[\"val_loss\"]" + ] + }, + { + "cell_type": "markdown", + "id": "08072894", "metadata": {}, "source": [ - "## 4.5. Deploy & run Inference on the fine-tuned model\n", + "## 4.6. Deploy & run Inference on the fine-tuned model\n", "***\n", "A trained model does nothing on its own. We now want to use the model to perform inference. For this example, that means predicting the class label of an image. We follow the same steps as in the [Section 3 - Run inference on the pre-trained model](#3.-Run-inference-on-the-pre-trained-model). We start by retrieving the artifacts for deploying an endpoint. 
However, instead of base_predictor, we deploy the `ic_estimator` that we fine-tuned.\n", "***" @@ -591,7 +688,7 @@ { "cell_type": "code", "execution_count": null, - "id": "1e1b318a", + "id": "7915265a", "metadata": {}, "outputs": [], "source": [ @@ -626,7 +723,7 @@ }, { "cell_type": "markdown", - "id": "1680c7b9", + "id": "bccf7925", "metadata": {}, "source": [ "---\n", @@ -638,7 +735,7 @@ { "cell_type": "code", "execution_count": null, - "id": "f0a8d503", + "id": "f6eb2261", "metadata": {}, "outputs": [], "source": [ @@ -660,7 +757,7 @@ }, { "cell_type": "markdown", - "id": "006165b6", + "id": "2a3f382f", "metadata": {}, "source": [ "---\n", @@ -672,7 +769,7 @@ { "cell_type": "code", "execution_count": null, - "id": "1bf49f4d", + "id": "94dc4f40", "metadata": {}, "outputs": [], "source": [ @@ -696,7 +793,7 @@ }, { "cell_type": "markdown", - "id": "cda81d4d", + "id": "8c672f19", "metadata": {}, "source": [ "---\n", @@ -708,7 +805,7 @@ { "cell_type": "code", "execution_count": null, - "id": "7f58f448", + "id": "ad9f7b01", "metadata": {}, "outputs": [], "source": [ @@ -719,10 +816,10 @@ }, { "cell_type": "markdown", - "id": "c70b96df", + "id": "25ae6c5d", "metadata": {}, "source": [ - "## 4.6. Incrementally train the fine-tuned model\n", + "## 4.7. Incrementally train the fine-tuned model\n", "\n", "***\n", "Incremental training allows you to train a new model using an expanded dataset that contains an underlying pattern that was not accounted for in the previous training and which resulted in poor model performance. You can use the artifacts from an existing model and use an expanded dataset to train a new model. 
Incremental training saves both time and resources as you don’t need to retrain a model from scratch.\n", "***" ] }, { "cell_type": "code", "execution_count": null, - "id": "8b716544", + "id": "3e55c2b0", "metadata": {}, "outputs": [], "source": [ @@ -755,7 +852,7 @@ { "cell_type": "code", "execution_count": null, - "id": "0f2b7c2a", + "id": "83d48880", "metadata": {}, "outputs": [], "source": [ @@ -777,6 +874,7 @@ " hyperparameters=hyperparameters,\n", " output_path=incremental_s3_output_location,\n", " base_job_name=incremental_training_job_name,\n", + " metric_definitions=training_metric_definitions,\n", ")\n", "\n", "incremental_train_estimator.fit({\"training\": training_dataset_s3_path}, logs=True)" ] }, { "cell_type": "markdown", - "id": "a54aa7a5", + "id": "ceb937a0", "metadata": {}, "source": [ - "Once trained, we can use the same steps as in [Deploy & run Inference on the fine-tuned model](#4.5.-Deploy-&-run-Inference-on-the-fine-tuned-model) to deploy the model." + "Once trained, we can use the same steps as in [Deploy & run Inference on the fine-tuned model](#4.6.-Deploy-&-run-Inference-on-the-fine-tuned-model) to deploy the model." ] }, @@ -813,4 +911,4 @@ }, "nbformat": 4, "nbformat_minor": 5 -} \ No newline at end of file +}
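The `metric_definitions` regexes this patch adds are what let SageMaker surface the training metrics in CloudWatch: each `Regex` is matched against the training job's log output, and the first capture group is published as the metric value. As a quick sanity check of those patterns, the sketch below applies the same four regexes to a sample Keras-style progress line; the log line itself is made up for illustration and is an assumption, not output from an actual training job.

```python
import re

# The four metric definitions added in this patch (copied from the notebook).
# SageMaker scans the training job's logs with each "Regex" and publishes
# the first capture group as a CloudWatch metric.
training_metric_definitions = [
    {"Name": "val_accuracy", "Regex": "val_accuracy: ([0-9\\.]+)"},
    {"Name": "val_loss", "Regex": "val_loss: ([0-9\\.]+)"},
    {"Name": "train_accuracy", "Regex": "- accuracy: ([0-9\\.]+)"},
    {"Name": "train_loss", "Regex": "- loss: ([0-9\\.]+)"},
]

# Hypothetical Keras-style epoch summary line, made up for illustration only.
log_line = (
    "100/100 - 12s - loss: 0.2134 - accuracy: 0.9312 "
    "- val_loss: 0.3021 - val_accuracy: 0.9050"
)

# Extract each metric the way a log scraper would: first match wins.
extracted = {}
for definition in training_metric_definitions:
    match = re.search(definition["Regex"], log_line)
    if match:
        extracted[definition["Name"]] = float(match.group(1))

print(extracted)
```

Note that the `train_loss` pattern `- loss: ` relies on the leading `"- "` to avoid also matching `val_loss`; the first occurrence in the line is the training loss, which is why the pattern ordering in the sample line matters.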