forked from aws/amazon-sagemaker-examples
Commit
add pytorch notebook from s3 (aws#156)
Showing 4 changed files with 377 additions and 43 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,312 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Debugging SageMaker Training Jobs with Tornasole" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Overview" | ||
] | ||
}, | ||
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Tornasole is a new capability of Amazon SageMaker for debugging machine learning training jobs. It lets you go beyond scalars such as loss and accuracy and gives you full visibility into all tensors 'flowing through the graph' during training or inference.\n", | |
"\n", | |
"Using Tornasole is a two-step process:\n", | |
"\n", | |
"### Saving tensors\n", | |
"\n", | |
"Tensors define the state of the training job at any particular instant in its lifecycle. Tornasole exposes a library that lets you capture these tensors and save them for analysis.\n", | |
"\n", | |
"### Analysis\n", | |
"\n", | |
"Analysis of the emitted tensors is expressed through the Tornasole concept of ***Rules***. Broadly, a Rule is a piece of analysis code that you write to compare tensors across the steps of a training job and to analyze them within each step.\n", | |
"\n", | |
"Analyzing the saved tensors requires the package `tornasole.rules`.\n", | |
"\n", | |
"This example guides you through installing the components required to emit tensors from a SageMaker training job and applying a rule over those tensors to monitor the live status of the job." | |
] | ||
}, | ||
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Setup\n", | |
"\n", | |
"As a first step, we'll install the tools required to emit (save) tensors and to apply rules that analyze them." | |
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"!aws s3 cp s3://tornasole-external-preview-use1/sdk/sagemaker-1.35.2.dev0.tar.gz .\n", | ||
"!pip -q install sagemaker-1.35.2.dev0.tar.gz\n", | ||
"!aws s3 cp s3://tornasole-external-preview-use1/sdk/sagemaker-tornasole.json .\n", | ||
"!aws configure add-model --service-model sagemaker-tornasole.json --service-name sagemaker" | ||
] | ||
}, | ||
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Now that setup is complete, we're ready to start a training job with debugging enabled.\n", | |
"\n", | |
"## Training with Script Mode\n", | |
"\n", | |
"We'll train a PyTorch model defined in the entry point script `simple.py`, using a SageMaker PyTorch 1.1.0 container with Script Mode." | |
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import boto3\n", | ||
"import sagemaker\n", | ||
"from sagemaker.pytorch import PyTorch" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Inputs\n", | ||
"\n", | ||
"Configuring the inputs for the training job" | ||
] | ||
}, | ||
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"entry_point_script = 'simple.py'\n", | |
"docker_image_name = '072677473360.dkr.ecr.us-west-2.amazonaws.com/tornasole-preprod-pytorch-1.1.0-cpu:latest'\n", | |
"#hyperparameters = {'epochs': 2, 'batch-size': 128, 'tornasole_path':'/opt/ml/output/tensors'}\n", | |
"#--lr .01 --momentum 0.9 --tornasole-frequency 3 --steps 10 --hook-type saveall --random-seed True,\n", | |
"hyperparameters = {'epochs': 2, 'tornasole_path': '/opt/ml/output/tensors', 'lr': 0.01, 'momentum': 0.9, 'tornasole-frequency': 3, 'steps': 10, 'hook-type': 'saveall', 'random-seed': True}\n" | |
] | ||
}, | ||
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### Parameters\n", | |
"Now we'll call the PyTorch Estimator to kick off a training job. The new parameters in the Estimator to look out for are:\n", | |
"\n", | |
"##### `debug` (bool)\n", | |
"This indicates that debugging should be enabled for the training job. Setting it to `True` makes Tornasole available for use with the job.\n", | |
"\n", | |
"##### `rules_specification` (list[*dict*])\n", | |
"This is a list of Python dictionaries, where each `dict` is of the following form:\n", | |
"```\n", | |
"{\n", | |
" \"RuleName\": <str> # The name of the class implementing the Tornasole Rule interface. (required)\n", | |
" \"SourceS3Uri\": <str> # S3 URI of the rule script containing the class named in 'RuleName'. If left empty, SageMaker looks for the class among the first-party rules provided by Amazon; otherwise it looks for the class in this script\n", | |
" \"InstanceType\": <str> # The ml instance type in which the rule evaluation should run\n", | |
" \"VolumeSizeInGB\": <int> # The volume size to store the runtime artifacts from the rule evaluation\n", | |
" \"RuntimeConfigurations\": {\n", | |
" # Map defining the parameters required to instantiate the Rule class and invoke the rule\n", | |
" <str>: <str>\n", | |
" }\n", | |
"}\n", | |
"```\n", | |
"#### Storage\n", | |
"The tensors are, by default, stored in the S3 output path of the training job, under the folder **`/tensors-<job name>`**. This ensures that tensors from one training job are not accidentally overwritten by those of another. Rule evaluation requires each job's tensors to be under a separate path to be evaluated correctly.\n", | |
"\n", | |
"If you don't provide an S3 output path to the estimator, SageMaker creates one for you as:\n", | |
"**`s3://sagemaker-<region>-<account_id>/`**\n", | |
"\n", | |
"See how we instantiate the estimator below." | |
] | ||
}, | ||
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Rule\n", | |
"\n", | |
"There are two ways to apply rules.\n", | |
"1. Use a first-party (1P) rule. Set `RuleName` to the name of the 1P rule and it will be applied automatically. Here we are using **`VanishingGradient`**. Leave `SourceS3Uri` empty when using a 1P rule.\n", | |
"2. Use a custom rule script and specify the S3 location of the script in `SourceS3Uri`." | |
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Estimator" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"sagemaker_execution_role = sagemaker.get_execution_role()\n", | ||
"#sagemaker_execution_role = 'AmazonSageMaker-ExecutionRole-20190614T145575'\n", | ||
"estimator = PyTorch(role=sagemaker_execution_role,\n", | ||
" base_job_name='pytorch-example',\n", | ||
" train_instance_count=1,\n", | ||
" train_instance_type='ml.m4.xlarge',\n", | ||
" image_name=docker_image_name,\n", | ||
" entry_point=entry_point_script,\n", | ||
" framework_version='1.1.0',\n", | ||
" hyperparameters=hyperparameters,\n", | ||
" py_version='py3',\n", | ||
" debug=True,\n", | ||
" rules_specification=[\n", | ||
" {\n", | ||
" \"RuleName\": \"VanishingGradient\",\n", | ||
" \"InstanceType\": \"ml.c5.4xlarge\",\n", | ||
" \"VolumeSizeInGB\": 10,\n", | ||
" \"RuntimeConfigurations\": {\n", | ||
" \"start-step\": \"1\",\n", | ||
" \"end-step\": \"50\"\n", | ||
" }\n", | ||
" }\n", | ||
" ])" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"estimator.fit()" | ||
] | ||
}, | ||
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Result\n", | |
"\n", | |
"As a result of the above command, SageMaker starts two training jobs for you: the first produces the tensors to be analyzed, and the second evaluates the rule you specified in `rules_specification`.\n", | |
"\n", | |
"### Check the status of the Rule Execution Job\n", | |
"To see the rule execution job that SageMaker started for you, run the command below. It shows the `RuleName`, `RuleStatus`, `FailureReason` (if any), and `RuleExecutionJobArn`. If the tensors meet a rule evaluation condition, the rule execution job throws a client error with `FailureReason: RuleEvaluationConditionMet`. You can look up the job's logs in the CloudWatch log group `/aws/sagemaker/TrainingJobs` using the `RuleExecutionJobArn`." | |
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"estimator.describe_rule_execution_jobs()" | ||
] | ||
}, | ||
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Receive CloudWatch Events for Your Jobs\n", | |
"When the status of a training job or rule execution job changes (e.g. starting, failed), TrainingJobStatus CloudWatch events are emitted: https://docs.aws.amazon.com/sagemaker/latest/dg/cloudwatch-events.html. You can configure a CloudWatch Events rule to receive and process these events by setting up a target (Lambda function, SNS).\n" | |
] | ||
}, | ||
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Same settings as the first run, except that lr is raised from 0.01 to 1.0\n", | |
"entry_point_script = 'simple.py'\n", | |
"docker_image_name = '072677473360.dkr.ecr.us-west-2.amazonaws.com/tornasole-preprod-pytorch-1.1.0-cpu:latest'\n", | |
"bad_hyperparameters = {'epochs': 2, 'tornasole_path': '/opt/ml/output/tensors', 'lr': 1.0, 'momentum': 0.9, 'tornasole-frequency': 3, 'steps': 10, 'hook-type': 'saveall', 'random-seed': True}\n" | |
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"sagemaker_execution_role = sagemaker.get_execution_role()\n", | ||
"#sagemaker_execution_role = 'AmazonSageMaker-ExecutionRole-20190614T145575'\n", | ||
"estimator = PyTorch(role=sagemaker_execution_role,\n", | ||
" base_job_name='pytorch-example',\n", | ||
" train_instance_count=1,\n", | ||
" train_instance_type='ml.m4.xlarge',\n", | ||
" image_name=docker_image_name,\n", | ||
" entry_point=entry_point_script,\n", | ||
" framework_version='1.1.0',\n", | ||
" hyperparameters=bad_hyperparameters,\n", | ||
" py_version='py3',\n", | ||
" debug=True,\n", | ||
" rules_specification=[\n", | ||
" {\n", | ||
" \"RuleName\": \"VanishingGradient\",\n", | ||
" \"InstanceType\": \"ml.c5.4xlarge\",\n", | ||
" \"VolumeSizeInGB\": 10,\n", | ||
" \"RuntimeConfigurations\": {\n", | ||
" \"start-step\": \"1\",\n", | ||
" \"end-step\": \"10\"\n", | ||
" }\n", | ||
" }\n", | ||
" ])" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"estimator.fit()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"estimator.describe_rule_execution_jobs()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "conda_pytorch_p36", | ||
"language": "python", | ||
"name": "conda_pytorch_p36" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.6.5" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 4 | ||
} |
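
The notebook above describes each `rules_specification` entry as a dictionary with a required `RuleName`, an optional `SourceS3Uri`, and a string-to-string `RuntimeConfigurations` map. A minimal sketch of how such an entry could be assembled and sanity-checked before it is passed to the estimator follows. The helper name `make_rule_spec` and its checks are assumptions made for illustration; they are not part of the SageMaker SDK or of this commit.

```python
from typing import Any, Dict, Optional


def make_rule_spec(
    rule_name: str,
    instance_type: str,
    volume_size_gb: int,
    runtime_config: Optional[Dict[str, Any]] = None,
    source_s3_uri: Optional[str] = None,
) -> Dict[str, Any]:
    """Build one `rules_specification` entry (hypothetical helper)."""
    # The notebook marks RuleName as the only required field.
    if not rule_name:
        raise ValueError("RuleName is required")
    # Assumption: a custom rule script is always referenced by an s3:// URI.
    if source_s3_uri is not None and not source_s3_uri.startswith("s3://"):
        raise ValueError("SourceS3Uri must be an S3 URI")

    spec: Dict[str, Any] = {
        "RuleName": rule_name,
        "InstanceType": instance_type,
        "VolumeSizeInGB": int(volume_size_gb),
        # The notebook documents this map as <str>: <str>, so coerce both sides.
        "RuntimeConfigurations": {
            str(key): str(value) for key, value in (runtime_config or {}).items()
        },
    }
    # Omit SourceS3Uri entirely for first-party rules, as the notebook advises.
    if source_s3_uri:
        spec["SourceS3Uri"] = source_s3_uri
    return spec


# Reproduces the first-party rule entry used in the notebook's first estimator.
vanishing_gradient = make_rule_spec(
    "VanishingGradient",
    "ml.c5.4xlarge",
    10,
    runtime_config={"start-step": 1, "end-step": 50},
)
print(vanishing_gradient)
```

The resulting dictionary could then be passed as `rules_specification=[vanishing_gradient]`. Coercing the runtime values to strings is a design choice that lets callers write integer step numbers while still matching the documented form.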