diff --git a/examples/pytorch/sagemaker-notebooks/pytorch(1).ipynb b/examples/pytorch/sagemaker-notebooks/pytorch(1).ipynb new file mode 100644 index 0000000000..7f52fa3999 --- /dev/null +++ b/examples/pytorch/sagemaker-notebooks/pytorch(1).ipynb @@ -0,0 +1,312 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Debugging SageMaker Training Jobs with Tornasole" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Overview" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Tornasole is a new capability of Amazon SageMaker that lets you debug machine learning training jobs. It lets you go beyond just looking at scalars like losses and accuracies during training and gives you full visibility into all tensors 'flowing through the graph' during training or inference.\n", + "\n", + "Using Tornasole is a two-step process:\n", + "\n", + "### Saving tensors\n", + "\n", + "Tensors define the state of the training job at any particular instant in its lifecycle. Tornasole exposes a library which allows you to capture these tensors and save them for analysis.\n", + "\n", + "### Analysis\n", + "\n", + "Analysis of the emitted tensors is captured by the Tornasole concept called ***Rules***. At a high level, a Rule is a piece of analysis code that you write to compare tensors across steps of a training job and to analyze them within each step.\n", + "\n", + "Analyzing the saved tensors requires the package `tornasole.rules`.\n", + "\n", + "This example walks through installing the components required to emit tensors from a SageMaker training job and applying a rule to those tensors to monitor the live status of the job." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setup\n", + "\n", + "As a first step, we'll install the tools required to emit (save) tensors and to apply rules that analyze them." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!aws s3 cp s3://tornasole-external-preview-use1/sdk/sagemaker-1.35.2.dev0.tar.gz .\n", + "!pip -q install sagemaker-1.35.2.dev0.tar.gz\n", + "!aws s3 cp s3://tornasole-external-preview-use1/sdk/sagemaker-tornasole.json .\n", + "!aws configure add-model --service-model file://sagemaker-tornasole.json --service-name sagemaker" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that we've completed the setup, we're ready to launch a training job with debugging enabled.\n", + "\n", + "## Training with Script Mode\n", + "\n", + "We'll be training a small PyTorch convolutional network, defined in the entry point script `simple.py`. This will be done using a SageMaker PyTorch 1.1.0 container with Script Mode." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import boto3\n", + "import sagemaker\n", + "from sagemaker.pytorch import PyTorch" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Inputs\n", + "\n", + "Configuring the inputs for the training job" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "entry_point_script = 'simple.py'\n", + "docker_image_name= '072677473360.dkr.ecr.us-west-2.amazonaws.com/tornasole-preprod-pytorch-1.1.0-cpu:latest'\n", + "#hyperparameters = {'epochs': 2, 'batch-size': 128, 'tornasole_path':'/opt/ml/output/tensors'}\n", + "#--lr .01 --momentum 0.9 --tornasole-frequency 3 --steps 10 --hook-type saveall --random-seed True,\n", + "hyperparameters = {'epochs': 2,'tornasole_path':'/opt/ml/output/tensors', 'lr' : 0.01, 'momentum' : 0.9, 'tornasole-frequency' : 3, 'steps' : 10, 'hook-type' : 'saveall', 'random-seed' : True }\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Parameters\n", + "Now we'll call the PyTorch Estimator to kick off a training job. The new parameters in the Estimator to look out for are:\n", + "\n", + "##### `debug` (bool)\n", + "This indicates that debugging should be enabled for the training job. Setting this to `True` makes Tornasole available for use with the job.\n", + "\n", + "##### `rules_specification` (list[*dict*])\n", + "This is a list of Python dictionaries, where each `dict` is of the following form:\n", + "```\n", + "{\n", + " \"RuleName\": # The name of the class implementing the Tornasole Rule interface. (required)\n", + " \"SourceS3Uri\": # S3 URI of the rule script containing the class in 'RuleName'. If left empty, it would look for the class in one of the First Party rules already provided to you by Amazon. 
If not, SageMaker will try to look for the rule class in the script\n", + " \"InstanceType\": # The ml instance type in which the rule evaluation should run\n", + " \"VolumeSizeInGB\": # The volume size to store the runtime artifacts from the rule evaluation\n", + " \"RuntimeConfigurations\": {\n", + " # Map defining the parameters required to instantiate the Rule class and invoke the rule\n", + " : \n", + " }\n", + "}\n", + "```\n", + "#### Storage\n", + "The tensors are, by default, stored in the S3 output path of the training job, under the folder **`/tensors-`**. This ensures that the tensors from one training job are not accidentally overwritten by those of another. Rule evaluation requires each job's tensors to be kept under a separate path.\n", + "\n", + "If you don't provide an S3 output path to the estimator, SageMaker creates one for you as:\n", + "**`s3://sagemaker--/`**\n", + "\n", + "See the way we instantiate the estimator below." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Rule\n", + "\n", + "There are two ways to apply rules.\n", + "1. Use a first-party (1P) rule. Set `RuleName` to the name of the 1P rule, and the rule will be applied automatically. Here we are using **`VanishingGradient`**. Leave `SourceS3Uri` empty when using a 1P rule.\n", + "2. 
Use a custom rule script and specify the S3 location of the script in `SourceS3Uri`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Estimator" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sagemaker_execution_role = sagemaker.get_execution_role()\n", + "#sagemaker_execution_role = 'AmazonSageMaker-ExecutionRole-20190614T145575'\n", + "estimator = PyTorch(role=sagemaker_execution_role,\n", + " base_job_name='pytorch-example',\n", + " train_instance_count=1,\n", + " train_instance_type='ml.m4.xlarge',\n", + " image_name=docker_image_name,\n", + " entry_point=entry_point_script,\n", + " framework_version='1.1.0',\n", + " hyperparameters=hyperparameters,\n", + " py_version='py3',\n", + " debug=True,\n", + " rules_specification=[\n", + " {\n", + " \"RuleName\": \"VanishingGradient\",\n", + " \"InstanceType\": \"ml.c5.4xlarge\",\n", + " \"VolumeSizeInGB\": 10,\n", + " \"RuntimeConfigurations\": {\n", + " \"start-step\": \"1\",\n", + " \"end-step\": \"50\"\n", + " }\n", + " }\n", + " ])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "estimator.fit()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Result\n", + "\n", + "As a result of the above command, SageMaker will start two training jobs for you: the first produces the tensors to be analyzed, and the second evaluates the rule you specified in `rules_specification`.\n", + "\n", + "### Check the status of the Rule Execution Job\n", + "To get the rule execution job that SageMaker started for you, run the command below; it shows you the `RuleName`, `RuleStatus`, `FailureReason` if any, and `RuleExecutionJobArn`. If the tensors meet a rule evaluation condition, the rule execution job throws a client error with `FailureReason: RuleEvaluationConditionMet`. 
You can look up its logs in the CloudWatch log group `/aws/sagemaker/TrainingJobs` using the `RuleExecutionJobArn`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "estimator.describe_rule_execution_jobs()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Receive CloudWatch Events for Your Jobs\n", + "When the status of a training job or rule execution job changes (e.g. starting, failed), TrainingJobStatus CloudWatch events are emitted: https://docs.aws.amazon.com/sagemaker/latest/dg/cloudwatch-events.html. You can configure a CloudWatch Events rule to receive and process these events by setting up a target (Lambda function, SNS).\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "entry_point_script = 'simple.py'\n", + "docker_image_name= '072677473360.dkr.ecr.us-west-2.amazonaws.com/tornasole-preprod-pytorch-1.1.0-cpu:latest'\n", + "bad_hyperparameters = {'epochs': 2,'tornasole_path':'/opt/ml/output/tensors', 'lr' : 1.0, 'momentum' : 0.9, 'tornasole-frequency' : 3, 'steps' : 10, 'hook-type' : 'saveall', 'random-seed' : True }\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sagemaker_execution_role = sagemaker.get_execution_role()\n", + "#sagemaker_execution_role = 'AmazonSageMaker-ExecutionRole-20190614T145575'\n", + "estimator = PyTorch(role=sagemaker_execution_role,\n", + " base_job_name='pytorch-example',\n", + " train_instance_count=1,\n", + " train_instance_type='ml.m4.xlarge',\n", + " image_name=docker_image_name,\n", + " entry_point=entry_point_script,\n", + " framework_version='1.1.0',\n", + " hyperparameters=bad_hyperparameters,\n", + " py_version='py3',\n", + " debug=True,\n", + " rules_specification=[\n", + " {\n", + " \"RuleName\": \"VanishingGradient\",\n", + " \"InstanceType\": \"ml.c5.4xlarge\",\n", + " \"VolumeSizeInGB\": 10,\n", + " 
\"RuntimeConfigurations\": {\n", + " \"start-step\": \"1\",\n", + " \"end-step\": \"10\"\n", + " }\n", + " }\n", + " ])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "estimator.fit()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "estimator.describe_rule_execution_jobs()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "conda_pytorch_p36", + "language": "python", + "name": "conda_pytorch_p36" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.5" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/examples/pytorch/scripts/pytorch_hook_demos.py b/examples/pytorch/scripts/pytorch_hook_demos.py index ab7b57f84d..25f1cfd407 100644 --- a/examples/pytorch/scripts/pytorch_hook_demos.py +++ b/examples/pytorch/scripts/pytorch_hook_demos.py @@ -8,8 +8,8 @@ import torch.optim as optim from torchvision import datasets, transforms from torch.autograd import Variable -from tornasole.pytorch.hook import * -from tornasole.pytorch.torch_collection import * +from tornasole.pytorch import * +from tornasole.pytorch import * import tornasole.pytorch as ts diff --git a/examples/pytorch/scripts/simple.py b/examples/pytorch/scripts/simple.py index 21728f3400..db2cf10e00 100644 --- a/examples/pytorch/scripts/simple.py +++ b/examples/pytorch/scripts/simple.py @@ -7,8 +7,7 @@ import torch.nn.functional as F import torch.optim as optim from torch.autograd import Variable -from tornasole.pytorch.hook import * -from tornasole.pytorch.torch_collection import * +from tornasole.pytorch import * class Net(nn.Module): @@ -20,88 
+19,111 @@ def __init__(self): self.add_module('conv2', nn.Conv2d(20, 50, 5, 1)) self.add_module('relu1', nn.ReLU()) self.add_module('max_pool2', nn.MaxPool2d(2, stride=2)) - self.add_module('fc1', nn.Linear(4*4*50, 500)) + self.add_module('fc1', nn.Linear(4 * 4 * 50, 500)) self.add_module('relu2', nn.ReLU()) self.add_module('fc2', nn.Linear(500, 10)) - def forward(self, x): x = self.relu0(self.conv1(x)) x = self.max_pool(x) x = self.relu1(self.conv2(x)) x = self.max_pool2(x) - x = x.view(-1, 4*4*50) + x = x.view(-1, 4 * 4 * 50) x = self.relu2(self.fc1(x)) x = self.fc2(x) return F.log_softmax(x, dim=1) + # Create a tornasole hook. The initilization of hook determines which tensors # are logged while training is in progress. # Following function shows the default initilization that enables logging of # weights, biases and gradients in the model. -def create_tornasole_hook(output_dir, module=None, hook_type='saveall', save_steps=None): +def create_tornasole_hook(output_dir, module=None, hook_type='saveall', + save_steps=None): # Create a hook that logs weights, biases, gradients and inputs/ouputs of model if hook_type == 'saveall': - hook = TornasoleHook(out_dir=output_dir, save_config=SaveConfig(save_steps=save_steps), save_all=True) + hook = TornasoleHook(out_dir=output_dir, + save_config=SaveConfig(save_steps=save_steps), + save_all=True) elif hook_type == 'module-input-output': # The names of input and output tensors of a module are in following format # Inputs : _input_, and # Output : _output # In order to log the inputs and output of a module, we will create a collection as follows: assert module is not None - get_collection('l_mod').add_module_tensors(module, inputs=True, outputs=True) + get_collection('l_mod').add_module_tensors(module, inputs=True, + outputs=True) # Create a hook that logs weights, biases, gradients and inputs/outputs of model - hook = TornasoleHook(out_dir=output_dir, save_config=SaveConfig(save_steps=save_steps), - 
include_collections=['weights', 'gradients', 'bias','l_mod']) + hook = TornasoleHook(out_dir=output_dir, + save_config=SaveConfig(save_steps=save_steps), + include_collections=['weights', 'gradients', + 'bias', 'l_mod']) elif hook_type == 'weights-bias-gradients': save_config = SaveConfig(save_steps=save_steps) # Create a hook that logs ONLY weights, biases, and gradients hook = TornasoleHook(out_dir=output_dir, save_config=save_config) return hook + def train(model, device, optimizer, num_steps=500, save_steps=[]): model.train() count = 0 # for batch_idx, (data, target) in enumerate(train_loader): for i in range(num_steps): - batch_size=32 - data, target = torch.rand(batch_size, 1, 28, 28), torch.rand(batch_size).long() + batch_size = 32 + data, target = torch.rand(batch_size, 1, 28, 28), torch.rand( + batch_size).long() data, target = data.to(device), target.to(device) optimizer.zero_grad() - output = model(Variable(data, requires_grad = True)) + output = model(Variable(data, requires_grad=True)) loss = F.nll_loss(output, target) loss.backward() optimizer.step() -parser = argparse.ArgumentParser(description='PyTorch MNIST Example') -parser.add_argument('--batch-size', type=int, default=64, metavar='N', - help='input batch size for training (default: 64)') -parser.add_argument('--epochs', type=int, default=1, metavar='N', - help='number of epochs to train (default: 1)') -parser.add_argument('--lr', type=float, default=0.01, metavar='LR', - help='learning rate (default: 0.01)') -parser.add_argument('--momentum', type=float, default=0.9, metavar='M', - help='SGD momentum (default: 0.9)') -parser.add_argument('--tornasole-frequency', type=int, default=10, help='frequency with which to save steps') -parser.add_argument('--steps', type=int, default=100, help='number of steps') -parser.add_argument('--tornasole_path', type=str, help="output directory to save data in", default='./tornasole-testing/demo/') -parser.add_argument('--hook-type', type=str, choices=['saveall', 
'module-input-output', 'weights-bias-gradients'], default='saveall') -parser.add_argument('--random-seed', type=bool, default=False) -args = parser.parse_args() +def main(): + parser = argparse.ArgumentParser(description='PyTorch MNIST Example') + parser.add_argument('--batch-size', type=int, default=64, metavar='N', + help='input batch size for training (default: 64)') + parser.add_argument('--epochs', type=int, default=1, metavar='N', + help='number of epochs to train (default: 1)') + parser.add_argument('--lr', type=float, default=0.01, metavar='LR', + help='learning rate (default: 0.01)') + parser.add_argument('--momentum', type=float, default=0.9, metavar='M', + help='SGD momentum (default: 0.9)') + parser.add_argument('--tornasole-frequency', type=int, default=10, + help='frequency with which to save steps') + parser.add_argument('--steps', type=int, default=100, + help='number of steps') + parser.add_argument('--tornasole_path', type=str, + help="output directory to save data in", + default='/opt/ml/output/tensors') + parser.add_argument('--hook-type', type=str, + choices=['saveall', 'module-input-output', + 'weights-bias-gradients'], default='saveall') + parser.add_argument('--random-seed', type=bool, default=False) + + args = parser.parse_args() + + if args.random_seed: + torch.manual_seed(2) + np.random.seed(2) + random.seed(12) + hook_type = 'saveall' + device = torch.device("cpu") + save_steps = [(i + 1) * args.tornasole_frequency for i in + range(args.steps // args.tornasole_frequency)] + model = Net().to(device) + hook = create_tornasole_hook(args.tornasole_path, model, hook_type, + save_steps=save_steps) + hook.register_hook(model) + optimizer = optim.SGD(model.parameters(), lr=args.lr, + momentum=args.momentum) + train(model, device, optimizer, num_steps=args.steps, + save_steps=save_steps) -if args.random_seed: - torch.manual_seed(2) - np.random.seed(2) - random.seed(12) -hook_type = 'saveall' -device = torch.device("cpu") -save_steps = [(i+1) * 
args.tornasole_frequency for i in range(args.steps//args.tornasole_frequency)] -model = Net().to(device) -hook = create_tornasole_hook(args.tornasole_path, model, hook_type, save_steps=save_steps) +if __name__ == '__main__': + main() -hook.register_hook(model) -optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum) -train(model, device, optimizer, num_steps=args.steps, save_steps=save_steps) diff --git a/examples/pytorch/scripts/torch_resnet.py b/examples/pytorch/scripts/torch_resnet.py index d3bad58888..3c7e8a0c58 100644 --- a/examples/pytorch/scripts/torch_resnet.py +++ b/examples/pytorch/scripts/torch_resnet.py @@ -9,7 +9,7 @@ import torch.optim as optim from torchvision import datasets, transforms from torch.autograd import Variable -from tornasole.pytorch.hook import * +from tornasole.pytorch import * import time model_names = sorted(name for name in models.__dict__