Commit

incorporate feedback
mchoi8739 committed Oct 7, 2020
1 parent ff111f7 commit 8b2585d
Showing 1 changed file with 5 additions and 5 deletions.
@@ -6,13 +6,13 @@
 "source": [
 "# Detect Stalled Training and Stop Training Job Using SageMaker Debugger Rule\n",
 " \n",
-"This notebook guides you how to use the `StalledTrainingRule` built-in rule. This rule can take an action to stop your training job, when the rule detects an inactivity in your training job for a certain time period. This functionality helps you monitor the training job status and save redundant resource usage.\n",
+"This notebook guides you how to use the `StalledTrainingRule` built-in rule. This rule can take an action to stop your training job, when the rule detects an inactivity in your training job for a certain time period. This functionality helps you monitor the training job status and reduces redundant resource usage.\n",
 "\n",
 "## How the StalledTrainingRule Built-in Rule Works\n",
 "\n",
 "Amazon Sagemaker Debugger captures tensors that you want to watch from training jobs on [AWS Deep Learning Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html#debugger-supported-aws-containers) or your local machine. If you use one of the Debugger-integrated Deep Learning Containers, you don't need to make any changes to your training script to use the functionality of built-in rules. For information about Debugger-supported SageMaker frameworks and versions, see [Debugger-supported framework versions for zero script change](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/sagemaker.md#zero-script-change). \n",
 "\n",
-"If you want to run a training script that uses partially supported framework by Debugger or your own custom container, you need to manually register the Debugger hook to your training script. The `smdebug` library provides tools to help the hook registration, and the sample script provided in the `src` folder includes the hook registration code as comment lines. If you want to explore using the manual hook registration, see the training script at `./src/simple_stalled_training.py`, and documentation at [smdebug TensorFlow hook](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/tensorflow.md), [smdebug PyTorch hook](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/pytorch.md), [smdebug MXNet hook](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/mxnet.md), and [smdebug XGBoost hook](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/xgboost.md).\n",
+"If you want to run a training script that uses partially supported framework by Debugger or your own custom container, you need to manually register the Debugger hook to your training script. The `smdebug` library provides tools to help the hook registration, and the sample script provided in the `src` folder includes the hook registration code as comment lines. For more information about how to manually register the Debugger hooks for this case, see the training script at `./src/simple_stalled_training.py`, and documentation at [smdebug TensorFlow hook](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/tensorflow.md), [smdebug PyTorch hook](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/pytorch.md), [smdebug MXNet hook](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/mxnet.md), and [smdebug XGBoost hook](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/xgboost.md).\n",
 "\n",
 "The Debugger `StalledTrainingRule` watches tensor updates from your training job. If the rule doesn't find new tensors updated to the default S3 URI for a threshold period of time, it takes an action to trigger the `StopTrainingJob` API operation. The following code cells set up a SageMaker TensorFlow estimator with the Debugger `StalledTrainingRule` to watch the `losses` pre-built tensor collection."
 ]
@@ -56,8 +56,8 @@
 "metadata": {},
 "source": [
 "### Create a unique training job prefix\n",
-"The unique prefix must be specified for `StalledTrainingRule` to identify the exact training job name that you want to monitor and stop when the rule triggers the stalled training job issue.\n",
-"If there are multiple training jobs sharing the same prefix, this rule may react to other training jobs. If the rule cannot find the exact training job name with a provided prefix, it will fallback to safe mode and not take action of stop the training job. Such rule evaluation process goes on in parallel while the training jobs are running. If you want to access the rule job logs, you will later find how to get the information at [Get a direct Amazon CloudWatch URL to find the current rule processing job log](#cw-url).\n",
+"A unique prefix must be specified for `StalledTrainingRule` to identify the exact training job name that you want to monitor and stop when the rule triggers the stalled training job issue.\n",
+"If there are multiple training jobs sharing the same prefix, this rule may react to other training jobs. If the rule cannot find the exact training job name with a provided prefix, it falls back to safe mode and does not stop the training job. The rule evaluation process goes on in parallel while the training jobs are running. If you want to access the rule job logs, you will later find how to get the information at [Get a direct Amazon CloudWatch URL to find the current rule processing job log](#cw-url).\n",
 "\n",
 "The following code cell includes:\n",
 "* a code line to create a unique `base_job_name_prefix`\n",
@@ -119,7 +119,7 @@
 "source": [
 "## Monitoring Training and Rule Evaluation Status\n",
 "\n",
-"Once you excute the `estimator.fit()` API, SageMaker initiates a trining job in the background, and Debugger initiates a `StalledTrainingRule` rule evaluation job in parallel.\n",
+"Once you execute the `estimator.fit()` API, SageMaker initiates a training job in the background, and Debugger initiates a `StalledTrainingRule` rule evaluation job in parallel.\n",
 "Because the training scripts has a few lines of code at the end to force a sleep mode for 10 minutes, the `RuleEvaluationStatus` for `StalledTrainingRule` will change to `IssuesFound` in 2 minutes after the sleep mode is on and trigger the `StopTrainingJob` API."
 ]
 },
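
For readers of this diff, the behavior the notebook text describes (a time threshold since the last tensor update, plus prefix matching with a safe-mode fallback) can be modeled with a minimal plain-Python sketch. This is an illustration only, not Debugger's implementation, and it does not call SageMaker or `smdebug`. The function names `is_stalled` and `jobs_to_stop` are hypothetical, and the 120-second default is an assumption taken from the "2 minutes" mentioned in the notebook text.

```python
from typing import List


def is_stalled(last_update_ts: float, now_ts: float,
               threshold_seconds: float = 120.0) -> bool:
    """Return True when no tensor update has been seen for at least
    threshold_seconds (assumed value, mirroring the "2 minutes" in the text)."""
    return (now_ts - last_update_ts) >= threshold_seconds


def jobs_to_stop(running_jobs: List[str], prefix: str) -> List[str]:
    """Return the running job names that start with the given prefix.

    An empty result models the "safe mode" case: no job is stopped.
    More than one result models the caveat that a shared prefix may
    cause the rule to react to other training jobs.
    """
    return [name for name in running_jobs if name.startswith(prefix)]


if __name__ == "__main__":
    # Hypothetical job names, for illustration only.
    jobs = ["stalled-demo-1602000000-train", "other-job-train"]
    if is_stalled(last_update_ts=0.0, now_ts=150.0):
        print(jobs_to_stop(jobs, "stalled-demo-1602000000"))
```

The sketch shows why the prefix should be unique: with a prefix shared by several jobs, `jobs_to_stop` returns more than one name, and with a prefix that matches nothing it returns an empty list and no job is stopped.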