change: Pipeline Caching Example Notebook Improvements (aws#3640)
* feature: pipeline caching notebook example

* change: initialize notebook

* feature: pipeline caching notebook example and tuning notebook adjustment

* fix: example notebook

* change: README

* fix: notebook code

* fix: grammar

* fix: more grammar

* fix: pr syntax and remove dataset

* fix: updated paths

* fix: tuning notebook formatting

* fix: more path corrections

* feature: more commentary, notebook improvements

* fix: grammar

* fix: use present tense

Co-authored-by: Brock Wade <[email protected]>
2 people authored and atqy committed Oct 28, 2022
1 parent 6cfd848 commit a40b9f3
Showing 4 changed files with 77 additions and 21 deletions.
(The three added image files, artifacts/studio_cache_hit.png, artifacts/studio_cache_miss.png, and artifacts/studio_cache_hit_zoomed.png, cannot be displayed in the diff view.)
@@ -30,7 +30,7 @@
"metadata": {},
"source": [
"## A SageMaker Pipeline\n",
"The pipeline that you will create follows a shortened version of a typical ML pattern. In this notebook we will include just two steps - preprocessing and training."
"The pipeline in this notebook follows a shortened version of a typical ML pattern. Just two steps are included - preprocessing and training."
]
},
{
@@ -76,7 +76,7 @@
"source": [
"## Define Constants\n",
"\n",
"Before you upload the data to an S3 bucket, gather some constants you can use later in this notebook."
"Before downloading the dataset, gather some constants you can use later in this notebook."
]
},
{
@@ -143,9 +143,7 @@
"The parameters defined in this workflow include:\n",
"\n",
"* `processing_instance_count` - The instance count of the processing job.\n",
"* `instance_type` - The `ml.*` instance type of the training job.\n",
"* `model_approval_status` - The approval status to register with the trained model for CI/CD purposes (\"PendingManualApproval\" is the default).\n",
"* `mse_threshold` - The Mean Squared Error (MSE) threshold used to verify the accuracy of a model."
"* `instance_type` - The `ml.*` instance type of the training job."
]
},
{
@@ -162,11 +160,7 @@
")\n",
"\n",
"processing_instance_count = ParameterInteger(name=\"ProcessingInstanceCount\", default_value=1)\n",
"instance_type = ParameterString(name=\"TrainingInstanceType\", default_value=\"ml.m5.xlarge\")\n",
"model_approval_status = ParameterString(\n",
" name=\"ModelApprovalStatus\", default_value=\"PendingManualApproval\"\n",
")\n",
"mse_threshold = ParameterFloat(name=\"MseThreshold\", default_value=6.0)"
"instance_type = ParameterString(name=\"TrainingInstanceType\", default_value=\"ml.m5.xlarge\")"
]
},
{
@@ -226,9 +220,9 @@
"id": "a45e80ec",
"metadata": {},
"source": [
"Finally, we take the output of the processor's `run` method and pass that as arguments to the `ProcessingStep`. By passing the `pipeline_session` to the `sagemaker_session`, calling `.run()` does not launch the processing job, it returns a function call that will execute once the pipeline gets built, and create the arguments needed to run the job as a step in the pipeline.\n",
"Finally, we take the output of the processor's `run` method and pass that as arguments to the `ProcessingStep`. When passing a `pipeline_session` as the `sagemaker_session` parameter, this causes the `.run()` method to return a function call rather than launch a processing job. The function call executes once the pipeline gets built, and creates the arguments needed to run the job as a step in the pipeline.\n",
"\n",
"Note the `\"train_data\"` and `\"test_data\"` named channels specified in the output configuration for the processing job. Step `Properties` can be used in subsequent steps and resolve to their runtime values at execution. Specifically, this usage is called out when you define the training step."
"Note the `\"train\"` and `\"validation\"`, and `\"test\"` named channels specified in the output configuration for the processing job. Step `Properties` can be used in subsequent steps and resolve to their runtime values at execution. Specifically, this usage is called out when you define the training step."
]
},
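As a hedged sketch of the pattern described above (the variable names `role` and `input_data`, the step name, and the script path are assumptions for illustration, not taken from this diff):

```python
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.steps import ProcessingStep

pipeline_session = PipelineSession()

sklearn_processor = SKLearnProcessor(
    framework_version="1.0-1",
    instance_type="ml.m5.xlarge",
    instance_count=1,
    role=role,  # assumed to be defined earlier in the notebook
    sagemaker_session=pipeline_session,  # defers job launch to pipeline build time
)

# .run() does not launch a job here; it returns delayed step arguments.
processor_args = sklearn_processor.run(
    inputs=[ProcessingInput(source=input_data, destination="/opt/ml/processing/input")],
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="validation", source="/opt/ml/processing/validation"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
    ],
    code="code/preprocessing.py",  # hypothetical script path
)

step_process = ProcessingStep(name="Preprocess", step_args=processor_args)
```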
{
@@ -342,9 +336,9 @@
"id": "d74c6821",
"metadata": {},
"source": [
"Finally, we use the output of the estimator's `.fit()` method as arguments to the `TrainingStep`. By passing the `pipeline_session` to the `sagemaker_session`, calling `.fit()` does not launch the training job, it returns a function call that will execute once the pipeline gets built, and create the arguments needed to run the job as a step in the pipeline.\n",
"Finally, we use the output of the estimator's `.fit()` method as arguments to the `TrainingStep`. When passing a `pipeline_session` as the `sagemaker_session` parameter, this causes the `.fit()` method to return a function call rather than launch the training job. The function call executes once the pipeline gets built, and creates the arguments needed to run the job as a step in the pipeline.\n",
"\n",
"Pass in the `S3Uri` of the `\"train_data\"` output channel to the `.fit()` method. The `properties` attribute of a Pipeline step matches the object model of the corresponding response of a describe call. These properties can be referenced as placeholder values and are resolved at runtime. For example, the `ProcessingStep` `properties` attribute matches the object model of the [DescribeProcessingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeProcessingJob.html) response object."
"Pass in the `S3Uri` of the `\"train\"` output channel to the `.fit()` method. The `properties` attribute of a Pipeline step matches the object model of the corresponding response of a describe call. These properties can be referenced as placeholder values and are resolved at runtime. For example, the `ProcessingStep` `properties` attribute matches the object model of the [DescribeProcessingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeProcessingJob.html) response object."
]
},
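A minimal sketch of that reference pattern, assuming an estimator named `xgb_train` built with the same `pipeline_session`, and the `step_process` step from the earlier sketch:

```python
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import TrainingStep

# Reference the "train" channel of the processing step; the property resolves
# to its concrete S3 URI only at pipeline execution time.
train_data_uri = step_process.properties.ProcessingOutputConfig.Outputs[
    "train"
].S3Output.S3Uri

step_train = TrainingStep(
    name="Train",
    step_args=xgb_train.fit(
        inputs={"train": TrainingInput(s3_data=train_data_uri, content_type="text/csv")}
    ),
)
```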
{
@@ -395,8 +389,6 @@
" parameters=[\n",
" processing_instance_count,\n",
" instance_type,\n",
" model_approval_status,\n",
" mse_threshold,\n",
" ],\n",
" steps=[step_process, step_train],\n",
")"
@@ -417,7 +409,7 @@
"id": "673480a3",
"metadata": {},
"source": [
"For example, you might check the `ProcessingInputs` of the pre-processing step. The Python SDK intentionally structures input code artifacts' S3 paths in order to optimize caching. Before input code files from the local file system are uploaded to S3, they are hashed, and the hash value is included in the S3 path. A pipeline and step path hierarchy is followed when constructing the entire S3Uri."
"For example, you might check the `ProcessingInputs` of the pre-processing step. The Python SDK intentionally structures input code artifacts' S3 paths in order to optimize caching - more explanation on this later in the notebook."
]
},
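One way to inspect those inputs, assuming the `pipeline` variable from above:

```python
import json

definition = json.loads(pipeline.definition())
# Print the ProcessingInputs of the first step; note the hash segment
# embedded in each code artifact's S3 path.
print(json.dumps(definition["Steps"][0]["Arguments"]["ProcessingInputs"], indent=2))
```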
{
@@ -534,7 +526,15 @@
"metadata": {},
"source": [
"## Caching Behavior\n",
"In the next part of the notebook, we'll observe both cache hit and cache miss scenarios."
"In the next part of the notebook, we observe both cache hit and cache miss scenarios. There are many parameters that are passed into SageMaker pipeline steps. Some directly influence the results of the corresponding SageMaker jobs such as the input data, while others describe how the job will run, for example an `instance_type`. When parameters from the first group are updated, a cache miss occurs and the step re-runs. When parameters from the second group are updated, a cache hit occurs and the step does not execute, as the job results are unaffected. In the following pipeline execution examples, parameters from both categories are updated and the effects of each one are observed."
]
},
{
"cell_type": "markdown",
"id": "8b287cd8",
"metadata": {},
"source": [
"There are many other parameters outside of these examples - for more information on how they affect caching, or for more information on how to opt in to or out of caching, please refer to the [Developer Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-caching.html) and the [Python SDK docs](https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_model_building_pipeline.html#caching-configuration). "
]
},
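Caching itself is opt-in per step via `CacheConfig`. A minimal sketch of enabling it, reusing names from the earlier sketches:

```python
from sagemaker.workflow.steps import CacheConfig

# Cached results are reused for matching step executions up to the expiry
# window, expressed as an ISO 8601 duration (here, 30 days).
cache_config = CacheConfig(enable_caching=True, expire_after="P30D")

step_process = ProcessingStep(
    name="Preprocess",
    step_args=processor_args,  # from the earlier sketch
    cache_config=cache_config,
)
```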
{
@@ -550,7 +550,55 @@
"id": "24d93baf",
"metadata": {},
"source": [
"To verify whether or a cache hit or cache miss occurred for a particular step during a pipeline execution, open the SageMaker resources tab on the left. Click on Pipelines in the dropdown menu and find the \"AbaloneBetaPipelineCaching\" pipeline created in this notebook. Click on the pipeline in order to view the different executions tracked under that pipeline. You can click on each execution to view a graph of the steps and their behavior during that execution. In the graph, click on a step and then click on the \"information\" column to view the cache information."
"To verify whether a cache hit or miss occurred for a particular step during a pipeline execution, open the SageMaker resources tab on the left. Click on Pipelines in the dropdown menu and find the \"AbaloneBetaPipelineCaching\" pipeline created in this notebook. Click on the pipeline in order to view the different executions tracked under that pipeline. You can click on each execution to view a graph of the steps and their behavior during that execution. In the graph, click on a step and then click on the \"information\" column to view the cache information."
]
},
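The same information is also available programmatically. A sketch using boto3, assuming the `execution` object returned by `pipeline.start()`:

```python
import boto3

sm_client = boto3.client("sagemaker")
response = sm_client.list_pipeline_execution_steps(PipelineExecutionArn=execution.arn)
for step in response["PipelineExecutionSteps"]:
    # CacheHitResult is present only when the step was served from the cache.
    print(step["StepName"], step.get("CacheHitResult", "no cache hit"))
```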
{
"cell_type": "markdown",
"id": "e6b60061",
"metadata": {},
"source": [
"Here is an example of a cache hit in SageMaker Studio, displayed in the pane on the right side of the page:"
]
},
{
"cell_type": "markdown",
"id": "0df8e0cf",
"metadata": {},
"source": [
"![\"studio cache hit image\"](artifacts/studio_cache_hit.png)"
]
},
{
"cell_type": "markdown",
"id": "adc61a47",
"metadata": {},
"source": [
"And here is an example of a cache miss in SageMaker Studio:"
]
},
{
"cell_type": "markdown",
"id": "e79ea02d",
"metadata": {},
"source": [
"![\"studio cache miss image\"](artifacts/studio_cache_miss.png)"
]
},
{
"cell_type": "markdown",
"id": "ecaf5b1f",
"metadata": {},
"source": [
"Information tab with cache hit result, enlarged:"
]
},
{
"cell_type": "markdown",
"id": "fd313931",
"metadata": {},
"source": [
"![\"studio cache hit zoomed image\"](artifacts/studio_cache_hit_zoomed.png)"
]
},
{
@@ -659,7 +707,7 @@
"id": "8a8d759d",
"metadata": {},
"source": [
"Update the pipeline and re-execute. The new execution will result in cache hits for both steps, as the `instance_type` parameter does not affect the cache."
"Update the pipeline and re-execute. The new execution results in cache hits for both steps, as the `instance_type` parameter does not affect the result of the jobs. SageMaker does not track this parameter when evaluating the cache for previous step executions, so it has no effect."
]
},
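For example, a sketch of overriding the training instance type at execution time, using the `pipeline` and `role` variables assumed earlier:

```python
pipeline.upsert(role_arn=role)
# TrainingInstanceType is not part of the cache key, so both steps
# should report cache hits on this execution.
execution = pipeline.start(parameters={"TrainingInstanceType": "ml.m5.2xlarge"})
execution.wait()
```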
{
@@ -834,7 +882,15 @@
"id": "06bbd71d",
"metadata": {},
"source": [
"Because input code artifacts and hyperparameters directly affect the job results, these attributes are tracked by the cache. This will result in cache misses during the next pipeline execution, and both steps will re-execute."
"Because input code artifacts and hyperparameters directly affect the job results, these attributes are tracked by SageMaker. This results in cache misses during the next pipeline execution, and both steps re-execute."
]
},
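A sketch of a change that does invalidate the cache, assuming the hypothetical `xgb_train` estimator from the earlier sketches:

```python
# Changing a hyperparameter changes the training job's result, so SageMaker
# treats the step as new and re-executes it on the next pipeline run.
xgb_train.set_hyperparameters(num_round=60)  # hypothetical new value

# Rebuild the training step and pipeline with the updated estimator, then
# upsert and start a new execution; the step will be a cache miss.
```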
{
"cell_type": "markdown",
"id": "ef19b698",
"metadata": {},
"source": [
"**Note**: When local data or code artifacts are passed in as parameters to pipeline steps, the Python SDK uses a specific path structure when uploading these artifacts to S3. The contents of code files and in some cases configuration files are hashed, and this hash is included in the S3 upload path (View the pipeline definition to see the path structure). Because SageMaker tracks the S3 paths of these artifacts when evaluating whether a step has already executed or not, this ensures that when a new local code or data file is provided, the SDK creates a new S3 upload path, a cache miss will occur, and the step will run again with the new data. For more information on the Python SDK's S3 path structures, see the [Python SDK docs](https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_model_building_pipeline.html#caching-configuration)."
]
},
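A simplified illustration of the idea, not the SDK's exact path scheme; `bucket` and `pipeline_name` are assumed:

```python
import hashlib

def content_hash(path: str) -> str:
    # Hash the file contents; a changed file yields a new hash, and
    # therefore a new S3 key and a cache miss on the next execution.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# Hypothetical path shape, for illustration only:
code_s3_uri = (
    f"s3://{bucket}/{pipeline_name}/code/"
    f"{content_hash('code/preprocessing.py')}/preprocessing.py"
)
```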
{
