change: Pipeline Caching Example Notebook Improvements (aws#3640)
* feature: pipeline caching notebook example

* change: initialize notebook

* feature: pipeline caching notebook example and tuning notebook adjustment

* fix: example notebook

* change: README

* fix: notebook code

* fix: grammar

* fix: more grammar

* fix: pr syntax and remove dataset

* fix: updated paths

* fix: tuning notebook formatting

* fix: more path corrections

* feature: more commentary, notebook improvements

* fix: grammar

* fix: use present tense

Co-authored-by: Brock Wade <[email protected]>
2 people authored and atqy committed Oct 28, 2022
1 parent 6cfd848 commit a40b9f3
Showing 4 changed files with 77 additions and 21 deletions.
(The three added image files, artifacts/studio_cache_hit.png, artifacts/studio_cache_miss.png, and artifacts/studio_cache_hit_zoomed.png, cannot be displayed in the diff view.)
@@ -30,7 +30,7 @@
"metadata": {},
"source": [
"## A SageMaker Pipeline\n",
"The pipeline that you will create follows a shortened version of a typical ML pattern. In this notebook we will include just two steps - preprocessing and training."
"The pipeline in this notebook follows a shortened version of a typical ML pattern. Just two steps are included - preprocessing and training."
]
},
{
@@ -76,7 +76,7 @@
"source": [
"## Define Constants\n",
"\n",
"Before you upload the data to an S3 bucket, gather some constants you can use later in this notebook."
"Before downloading the dataset, gather some constants you can use later in this notebook."
]
},
{
@@ -143,9 +143,7 @@
"The parameters defined in this workflow include:\n",
"\n",
"* `processing_instance_count` - The instance count of the processing job.\n",
"* `instance_type` - The `ml.*` instance type of the training job.\n",
"* `model_approval_status` - The approval status to register with the trained model for CI/CD purposes (\"PendingManualApproval\" is the default).\n",
"* `mse_threshold` - The Mean Squared Error (MSE) threshold used to verify the accuracy of a model."
"* `instance_type` - The `ml.*` instance type of the training job."
]
},
{
@@ -162,11 +160,7 @@
")\n",
"\n",
"processing_instance_count = ParameterInteger(name=\"ProcessingInstanceCount\", default_value=1)\n",
"instance_type = ParameterString(name=\"TrainingInstanceType\", default_value=\"ml.m5.xlarge\")\n",
"model_approval_status = ParameterString(\n",
" name=\"ModelApprovalStatus\", default_value=\"PendingManualApproval\"\n",
")\n",
"mse_threshold = ParameterFloat(name=\"MseThreshold\", default_value=6.0)"
"instance_type = ParameterString(name=\"TrainingInstanceType\", default_value=\"ml.m5.xlarge\")"
]
},
{
@@ -226,9 +220,9 @@
"id": "a45e80ec",
"metadata": {},
"source": [
"Finally, we take the output of the processor's `run` method and pass that as arguments to the `ProcessingStep`. By passing the `pipeline_session` to the `sagemaker_session`, calling `.run()` does not launch the processing job, it returns a function call that will execute once the pipeline gets built, and create the arguments needed to run the job as a step in the pipeline.\n",
"Finally, we take the output of the processor's `run` method and pass that as arguments to the `ProcessingStep`. When passing a `pipeline_session` as the `sagemaker_session` parameter, this causes the `.run()` method to return a function call rather than launch a processing job. The function call executes once the pipeline gets built, and creates the arguments needed to run the job as a step in the pipeline.\n",
"\n",
"Note the `\"train_data\"` and `\"test_data\"` named channels specified in the output configuration for the processing job. Step `Properties` can be used in subsequent steps and resolve to their runtime values at execution. Specifically, this usage is called out when you define the training step."
"Note the `\"train\"` and `\"validation\"`, and `\"test\"` named channels specified in the output configuration for the processing job. Step `Properties` can be used in subsequent steps and resolve to their runtime values at execution. Specifically, this usage is called out when you define the training step."
]
},
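As a hedged sketch of the pattern described above (the variable names `role` and `input_data`, the step name, and the script path are assumptions for illustration, not taken from this diff):

```python
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.steps import ProcessingStep

pipeline_session = PipelineSession()

sklearn_processor = SKLearnProcessor(
    framework_version="1.0-1",
    instance_type="ml.m5.xlarge",
    instance_count=1,
    role=role,  # assumed to be defined earlier in the notebook
    sagemaker_session=pipeline_session,  # defers job launch to pipeline build time
)

# .run() does not launch a job here; it returns delayed step arguments.
processor_args = sklearn_processor.run(
    inputs=[ProcessingInput(source=input_data, destination="/opt/ml/processing/input")],
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="validation", source="/opt/ml/processing/validation"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
    ],
    code="code/preprocessing.py",  # hypothetical script path
)

step_process = ProcessingStep(name="Preprocess", step_args=processor_args)
```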
{
@@ -342,9 +336,9 @@
"id": "d74c6821",
"metadata": {},
"source": [
"Finally, we use the output of the estimator's `.fit()` method as arguments to the `TrainingStep`. By passing the `pipeline_session` to the `sagemaker_session`, calling `.fit()` does not launch the training job, it returns a function call that will execute once the pipeline gets built, and create the arguments needed to run the job as a step in the pipeline.\n",
"Finally, we use the output of the estimator's `.fit()` method as arguments to the `TrainingStep`. When passing a `pipeline_session` as the `sagemaker_session` parameter, this causes the `.fit()` method to return a function call rather than launch the training job. The function call executes once the pipeline gets built, and creates the arguments needed to run the job as a step in the pipeline.\n",
"\n",
"Pass in the `S3Uri` of the `\"train_data\"` output channel to the `.fit()` method. The `properties` attribute of a Pipeline step matches the object model of the corresponding response of a describe call. These properties can be referenced as placeholder values and are resolved at runtime. For example, the `ProcessingStep` `properties` attribute matches the object model of the [DescribeProcessingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeProcessingJob.html) response object."
"Pass in the `S3Uri` of the `\"train\"` output channel to the `.fit()` method. The `properties` attribute of a Pipeline step matches the object model of the corresponding response of a describe call. These properties can be referenced as placeholder values and are resolved at runtime. For example, the `ProcessingStep` `properties` attribute matches the object model of the [DescribeProcessingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeProcessingJob.html) response object."
]
},
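A minimal sketch of that reference pattern, assuming an estimator named `xgb_train` built with the same `pipeline_session`, and the `step_process` step from the earlier sketch:

```python
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import TrainingStep

# Reference the "train" channel of the processing step; the property resolves
# to its concrete S3 URI only at pipeline execution time.
train_data_uri = step_process.properties.ProcessingOutputConfig.Outputs[
    "train"
].S3Output.S3Uri

step_train = TrainingStep(
    name="Train",
    step_args=xgb_train.fit(
        inputs={"train": TrainingInput(s3_data=train_data_uri, content_type="text/csv")}
    ),
)
```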
{
@@ -395,8 +389,6 @@
" parameters=[\n",
" processing_instance_count,\n",
" instance_type,\n",
" model_approval_status,\n",
" mse_threshold,\n",
" ],\n",
" steps=[step_process, step_train],\n",
")"
@@ -417,7 +409,7 @@
"id": "673480a3",
"metadata": {},
"source": [
"For example, you might check the `ProcessingInputs` of the pre-processing step. The Python SDK intentionally structures input code artifacts' S3 paths in order to optimize caching. Before input code files from the local file system are uploaded to S3, they are hashed, and the hash value is included in the S3 path. A pipeline and step path hierarchy is followed when constructing the entire S3Uri."
"For example, you might check the `ProcessingInputs` of the pre-processing step. The Python SDK intentionally structures input code artifacts' S3 paths in order to optimize caching - more explanation on this later in the notebook."
]
},
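One way to inspect those inputs, assuming the `pipeline` variable from above:

```python
import json

definition = json.loads(pipeline.definition())
# Print the ProcessingInputs of the first step; note the hash segment
# embedded in each code artifact's S3 path.
print(json.dumps(definition["Steps"][0]["Arguments"]["ProcessingInputs"], indent=2))
```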
{
@@ -534,7 +526,15 @@
"metadata": {},
"source": [
"## Caching Behavior\n",
"In the next part of the notebook, we'll observe both cache hit and cache miss scenarios."
"In the next part of the notebook, we observe both cache hit and cache miss scenarios. There are many parameters that are passed into SageMaker pipeline steps. Some directly influence the results of the corresponding SageMaker jobs such as the input data, while others describe how the job will run, for example an `instance_type`. When parameters from the first group are updated, a cache miss occurs and the step re-runs. When parameters from the second group are updated, a cache hit occurs and the step does not execute, as the job results are unaffected. In the following pipeline execution examples, parameters from both categories are updated and the effects of each one are observed."
]
},
{
"cell_type": "markdown",
"id": "8b287cd8",
"metadata": {},
"source": [
"There are many other parameters outside of these examples - for more information on how they affect caching, or for more information on how to opt in to or out of caching, please refer to the [Developer Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-caching.html) and the [Python SDK docs](https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_model_building_pipeline.html#caching-configuration). "
]
},
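Caching itself is opt-in per step via `CacheConfig`. A minimal sketch of enabling it, reusing names from the earlier sketches:

```python
from sagemaker.workflow.steps import CacheConfig

# Cached results are reused for matching step executions up to the expiry
# window, expressed as an ISO 8601 duration (here, 30 days).
cache_config = CacheConfig(enable_caching=True, expire_after="P30D")

step_process = ProcessingStep(
    name="Preprocess",
    step_args=processor_args,  # from the earlier sketch
    cache_config=cache_config,
)
```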
{
@@ -550,7 +550,55 @@
"id": "24d93baf",
"metadata": {},
"source": [
"To verify whether or a cache hit or cache miss occurred for a particular step during a pipeline execution, open the SageMaker resources tab on the left. Click on Pipelines in the dropdown menu and find the \"AbaloneBetaPipelineCaching\" pipeline created in this notebook. Click on the pipeline in order to view the different executions tracked under that pipeline. You can click on each execution to view a graph of the steps and their behavior during that execution. In the graph, click on a step and then click on the \"information\" column to view the cache information."
"To verify whether a cache hit or miss occurred for a particular step during a pipeline execution, open the SageMaker resources tab on the left. Click on Pipelines in the dropdown menu and find the \"AbaloneBetaPipelineCaching\" pipeline created in this notebook. Click on the pipeline in order to view the different executions tracked under that pipeline. You can click on each execution to view a graph of the steps and their behavior during that execution. In the graph, click on a step and then click on the \"information\" column to view the cache information."
]
},
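The same information is also available programmatically. A sketch using boto3, assuming the `execution` object returned by `pipeline.start()`:

```python
import boto3

sm_client = boto3.client("sagemaker")
response = sm_client.list_pipeline_execution_steps(PipelineExecutionArn=execution.arn)
for step in response["PipelineExecutionSteps"]:
    # CacheHitResult is present only when the step was served from the cache.
    print(step["StepName"], step.get("CacheHitResult", "no cache hit"))
```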
{
"cell_type": "markdown",
"id": "e6b60061",
"metadata": {},
"source": [
"Here is an example of a cache hit in SageMaker Studio, displayed in the pane on the right side of the page:"
]
},
{
"cell_type": "markdown",
"id": "0df8e0cf",
"metadata": {},
"source": [
"![\"studio cache hit image\"](artifacts/studio_cache_hit.png)"
]
},
{
"cell_type": "markdown",
"id": "adc61a47",
"metadata": {},
"source": [
"And here is an example of a cache miss in SageMaker Studio:"
]
},
{
"cell_type": "markdown",
"id": "e79ea02d",
"metadata": {},
"source": [
"![\"studio cache miss image\"](artifacts/studio_cache_miss.png)"
]
},
{
"cell_type": "markdown",
"id": "ecaf5b1f",
"metadata": {},
"source": [
"Information tab with cache hit result, enlarged:"
]
},
{
"cell_type": "markdown",
"id": "fd313931",
"metadata": {},
"source": [
"![\"studio cache hit zoomed image\"](artifacts/studio_cache_hit_zoomed.png)"
]
},
{
@@ -659,7 +707,7 @@
"id": "8a8d759d",
"metadata": {},
"source": [
"Update the pipeline and re-execute. The new execution will result in cache hits for both steps, as the `instance_type` parameter does not affect the cache."
"Update the pipeline and re-execute. The new execution results in cache hits for both steps, as the `instance_type` parameter does not affect the result of the jobs. SageMaker does not track this parameter when evaluating the cache for previous step executions, so it has no effect."
]
},
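For example, a sketch of overriding the training instance type at execution time, using the `pipeline` and `role` variables assumed earlier:

```python
pipeline.upsert(role_arn=role)
# TrainingInstanceType is not part of the cache key, so both steps
# should report cache hits on this execution.
execution = pipeline.start(parameters={"TrainingInstanceType": "ml.m5.2xlarge"})
execution.wait()
```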
{
@@ -834,7 +882,15 @@
"id": "06bbd71d",
"metadata": {},
"source": [
"Because input code artifacts and hyperparameters directly affect the job results, these attributes are tracked by the cache. This will result in cache misses during the next pipeline execution, and both steps will re-execute."
"Because input code artifacts and hyperparameters directly affect the job results, these attributes are tracked by SageMaker. This results in cache misses during the next pipeline execution, and both steps re-execute."
]
},
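A sketch of a change that does invalidate the cache, assuming the hypothetical `xgb_train` estimator from the earlier sketches:

```python
# Changing a hyperparameter changes the training job's result, so SageMaker
# treats the step as new and re-executes it on the next pipeline run.
xgb_train.set_hyperparameters(num_round=60)  # hypothetical new value

# Rebuild the training step and pipeline with the updated estimator, then
# upsert and start a new execution; the step will be a cache miss.
```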
{
"cell_type": "markdown",
"id": "ef19b698",
"metadata": {},
"source": [
"**Note**: When local data or code artifacts are passed in as parameters to pipeline steps, the Python SDK uses a specific path structure when uploading these artifacts to S3. The contents of code files and in some cases configuration files are hashed, and this hash is included in the S3 upload path (View the pipeline definition to see the path structure). Because SageMaker tracks the S3 paths of these artifacts when evaluating whether a step has already executed or not, this ensures that when a new local code or data file is provided, the SDK creates a new S3 upload path, a cache miss will occur, and the step will run again with the new data. For more information on the Python SDK's S3 path structures, see the [Python SDK docs](https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_model_building_pipeline.html#caching-configuration)."
]
},
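A simplified illustration of the idea, not the SDK's exact path scheme; `bucket` and `pipeline_name` are assumed:

```python
import hashlib

def content_hash(path: str) -> str:
    # Hash the file contents; a changed file yields a new hash, and
    # therefore a new S3 key and a cache miss on the next execution.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# Hypothetical path shape, for illustration only:
code_s3_uri = (
    f"s3://{bucket}/{pipeline_name}/code/"
    f"{content_hash('code/preprocessing.py')}/preprocessing.py"
)
```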
{
