review comments incorporated

aws · Aug 11, 2021 · 03b6292
1 parent 27b8518
commit 03b6292

Showing 4 changed files with 23 additions and 1,025 deletions.
diff --git a/sagemaker-clarify/clarify-explainability-inference-pipelines/README.md b/sagemaker-clarify/clarify-explainability-inference-pipelines/README.md
@@ -1,6 +1,6 @@
 ## Credit risk prediction and explainability with Amazon SageMaker
 
-This example shows how to user SageMaker Clarify to run explainability jobs on a SageMaker hosted inference pipelines. 
+This example shows how to user SageMaker Clarify to run explainability jobs on a SageMaker hosted inference pipeline. 
 
 Below is the architecture diagram used in the solution:
 
@@ -10,12 +10,12 @@ Below is the architecture diagram used in the solution:
 The notebook performs the following steps:
 
 1. Prepare raw training and test data
-2. Create a SageMakerProcessing job which performs preprocessing on the raw training data and also produce an SKlearn model which is reused while deployment.
-3. Train an XGBoost model on the processed data using SageMaker's built-in XGBoost container.
-4. Create a SageMaker Inference pipeline containing the SKlearn and XGBoost model in a series.
+2. Create a SageMaker Processing job which performs preprocessing on the raw training data and also produces an SKlearn model which is reused for deployment.
+3. Train an XGBoost model on the processed data using SageMaker's built-in XGBoost container
+4. Create a SageMaker Inference pipeline containing the SKlearn and XGBoost model in a series
 5. Perform inference by supplying raw test data
 6. Set up and run explainability job powered by SageMaker Clarify
-7. Use open source shap library to create summary and waterfall plots to understand the feature importance better.
+7. Use open source shap library to create summary and waterfall plots to understand the feature importance better
 8. Run bias analysis jobs
 9. Clean up
 

diff --git a/...y-explainability-inference-pipelines/credit_risk_explainability_inference_pipelines.ipynb b/...y-explainability-inference-pipelines/credit_risk_explainability_inference_pipelines.ipynb
@@ -111,14 +111,23 @@
    "source": [
     "### Download data\n",
     "\n",
-    "First,  __download__ the data [here](https://archive.ics.uci.edu/ml/datasets/South+German+Credit+%28UPDATE%29) and save it in the local folder.\n",
+    "First,  __download__ the data and save it in the `data` folder.\n",
     "\n",
     "\n",
     "$^{[2]}$ Ulrike Grömping\n",
     "Beuth University of Applied Sciences Berlin\n",
     "Website with contact information: https://prof.beuth-hochschule.de/groemping/."
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "S3Downloader.download('s3://sagemaker-sample-files/datasets/tabular/uci_statlog_german_credit_data/SouthGermanCredit.asc', 'data')"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -298,7 +307,7 @@
    "outputs": [],
    "source": [
     "training_data = pd.read_csv(\n",
-    "    \"data/SouthGermanCredit.txt\",\n",
+    "    \"data/SouthGermanCredit.asc\",\n",
     "    names=credit_columns,\n",
     "    header=0,\n",
     "    sep=r\" \",\n",
@@ -314,8 +323,7 @@
    "metadata": {},
    "source": [
     "### Data inspection\n",
-    "Plotting histograms for the distribution of the different features is a good way to visualize the data. \\n\n",
-    "TBD : Let's plot a few of the features that can be considered _sensitive_.  \n"
+    "Plotting histograms for the distribution of the different features is a good way to visualize the data. \n"
    ]
   },
   {
@@ -409,7 +417,7 @@
    "source": [
     "### Using SageMaker Processing jobs for preprocessing\n",
     "\n",
-    "We will use Sagemaker Processing jobs to perform the preprocessing on the raw data. Sagemaker Processing provides prebuilt container for SKlearn which we will use here. We will output a sklearn model that can be used for preprocessing inference requests. "
+    "We will use SageMaker Processing jobs to perform the preprocessing on the raw data. SageMaker Processing provides prebuilt container for SKlearn which we will use here. We will output a sklearn model that can be used for preprocessing inference requests. "
    ]
   },
   {
@@ -581,7 +589,7 @@
    "source": [
     "## Create an Inference Pipeline\n",
     "\n",
-    "We will be deploying a sagemaker inference pipeline which will:\n",
+    "We will be deploying a SageMaker inference pipeline which will:\n",
     "  1. Accept raw data as input\n",
     "  1. preprocess the data with the SKlearn model we built earlier\n",
     "  1. Pass the output of the Sklearn model as an input to the XGBoost model automatically\n",
@@ -638,7 +646,7 @@
     "- a custom `predict_fn` for running the transform over the inputs\n",
     "- a custom `model_fn` for deserializing the model\n",
     "\n",
-    "We will be using the default implementation of the `output_function` provided by sagemaker SKlearn container. To know more, check out: https://github.com/aws/sagemaker-scikit-learn-container\n",
+    "We will be using the default implementation of the `output_function` provided by SageMaker SKlearn container. To know more, check out: https://github.com/aws/sagemaker-scikit-learn-container\n",
     "\n"
    ]
   },
@@ -881,9 +889,9 @@
     "\n",
     "We are interested in explaining bad credit predictions. Hence, we would like the baseline choice to have E(x) closer to 1(belonging to the good credit class). \n",
     "\n",
-    "We use the [mode](https://en.wikipedia.org/wiki/Mode_(statistics) statistic to create the baseline. The mode is a good choice for categorical variables. We observe that the model prediction for the baseline has a high probability for the good credit class and hence it satisfies our requirement for the baseline. \n",
+    "We use the [mode](https://en.wikipedia.org/wiki/Mode_(statistics)) statistic to create the baseline. The mode is a good choice for categorical variables. We observe that the model prediction for the baseline has a high probability for the good credit class and hence it satisfies our requirement for the baseline. \n",
     "\n",
-    "For more information on selecting informative vs non-informative baselines, see the documentation [here](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-feature-attribute-shap-baselines.html)"
+    "For more information on selecting informative vs non-informative baselines, see [SHAP Baselines for Explainability ](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-feature-attribute-shap-baselines.html)"
    ]
   },
   {
@@ -986,7 +994,7 @@
    "source": [
     "### Run SageMaker Clarify Explainability job\n",
     "\n",
-    "All the configurations are in place. Let's start the explainability job. This will spin up an ephemeral sagemaker endpoint and perform inference and calculate explanations on that endpoint. It does not use any existing production endpoint deployments."
+    "All the configurations are in place. Let's start the explainability job. This will spin up an ephemeral SageMaker endpoint and perform inference and calculate explanations on that endpoint. It does not use any existing production endpoint deployments."
    ]
   },
   {
@@ -1277,15 +1285,6 @@
     "#### SHAP explanation plot for a single bad credit ensemble prediction instance "
    ]
   },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "!pip install matplotlib"
-   ]
-  },
   {
    "cell_type": "code",
    "execution_count": null,

diff --git a/...r-clarify/clarify-explainability-inference-pipelines/credit_risk_prediction.png b/...r-clarify/clarify-explainability-inference-pipelines/credit_risk_prediction.png