review comments incorporated
vikeshpandey committed Aug 11, 2021
1 parent 27b8518 commit 03b6292
Showing 4 changed files with 23 additions and 1,025 deletions.
@@ -1,6 +1,6 @@
## Credit risk prediction and explainability with Amazon SageMaker

This example shows how to user SageMaker Clarify to run explainability jobs on a SageMaker hosted inference pipelines.
This example shows how to user SageMaker Clarify to run explainability jobs on a SageMaker hosted inference pipeline.

Below is the architecture diagram used in the solution:

@@ -10,12 +10,12 @@ Below is the architecture diagram used in the solution:
The notebook performs the following steps:

1. Prepare raw training and test data
2. Create a SageMakerProcessing job which performs preprocessing on the raw training data and also produce an SKlearn model which is reused while deployment.
3. Train an XGBoost model on the processed data using SageMaker's built-in XGBoost container.
4. Create a SageMaker Inference pipeline containing the SKlearn and XGBoost model in a series.
2. Create a SageMaker Processing job which performs preprocessing on the raw training data and also produces an SKlearn model which is reused for deployment.
3. Train an XGBoost model on the processed data using SageMaker's built-in XGBoost container
4. Create a SageMaker Inference pipeline containing the SKlearn and XGBoost model in a series
5. Perform inference by supplying raw test data
6. Set up and run explainability job powered by SageMaker Clarify
7. Use open source shap library to create summary and waterfall plots to understand the feature importance better.
7. Use open source shap library to create summary and waterfall plots to understand the feature importance better
8. Run bias analysis jobs
9. Clean up

@@ -111,14 +111,23 @@
"source": [
"### Download data\n",
"\n",
"First, __download__ the data [here](https://archive.ics.uci.edu/ml/datasets/South+German+Credit+%28UPDATE%29) and save it in the local folder.\n",
"First, __download__ the data and save it in the `data` folder.\n",
"\n",
"\n",
"$^{[2]}$ Ulrike Grömping\n",
"Beuth University of Applied Sciences Berlin\n",
"Website with contact information: https://prof.beuth-hochschule.de/groemping/."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"S3Downloader.download('s3://sagemaker-sample-files/datasets/tabular/uci_statlog_german_credit_data/SouthGermanCredit.asc', 'data')"
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -298,7 +307,7 @@
"outputs": [],
"source": [
"training_data = pd.read_csv(\n",
" \"data/SouthGermanCredit.txt\",\n",
" \"data/SouthGermanCredit.asc\",\n",
" names=credit_columns,\n",
" header=0,\n",
" sep=r\" \",\n",
@@ -314,8 +323,7 @@
"metadata": {},
"source": [
"### Data inspection\n",
"Plotting histograms for the distribution of the different features is a good way to visualize the data. \\n\n",
"TBD : Let's plot a few of the features that can be considered _sensitive_. \n"
"Plotting histograms for the distribution of the different features is a good way to visualize the data. \n"
]
},
{
@@ -409,7 +417,7 @@
"source": [
"### Using SageMaker Processing jobs for preprocessing\n",
"\n",
"We will use Sagemaker Processing jobs to perform the preprocessing on the raw data. Sagemaker Processing provides prebuilt container for SKlearn which we will use here. We will output a sklearn model that can be used for preprocessing inference requests. "
"We will use SageMaker Processing jobs to perform the preprocessing on the raw data. SageMaker Processing provides prebuilt container for SKlearn which we will use here. We will output a sklearn model that can be used for preprocessing inference requests. "
]
},
{
@@ -581,7 +589,7 @@
"source": [
"## Create an Inference Pipeline\n",
"\n",
"We will be deploying a sagemaker inference pipeline which will:\n",
"We will be deploying a SageMaker inference pipeline which will:\n",
" 1. Accept raw data as input\n",
" 1. preprocess the data with the SKlearn model we built earlier\n",
" 1. Pass the output of the Sklearn model as an input to the XGBoost model automatically\n",
@@ -638,7 +646,7 @@
"- a custom `predict_fn` for running the transform over the inputs\n",
"- a custom `model_fn` for deserializing the model\n",
"\n",
"We will be using the default implementation of the `output_function` provided by sagemaker SKlearn container. To know more, check out: https://github.com/aws/sagemaker-scikit-learn-container\n",
"We will be using the default implementation of the `output_function` provided by SageMaker SKlearn container. To know more, check out: https://github.com/aws/sagemaker-scikit-learn-container\n",
"\n"
]
},
@@ -881,9 +889,9 @@
"\n",
"We are interested in explaining bad credit predictions. Hence, we would like the baseline choice to have E(x) closer to 1(belonging to the good credit class). \n",
"\n",
"We use the [mode](https://en.wikipedia.org/wiki/Mode_(statistics) statistic to create the baseline. The mode is a good choice for categorical variables. We observe that the model prediction for the baseline has a high probability for the good credit class and hence it satisfies our requirement for the baseline. \n",
"We use the [mode](https://en.wikipedia.org/wiki/Mode_(statistics)) statistic to create the baseline. The mode is a good choice for categorical variables. We observe that the model prediction for the baseline has a high probability for the good credit class and hence it satisfies our requirement for the baseline. \n",
"\n",
"For more information on selecting informative vs non-informative baselines, see the documentation [here](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-feature-attribute-shap-baselines.html)"
"For more information on selecting informative vs non-informative baselines, see [SHAP Baselines for Explainability ](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-feature-attribute-shap-baselines.html)"
]
},
{
@@ -986,7 +994,7 @@
"source": [
"### Run SageMaker Clarify Explainability job\n",
"\n",
"All the configurations are in place. Let's start the explainability job. This will spin up an ephemeral sagemaker endpoint and perform inference and calculate explanations on that endpoint. It does not use any existing production endpoint deployments."
"All the configurations are in place. Let's start the explainability job. This will spin up an ephemeral SageMaker endpoint and perform inference and calculate explanations on that endpoint. It does not use any existing production endpoint deployments."
]
},
{
@@ -1277,15 +1285,6 @@
"#### SHAP explanation plot for a single bad credit ensemble prediction instance "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install matplotlib"
]
},
{
"cell_type": "code",
"execution_count": null,