---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-"
)
devtools::load_all()
library(reticulate) # for accessing python objects from r and vice-versa
library(insuranceData) # for the data
library(dplyr) # for the data manipulation
library(purrr)
```
```{python, include=FALSE}
import numpy as np
import pandas as pd
import shap
import sklearn.ensemble as sk
```
# mshap
<!-- badges: start -->
[![Travis build status](https://travis-ci.com/srmatth/mshap.svg?branch=main)](https://travis-ci.com/srmatth/mshap)
[![codecov](https://codecov.io/gh/srmatth/mshap/branch/main/graph/badge.svg?token=80MEJIXXX9)](https://codecov.io/gh/srmatth/mshap)
<!-- badges: end -->
The goal of mshap is to make it easy to compute SHAP values for two-part models.
A two-part model is one where the output of one model is multiplied by the output of another model.
These models are often used in the actuarial industry, but they have other use cases as well.
This package is written in `R`, while the example use cases fit the models and calculate the SHAP values in Python.
We hope that interoperability between the two languages continues to grow, and the example here shows how easy it is to move between them.
## Installation
Install mSHAP from CRAN with the following code:
```r
install.packages("mshap")
```
Or install the development version from GitHub with:
```r
# install.packages("devtools")
devtools::install_github("srmatth/mshap")
```
## Basic Use
We will demonstrate a simple use case on simulated data.
Suppose that we wish to predict the total amount of money a consumer will spend on a subscription to a software product.
We might simulate four explanatory variables that look like the following:
```{r}
## R
set.seed(16)
age <- runif(1000, 18, 60)
income <- runif(1000, 50000, 150000)
married <- as.numeric(runif(1000, 0, 1) > 0.5)
sex <- as.numeric(runif(1000, 0, 1) > 0.5)
# For the sake of simplicity we will have these as numeric already, where 0 represents male and 1 represents female
```
Because this is a contrived example, we define the response variables directly as follows (suppose here that `cost_per_month` is usage-based, and therefore continuous):
```{r}
## R
cost_per_month <- (0.0006 * income - 0.2 * sex + 0.5 * married - 0.001 * age) + 10
num_months <- 15 * (0.001 * income * 0.001 * sex * 0.5 * married - 0.05 * age)^2
```
Thus, we have our data. We will combine the covariates into a single data frame for ease of use in Python.
```{r}
## R
X <- data.frame(age, income, married, sex)
```
The end goal of this exercise is to predict the total revenue from a given customer, which mathematically is `cost_per_month * num_months`.
Rather than multiplying these two vectors together up front and modeling the product, we will create two models: one to predict `cost_per_month` and the other to predict `num_months`. We can then multiply the outputs of the two models together to get our predictions.
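To make the target concrete, here is a small worked example that applies the two response formulas from the simulation to a single customer. The customer's values are hypothetical and chosen only for illustration; the Python syntax mirrors the R expressions above (`**` in place of `^`).

```python
# Worked example using the response formulas from the simulation above.
# The customer values below are hypothetical, chosen only for illustration.
age, income, married, sex = 30, 100_000, 1, 1

# Same expressions as the R code, with ** in place of ^
cost_per_month = (0.0006 * income - 0.2 * sex + 0.5 * married - 0.001 * age) + 10
num_months = 15 * (0.001 * income * 0.001 * sex * 0.5 * married - 0.05 * age) ** 2

# The quantity the two-part model is ultimately meant to predict
total_revenue = cost_per_month * num_months

print(round(cost_per_month, 2))  # 70.27
print(round(num_months, 4))      # 31.5375
print(round(total_revenue, 2))   # 2216.14
```

Each of the two models only ever sees one of the two factors; the product is formed after prediction.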
We now move over to Python to create our two models and predict on the training set:
```{python}
## Python
X = r.X
y1 = r.cost_per_month
y2 = r.num_months
cpm_mod = sk.RandomForestRegressor(n_estimators = 100, max_depth = 10, max_features = 2)
cpm_mod.fit(X, y1)
nm_mod = sk.RandomForestRegressor(n_estimators = 100, max_depth = 10, max_features = 2)
nm_mod.fit(X, y2)
cpm_preds = cpm_mod.predict(X)
nm_preds = nm_mod.predict(X)
tot_rev = cpm_preds * nm_preds
```
We will now proceed to use TreeSHAP and subsequently mSHAP to explain the ultimate model predictions.
```{python}
## Python
# because these are tree-based models, shap.Explainer uses TreeSHAP to calculate
# fast, exact SHAP values for each model individually
cpm_ex = shap.Explainer(cpm_mod)
cpm_shap = cpm_ex.shap_values(X)
cpm_expected_value = cpm_ex.expected_value
nm_ex = shap.Explainer(nm_mod)
nm_shap = nm_ex.shap_values(X)
nm_expected_value = nm_ex.expected_value
```
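For intuition about what these arrays contain, the sketch below computes exact Shapley values for a tiny model by brute-force enumeration of feature coalitions. It is not how TreeSHAP works internally (TreeSHAP exploits tree structure to avoid this exponential loop), and the value function assumed here, averaging over a background dataset, may differ from the one the `shap` library uses by default. The names `exact_shapley`, `toy_model`, and the data are all hypothetical.

```python
from itertools import combinations
from math import factorial


def exact_shapley(model, x, background):
    """Exact Shapley values for one observation by enumerating all coalitions.

    Features outside a coalition take their values from each background row
    in turn, and the model output is averaged (an assumed value function).
    """
    n = len(x)

    def value(coalition):
        total = 0.0
        for row in background:
            mixed = [x[i] if i in coalition else row[i] for i in range(n)]
            total += model(mixed)
        return total / len(background)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contrib = 0.0
        for size in range(n):
            # Classic Shapley weight for a coalition of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                contrib += weight * (value(s | {i}) - value(s))
        phi.append(contrib)

    expected_value = value(set())  # average prediction over the background
    return phi, expected_value


# Hypothetical toy model with an interaction between age and married
def toy_model(row):
    age, income, married = row
    return 0.5 * income + 2.0 * married - 0.1 * age * married


background = [[20, 50, 0], [40, 80, 1], [60, 110, 1], [30, 60, 0]]
x = [35, 100, 1]
phi, expected_value = exact_shapley(toy_model, x, background)

# Local accuracy: the baseline plus the contributions recovers the prediction
print(abs(expected_value + sum(phi) - toy_model(x)) < 1e-9)  # True
```

The additive property checked on the last line is the one that matters for what follows: each model's prediction is its expected value plus the sum of its SHAP values.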
```{r}
## R
final_shap <- mshap(
  shap_1 = py$cpm_shap,
  shap_2 = py$nm_shap,
  ex_1 = py$cpm_expected_value,
  ex_2 = py$nm_expected_value
)
head(final_shap$shap_vals)
final_shap$expected_value
```
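To see why two sets of additive SHAP values can be combined into contributions for the product at all, consider a single observation. The sketch below expands the product of the two additive explanations and shares each cross term equally between the two variables involved. This is an illustrative assumption, not the package's implementation: the function `product_contributions` is hypothetical, it uses the product of the two expected values as its baseline, and `mshap()` may distribute terms and adjust the baseline differently.

```python
def product_contributions(shap_1, shap_2, ex_1, ex_2):
    """Split (ex_1 + sum(shap_1)) * (ex_2 + sum(shap_2)) across variables.

    Cross terms shap_1[i] * shap_2[j] are shared equally between variables
    i and j. This sharing rule is an assumption made for illustration.
    """
    n = len(shap_1)
    contributions = []
    for i in range(n):
        # Terms that involve variable i and one of the two baselines
        main = ex_1 * shap_2[i] + ex_2 * shap_1[i]
        # Half of every cross term that involves variable i
        cross = 0.5 * sum(
            shap_1[i] * shap_2[j] + shap_1[j] * shap_2[i] for j in range(n)
        )
        contributions.append(main + cross)
    baseline = ex_1 * ex_2  # simplifying assumption for this sketch
    return contributions, baseline


# Hypothetical SHAP values for one observation with three variables
shap_1, ex_1 = [1.0, -0.5, 0.25], 10.0
shap_2, ex_2 = [2.0, 0.5, -1.0], 20.0

contributions, baseline = product_contributions(shap_1, shap_2, ex_1, ex_2)

pred_1 = ex_1 + sum(shap_1)  # 10.75
pred_2 = ex_2 + sum(shap_2)  # 21.5

# The per-variable contributions plus the baseline recover the product
print(abs(baseline + sum(contributions) - pred_1 * pred_2) < 1e-9)  # True
```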
As a check, you can see that the expected value for mSHAP is indeed the expected value of the combined model, that is, the mean of its predictions across the training data.
```{r}
## R
mean(py$tot_rev)
```
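This check is worth making because the mean of a product is generally not the product of the means, so multiplying the two individual expected values would not, in general, reproduce the mean of the multiplied predictions. A minimal sketch with hypothetical numbers:

```python
from statistics import mean

# Hypothetical predictions from two models on the same four observations
preds_1 = [10.0, 12.0, 14.0, 16.0]
preds_2 = [1.0, 2.0, 3.0, 4.0]

# Multiply the two average predictions
product_of_means = mean(preds_1) * mean(preds_2)

# Average the multiplied predictions
mean_of_products = mean([a * b for a, b in zip(preds_1, preds_2)])

print(product_of_means)                     # 32.5
print(mean_of_products)                     # 35.0
print(mean_of_products - product_of_means)  # 2.5
```

The gap equals the (population) covariance between the two sets of predictions, so it only vanishes when the two models' outputs are uncorrelated.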
We have now calculated the mSHAP values for the multiplied model outputs! This will allow us to explain our final model.
The mSHAP package comes with additional functions that can be used to visualize SHAP values in R.
What is shown here are the default outputs, but these functions return `{ggplot2}` objects that are easily customizable.
```{r, fig.width=5, fig.height=5,fig.align='center'}
## R
summary_plot(
  variable_values = X,
  shap_values = final_shap$shap_vals,
  names = c("age", "income", "married", "sex") # this is optional, since X has column names
)
```
```{r, fig.width=5, fig.height=5,fig.align='center'}
## R
observation_plot(
  variable_values = X[23,],
  shap_values = final_shap$shap_vals[23,],
  expected_value = final_shap$expected_value,
  names = c("age", "income", "married", "sex")
)
```
For another, more complex, use case run `vignette("mshap")`.
For more examples and options for plotting, run `vignette("mshap_plots")`.
## Citations
- For more information about SHAP values in general, visit the [SHAP GitHub page](https://github.com/slundberg/shap).
- If you use `{mshap}`, please cite [*mSHAP: SHAP Values for Two-Part Models*](https://arxiv.org/abs/2106.08990).