---
title: "Hands-on Machine Learning: Medical Insurance Data"
description: "A practical exercise to hone your machine learning skills using real-world data."
author:
  - name: Julien Fouret
    email: [email protected]
output: html_document
---
# Introduction
This exercise immerses you in a real-world scenario. You will:
- explore the nuances of the dataset
- apply machine learning techniques
- gain a tangible feel for data manipulation and modeling
- understand the full ML pipeline
# 1. Dataset Familiarization
## 1.1 Import and quick exploration
Your first task is to get familiar with the practical steps of data handling.
### Context of the dataset
Insurance companies leverage data to make informed decisions about policy pricing. Predicting future insurance charges for new subscriptions is vital. By understanding features like BMI (Body Mass Index), sex, and age, companies can adjust pricing strategies to stay competitive while managing risks and profitability.
Given our dataset, which likely represents insurance charges over a year, our goal is to predict these charges for potential new subscribers.
### Attributes:
The dataset we'll be working with has been sourced from Kaggle. It provides various attributes of individuals and their corresponding medical insurance charges.
[**Dataset Link**](https://www.kaggle.com/datasets/joebeachcapital/medical-insurance-costs/)
**License**: Open
- `age`: Age of the insured
- `sex`: Gender of the insured
- `bmi`: Body Mass Index
- `children`: Number of children/dependents
- `smoker`: Smoking status
- `region`: Residential region
- `charges`: Medical insurance cost
- (Optionally, the dataset is also available on e-campus.)
- **Load the dataset into a Colab session**
- Import the dataset
<details><summary>Hints</summary>
```python
import pandas as pd
data = pd.read_csv('insurance.csv')
```
</details>
### Dive into the dataset briefly to recognize its structure:
- **Data Types**
<details><summary>Hints</summary>
use the `dtypes` attribute
</details>
- **Dataset Size**
<details><summary>Hints</summary>
use the `shape` attribute
</details>
- **Column names**
<details><summary>Hints</summary>
use the `columns` attribute
</details>
- **Look at the first lines**
<details><summary>Hints</summary>
use the `head()` method
</details>
- **Summarize the columns**
<details><summary>Hints</summary>
use the `describe()` method
use the `include` argument to also summarize the dtypes missing from the first output
</details>
- **Identify** the key features and the target variable:
<details><summary>**Target**</summary>
`charges`
</details>
<details><summary>**Discrete features**</summary>
`age`, `sex`, `children`, `smoker`, `region`
</details>
<details><summary>**Continuous features**</summary>
`bmi`
</details>
- What are the possible values for `children`, `region`, or `smoker`? (A quick-exploration sketch follows this list.)
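A minimal quick-exploration sketch, assuming the dataframe was loaded as `data` in the previous step; the calls are standard pandas, but which columns to inspect is up to you:

```python
# Structure of the dataframe
print(data.dtypes)    # column data types
print(data.shape)     # (number of rows, number of columns)
print(data.columns)   # column names
print(data.head())    # first lines

# Summaries: numerical columns first, then the remaining (object) dtypes
print(data.describe())
print(data.describe(include="object"))

# Possible values of the discrete columns
for col in ["children", "region", "smoker"]:
    print(col, sorted(data[col].unique()))
```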
## 1.2 Data Visualization
- Use data visualization techniques to explore potential relations between the target and the features.
<details><summary>**Hints**</summary>
use a combination of `pandas.melt` and `seaborn.relplot`
keep the target as the id variable in the long format
use `kind="scatter"` in `relplot`
free the plot scales with `facet_kws={"sharex": False}`
other `relplot` options of interest: `alpha`, `col_wrap`
</details>
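A possible visualization sketch, assuming the dataframe `data` from above and `charges` as the target:

```python
import seaborn as sns

# Long format: keep the target as the id variable and stack the features
long_df = data.melt(id_vars="charges", var_name="feature", value_name="value")

# One scatter panel per feature, with independent x scales
sns.relplot(
    data=long_df,
    x="value", y="charges",
    col="feature", col_wrap=3,
    kind="scatter", alpha=0.5,
    facet_kws={"sharex": False},
)
```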
# 2. Basic Linear Modeling
## 2.1 Data Encoding
Linear models do not accept strings as input; this is where encoding helps.
- **Encoding Categorical Variables**
<details><summary>**Hints**</summary>
use `pd.get_dummies`
decide whether to set `drop_first` `True` or `False`
</details>
- **Feature-Target Split**
Set up your `X` and `y` (a combined sketch for encoding and splitting follows the hints).
<details><summary>**Hints**</summary>
Look at the `drop` method of dataframes.
A pandas DataFrame or a NumPy array is usually fine.
</details>
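A combined sketch for the encoding and the feature-target split, assuming the raw dataframe `data` and its categorical columns `sex`, `smoker`, and `region`:

```python
import pandas as pd

# One-hot encode the categorical columns; drop_first avoids redundant dummy columns
data_enc = pd.get_dummies(data, columns=["sex", "smoker", "region"], drop_first=True)

# Feature-target split
X = data_enc.drop(columns="charges")
y = data_enc["charges"]
```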
## 2.2 Model Training
Apply your knowledge of linear models to train on the dataset:
1. **Train the Model**
Use linear algebra techniques or ML libraries as you see fit.
<details><summary>**Hints**</summary>
Use linear algebra: `@` for matrix multiplication and `scipy.linalg.inv` to invert a matrix.
Alternatively, set up a score function that takes a vector of parameters as argument and use `scipy.optimize.minimize`.
</details>
2. **Predictions**
Generate predictions based on the model you trained (see the sketch below).
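A sketch of the linear-algebra route, assuming `X` and `y` come from the split above; `sklearn.linear_model.LinearRegression` or `scipy.optimize.minimize` are equally valid alternatives:

```python
import numpy as np
from scipy import linalg

# Design matrix with an intercept column of ones
X_mat = np.column_stack([np.ones(len(X)), X.to_numpy(dtype=float)])
y_vec = y.to_numpy(dtype=float)

# Ordinary least squares: beta = (X^T X)^{-1} X^T y
beta = linalg.inv(X_mat.T @ X_mat) @ X_mat.T @ y_vec

# Predictions from the fitted parameters
y_pred = X_mat @ beta
```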
## 2.3 Diagnostics
Quickly gauge the performance of your model:
- **Inspect Residuals**
A histogram can provide insights.
<details><summary>**Hints**</summary>
Use `seaborn.histplot`
</details>
- **Evaluate the regression using the R² score**
<details><summary>**Hints**</summary>
Use `sklearn.metrics.r2_score`
</details>
- **Plot predicted values vs targets**
<details><summary>**Hints**</summary>
Use `seaborn.regplot` or `seaborn.scatterplot`
</details>
- **Use dataviz to plot residuals vs. features** (a diagnostics sketch follows this list)
<details><summary>**Hints**</summary>
Use `seaborn.relplot` after `pd.melt`
The `hue` argument might be useful.
</details>
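A diagnostics sketch, assuming `y`, `y_pred`, and the feature dataframe `X` from the previous steps:

```python
import seaborn as sns
from sklearn.metrics import r2_score

residuals = y - y_pred

# Distribution of the residuals
sns.histplot(residuals)

# Global fit quality
print("R2:", r2_score(y, y_pred))

# Predicted values vs. targets
sns.scatterplot(x=y, y=y_pred)

# Residuals vs. each feature, one panel per feature
diag = X.copy()
diag["residuals"] = residuals
long_diag = diag.melt(id_vars="residuals", var_name="feature", value_name="value")
sns.relplot(data=long_diag, x="value", y="residuals",
            col="feature", col_wrap=3, alpha=0.5,
            facet_kws={"sharex": False})
```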
# 3. Some Feature Engineering
## 3.1 A derived feature ?
Challenge yourself: can you derive any new feature that might be relevant for the model, given your previous observations?
<details><summary>**Hints**</summary>
How is BMI used in real life?
Sometimes a feature holds too much non-essential information and is better used after being reduced and categorized.
</details>
## 3.2 Interaction Term
Given your observations, is there one interaction that might be of particular interest? Considering all interactions at once might add too much complexity to the model.
<details><summary>**Hints**</summary>
Add a new feature; it should be binary.
</details>
- Encoding the interaction (a sketch combining sections 3.1 and 3.2 follows the hints)
<details><summary>**Hints**</summary>
In linear regression an interaction between $x_1$ and $x_2$ is modelled as follows: $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$, $x_1 x_2$ being the interaction term and $\beta_3$ the associated parameter.
</details>
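One possible sketch combining sections 3.1 and 3.2, assuming the encoded dataframe `data_enc` still carries a numeric `bmi` column and a `smoker_yes` dummy (both names depend on your encoding choices), and taking 30 as the usual BMI obesity cut-off:

```python
# Derived binary feature: obesity flag from BMI (threshold of 30 assumed)
data_enc["obesity"] = (data_enc["bmi"] >= 30).astype(int)

# Interaction term: product of the two binary columns
data_enc["obesity_x_smoker"] = data_enc["obesity"] * data_enc["smoker_yes"]
```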
## 3.3 Retrain with the new model
- **Train the model and make predictions**
- **Evaluate the performance**
# 4. Leveraging the statsmodels package
## 4.1 OLS with Formula API
```python
import statsmodels.formula.api as smf
results1 = smf.ols('charges ~ age + sex + bmi + children + smoker + region + obesity', data=df2).fit()
results1.summary()
```
## 4.2 Model selection
- Add a model with the interaction term `+ obesity:smoker`
- Compare the models with AIC, BIC, and a likelihood-ratio test (LRT); a sketch follows the hints
<details><summary>**Hints**</summary>
compute the statistic: `D = -2 * (results1.llf - results2.llf)`
compute the difference in complexity: `df = results2.df_model - results1.df_model`
compute the p-value as the area under the pdf more extreme than the observed statistic
use `scipy.stats.chi2`
"area under the pdf more extreme than the observed statistic" ==> 1 - cdf
cdf: cumulative distribution function
pdf: probability density function
Do not forget that `df` is the degrees-of-freedom argument of the chi-squared law here.
</details>
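A comparison sketch, assuming `df2` and `results1` from section 4.1; `results2` adds the interaction term:

```python
import statsmodels.formula.api as smf
from scipy.stats import chi2

# Model with the interaction term
results2 = smf.ols(
    'charges ~ age + sex + bmi + children + smoker + region + obesity + obesity:smoker',
    data=df2,
).fit()

# Information criteria: lower is better
print("AIC:", results1.aic, results2.aic)
print("BIC:", results1.bic, results2.bic)

# Likelihood-ratio test between the nested models
D = -2 * (results1.llf - results2.llf)        # test statistic
df = results2.df_model - results1.df_model    # difference in complexity
p_value = chi2.sf(D, df)                      # 1 - cdf: upper-tail area
print("LRT p-value:", p_value)
```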
# 5. In-sample and out-of-sample errors
## 5.1 North vs South
- Split the dataset between south and north based on the region.
- Train a model on the south data only.
- Compare its performance on the two subsets using the sum of squared errors and R².
- What is the difference? Comment. (A sketch follows.)
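A sketch under the assumption that `df2` keeps the raw categorical columns (`sex`, `smoker`, `region`) plus the derived `obesity` flag; the `make_xy` helper below is hypothetical and simply repeats the earlier encoding and split:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def make_xy(frame):
    """Hypothetical helper: one-hot encode and split features/target."""
    enc = pd.get_dummies(frame.drop(columns="region"),
                         columns=["sex", "smoker"], drop_first=True)
    return enc.drop(columns="charges"), enc["charges"]

south = df2[df2["region"].str.startswith("south")]
north = df2[df2["region"].str.startswith("north")]

X_s, y_s = make_xy(south)
X_n, y_n = make_xy(north)

# Train on the south data only
model = LinearRegression().fit(X_s, y_s)

# In-sample (south) vs. out-of-sample (north) performance
for name, X_, y_ in [("south", X_s, y_s), ("north", X_n, y_n)]:
    pred = model.predict(X_)
    print(name, "SSE:", np.sum((y_ - pred) ** 2), "R2:", r2_score(y_, pred))
```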
## 5.2 Sample 1 vs Sample 2
- Do the same with two random samples of the data frame
<details><summary>**Hints**</summary>
use the `sample()` method of the pd dataframe object.
</details>
- Start with two samples of 100 rows.
- Then try 20 and 50.
- Each time, look at the sum of squared errors and R².
- Comment. (A brief sketch follows.)
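The same comparison with random samples, reusing the hypothetical `make_xy` helper and the imports from the sketch in 5.1:

```python
n = 100  # also try 20 and 50
sample1 = df2.sample(n=n, random_state=1)
sample2 = df2.sample(n=n, random_state=2)

X1, y1 = make_xy(sample1)
X2, y2 = make_xy(sample2)

# Train on sample 1 only, then evaluate on both samples
model = LinearRegression().fit(X1, y1)
for name, X_, y_ in [("sample 1", X1, y1), ("sample 2", X2, y2)]:
    pred = model.predict(X_)
    print(name, "SSE:", np.sum((y_ - pred) ** 2), "R2:", r2_score(y_, pred))
```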
# 6. Full ML workflow
- Familiarize yourself with the terms
- training dataset
- validation dataset
- test dataset
- adversarial test dataset
- Let us make a proper setup:
We expect `df2` to be the dataframe that includes the new `obesity` feature but has not been encoded yet.
```python
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error
df_adv = pd.get_dummies(df2[df2["region"] == "southeast"].drop(columns="region"), columns=['sex', 'smoker', "obesity"], drop_first=True)
df_trn = pd.get_dummies(df2[df2["region"] != "southeast"].drop(columns="region"), columns=['sex', 'smoker', "obesity"], drop_first=True)
X_adv = df_adv.drop(columns='charges')
y_adv = df_adv['charges']
# Splitting the data
X = df_trn.drop(columns='charges')
y = df_trn['charges']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
Let's define a full ML pipeline
```python
# Create the preprocessing steps using ColumnTransformer
preprocessor = ColumnTransformer(
transformers=[
('poly', PolynomialFeatures(include_bias=False), X.columns)
])
# Create the pipeline with SelectKBest feature selection
pipeline = Pipeline([
('preprocessor', preprocessor),
('feature_selection', SelectKBest(f_regression, k='all')),
('regressor', LinearRegression())
])
# Grid search including a parameter for number of features to select
param_grid = [
{
'regressor': [LinearRegression()],
'preprocessor__poly__degree': [], # TO COMPLETE
'feature_selection__k': [], # TO COMPLETE
},
{
'regressor': [], # TO COMPLETE find an alternative regressor model on sklearn
'preprocessor__poly__degree': [], # TO COMPLETE
'feature_selection__k': [5,10,"all"], # TO COMPLETE
}
]
```
- Use `GridSearchCV` to find the best hyperparameters (a fitting sketch follows the hints).
<details><summary>**Hints**</summary>
Create the object and use the `fit` method with training dataset.
</details>
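A fitting sketch, assuming the `TO COMPLETE` entries of `param_grid` have been filled in; the scoring choice is one option among several:

```python
# Cross-validated grid search over the pipeline hyperparameters
grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
    n_jobs=-1,
)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)
best_pipeline = grid_search.best_estimator_
```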
- Gather the best pipeline and the best parameters
<details><summary>**Hints**</summary>
best pipeline: `best_estimator_` attribute
best parameters: `best_params_` attribute
</details>
- Refit the best pipeline on the training dataset, and comment on why retraining is necessary.
- Evaluate the predictions on the different datasets (training, test, adversarial test) with the MSE (mean squared error) and R² scores. A final evaluation sketch follows.
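A final evaluation sketch, assuming `best_pipeline` from the grid search and the splits defined earlier:

```python
from sklearn.metrics import mean_squared_error, r2_score

# Refit on the training split so the final model is explicit
# (GridSearchCV with refit=True already refits on the full training data)
best_pipeline.fit(X_train, y_train)

for name, X_, y_ in [("train", X_train, y_train),
                     ("test", X_test, y_test),
                     ("adversarial", X_adv, y_adv)]:
    pred = best_pipeline.predict(X_)
    print(name, "MSE:", mean_squared_error(y_, pred), "R2:", r2_score(y_, pred))
```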