# Linear Regression
```{r}
#| echo: false
set.seed(1234)
source("_common.R")
```
This lab will go over how to perform linear regression. This will include [simple linear regression](03-linear-regression.qmd#simple-linear-regression) and [multiple linear regression](03-linear-regression.qmd#multiple-linear-regression), as well as how to apply transformations to the predictors. This chapter will use [parsnip](https://www.tidymodels.org/start/models/) for model fitting and [recipes and workflows](https://www.tidymodels.org/start/recipes/) to perform the transformations.
## Libraries
We load tidymodels, along with ISLR and MASS for their data sets.
```{r}
#| message: false
library(MASS) # For Boston data set
library(tidymodels)
library(ISLR)
```
## Simple linear regression
The `Boston` data set contains various statistics for 506 neighborhoods in Boston. We will build a simple linear regression model that relates the median value of owner-occupied homes (`medv`), as the response, to a variable indicating the percentage of the population that belongs to a lower status (`lstat`), as the predictor.
:::{.callout-important}
The `Boston` data set is quite outdated and contains some really unfortunate variables.
:::
We start by creating a parsnip specification for a linear regression model.
```{r}
lm_spec <- linear_reg() %>%
set_mode("regression") %>%
set_engine("lm")
```
While it is unnecessary to set the mode for a linear regression since it can only be regression, we continue to do it in these labs to be explicit.
The specification doesn't perform any calculations by itself. It is just a specification of what we want to do.
```{r}
lm_spec
```
Once we have the specification we can `fit()` it by supplying a formula expression and the data we want to fit the model on.
The formula is written in the form `y ~ x`, where `y` is the name of the response and `x` is the name of the predictors.
The names used in the formula should match the names of the variables in the data set passed to `data`.
```{r}
lm_fit <- lm_spec %>%
fit(medv ~ lstat, data = Boston)
lm_fit
```
The result of this fit is a parsnip model object. This object contains the underlying fit as well as some parsnip-specific information. If we want to look at the underlying fit object we can access it with `lm_fit$fit` or with
```{r}
lm_fit %>%
pluck("fit")
```
The `lm` object has a nice `summary()` method that shows more information about the fit, including parameter estimates and lack-of-fit statistics.
```{r}
lm_fit %>%
pluck("fit") %>%
summary()
```
We can use functions from the [broom](https://broom.tidymodels.org/) package to extract key information out of the model objects in tidy formats.
The `tidy()` function returns the parameter estimates of an `lm` object
```{r}
tidy(lm_fit)
```
and `glance()` can be used to extract the model statistics.
```{r}
glance(lm_fit)
```
Suppose that we like the model fit and want to generate predictions; we would typically use the `predict()` function like so:
```{r}
#| error: true
predict(lm_fit)
```
But this produces an error when used on a parsnip model object. This happens because we need to explicitly supply the data set that the predictions should be performed on via the `new_data` argument.
```{r}
predict(lm_fit, new_data = Boston)
```
Notice how the predictions are returned as a tibble. This will always be the case for parsnip models, no matter what engine is used. This is very useful, since the consistency lets us easily combine the predictions with other data sets.
We can also return other types of predictions by specifying the `type` argument. Setting `type = "conf_int"` returns a 95% confidence interval.
```{r}
predict(lm_fit, new_data = Boston, type = "conf_int")
```
:::{.callout-note}
Not all engines can return all types of predictions.
:::
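As another example of the `type` argument, the `lm` engine can also produce prediction intervals; a short sketch, assuming the engine supports `type = "pred_int"`:
```{r}
# Prediction intervals for new observations (wider than confidence intervals)
predict(lm_fit, new_data = Boston, type = "pred_int")
```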
If you want to evaluate the performance of a model, you might want to compare the observed values and the predicted values for a data set. You can do this by binding the predictions to the original data set:
```{r}
bind_cols(
predict(lm_fit, new_data = Boston),
Boston
) %>%
select(medv, .pred)
```
You can get the same results using the `augment()` function to save you a little bit of typing.
```{r}
augment(lm_fit, new_data = Boston) %>%
select(medv, .pred)
```
## Multiple linear regression
The multiple linear regression model can be fit in much the same way as the [simple linear regression](03-linear-regression.qmd#simple-linear-regression) model. The only difference is how we specify the predictors. We are using the same formula expression `y ~ x`, but we can specify multiple predictors by separating them with `+`s.
```{r}
lm_fit2 <- lm_spec %>%
fit(medv ~ lstat + age, data = Boston)
lm_fit2
```
Everything else works the same. From extracting parameter estimates
```{r}
tidy(lm_fit2)
```
to predicting new values
```{r}
predict(lm_fit2, new_data = Boston)
```
A shortcut when using formulas is to use the form `y ~ .`, which means: set `y` as the response and use the remaining variables as predictors. This is very useful if you have a lot of variables and you don't want to type them out.
```{r}
lm_fit3 <- lm_spec %>%
fit(medv ~ ., data = Boston)
lm_fit3
```
For more formula syntax look at `?formula`.
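For instance, `-` removes a variable from the predictor set; a small sketch that fits the model with every predictor except `age`:
```{r}
# `.` selects all remaining variables, `- age` then drops one of them
lm_spec %>%
  fit(medv ~ . - age, data = Boston)
```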
## Interaction terms
Adding interaction terms is quite easy to do using formula expressions. However, the syntax used to describe them isn't accepted by all engines, so we will go over how to include interaction terms using recipes as well.
There are two ways of including an interaction term: `x:y` and `x * y`.
- `x:y` will include the interaction between `x` and `y`,
- `x * y` will include the interaction between `x` and `y` as well as `x` and `y` individually, i.e. it is short for `x:y + x + y`.
With that out of the way, let us expand `lm_fit2` by adding an interaction term.
```{r}
lm_fit4 <- lm_spec %>%
fit(medv ~ lstat * age, data = Boston)
lm_fit4
```
Note that the interaction term is named `lstat:age`.
Sometimes we want to perform transformations, and we want those transformations to be applied as part of the model fit, as a pre-processing step. We will use the recipes package for this task.
We use `step_interact()` to specify the interaction term. Next, we create a workflow object to combine the linear regression model specification `lm_spec` with the pre-processing specification `rec_spec_interact`, which can then be fitted much like a parsnip model specification.
```{r}
rec_spec_interact <- recipe(medv ~ lstat + age, data = Boston) %>%
step_interact(~ lstat:age)
lm_wf_interact <- workflow() %>%
add_model(lm_spec) %>%
add_recipe(rec_spec_interact)
lm_wf_interact %>% fit(Boston)
```
Notice that since we specified the variables in the recipe, we don't need to specify them when fitting the workflow object. Furthermore, take note of the name of the interaction term. `step_interact()` tries to avoid special characters in variable names.
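To see the generated name directly, one option (a sketch, assuming current workflows behavior) is to pull out the parsnip fit from the fitted workflow and tidy it:
```{r}
# The interaction column should show up as `lstat_x_age`
lm_wf_interact %>%
  fit(Boston) %>%
  extract_fit_parsnip() %>%
  tidy()
```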
## Non-linear transformations of the predictors
Much like we used recipes to create interaction terms, we can apply transformations to individual variables as well. If you are familiar with the dplyr package, then you know `mutate()`; `step_mutate()` works in much the same way.
You want to keep as much of the pre-processing as possible inside recipes so that the transformations are applied consistently to new data.
```{r}
rec_spec_pow2 <- recipe(medv ~ lstat, data = Boston) %>%
step_mutate(lstat2 = lstat ^ 2)
lm_wf_pow2 <- workflow() %>%
add_model(lm_spec) %>%
add_recipe(rec_spec_pow2)
lm_wf_pow2 %>% fit(Boston)
```
You don't have to hand-craft every transformation yourself, since the recipes package already provides [many step functions](https://recipes.tidymodels.org/reference/index.html#section-step-functions-individual-transformations), such as `step_log()` to take logarithms of variables.
```{r}
rec_spec_log <- recipe(medv ~ lstat, data = Boston) %>%
step_log(lstat)
lm_wf_log <- workflow() %>%
add_model(lm_spec) %>%
add_recipe(rec_spec_log)
lm_wf_log %>% fit(Boston)
```
## Qualitative predictors
We will now turn our attention to the `Carseats` data set. We will attempt to predict `Sales` of child car seats in 400 locations based on a number of predictors. One of these variables is `ShelveLoc`, which is a qualitative predictor that indicates the quality of the shelving location. `ShelveLoc` takes on three possible values:
- Bad
- Medium
- Good
If you pass such a variable to `lm()`, it will read it and generate dummy variables automatically using the following convention:
```{r}
Carseats %>%
pull(ShelveLoc) %>%
contrasts()
```
So we have no problems including qualitative predictors when using `lm` as the engine.
```{r}
lm_spec %>%
fit(Sales ~ . + Income:Advertising + Price:Age, data = Carseats)
```
However, as with so many things, we cannot always guarantee that the underlying engine knows how to deal with qualitative variables. Recipes can be used to handle this as well. `step_dummy()` performs the same transformation, turning a qualitative variable with `C` levels into `C - 1` indicator variables.
While this might seem unnecessary right now, some of the engines used later on do not handle qualitative variables, and this step will be necessary. We are also using the `all_nominal_predictors()` selector to select all character and factor predictor variables. This lets us select by type rather than having to type out the names.
```{r}
rec_spec <- recipe(Sales ~ ., data = Carseats) %>%
step_dummy(all_nominal_predictors()) %>%
step_interact(~ Income:Advertising + Price:Age)
lm_wf <- workflow() %>%
add_model(lm_spec) %>%
add_recipe(rec_spec)
lm_wf %>% fit(Carseats)
```
## Writing functions
This book will not talk about how to write functions in R. If you still want to know how to write functions, we recommend the [Functions](https://r4ds.hadley.nz/functions.html) chapter of the [R for Data Science (2e)](https://r4ds.hadley.nz/).