-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathcm014_model_fitting_R.Rmd
211 lines (158 loc) · 5.94 KB
/
cm014_model_fitting_R.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
---
title: "cm014 Worksheet: The Model-Fitting Paradigm in R"
output: html_notebook
editor_options:
chunk_output_type: inline
---
```{r}
suppressPackageStartupMessages(library(tidyverse))
library(gapminder)
library(broom)
```
So you want to fit a model to your data. How can you achieve this with R?
Topics:
1. What _is_ model-fitting?
2. How do we fit a model in R?
3. How can we obtain tidy results from the model output?
## What is Model-Fitting?
When variables are not independent, then we can gain information about one variable if we know something about the other.
Examples: Use the scatterplot below:
1. A car weighs 4000 lbs. What can we say about its mpg?
2. A car weights less than 3000 lbs. What can we say about its mpg?
```{r, fig.width=5, fig.height=3}
library(tidyverse)
ggplot(mtcars, aes(wt, mpg)) +
geom_point() +
labs(x = "Weight (1000's of lbs)")
```
Example: What can we say about rear axle ratio if we know something about quarter mile time?
```{r, fig.width=5, fig.height=3}
ggplot(mtcars, aes(qsec, drat)) +
geom_point() +
labs(x = "Quarter mile time",
y = "Rear axle ratio")
```
If EDA isn't enough, we can answer these questions by fitting a model: a curve that predicts Y given X. Aka, a __regression curve__ or a __machine learning model__.
(There are more comprehensive models too, such as modelling entire distributions, but that's not what we're doing here)
There are typically two goals of fitting a model:
1. Make predictions.
2. Interpret variable relationships.
## Fitting a model in R
Model fitting methods tend to use a common format in R:
```
method(formula, data, options)
```
They also tend to have a common output: a special _list_.
__Method__:
A function such as:
- Linear Regression: `lm`
- Generalized Linear Regression: `glm`
- Local regression: `loess`
- Quantile regression: `quantreg::rq`
- ...
__Formula__:
In R, takes the form `y ~ x1 + x2 + ... + xp` (use column names in your data frame).
__Data__: The data frame.
__Options__: Specific to the method.
Exercise:
1. Fit a linear regression model to life expectancy ("Y") from year ("X") by filling in the formula. Notice what appears as the output.
2. On a new line, use the `unclass` function to uncover the object's true nature: a list. Note: it might be easier to use the `names` function to see what components are included in the list.
First, create a subset of the `gapminder` dataset containing only the country of `France
```{r}
gapminder_France <- gapminder %>%
filter(country=="France")
```
Now, using the `lm()` function we will create the linear model
```{r}
(my_lm <- lm(lifeExp~year,gapminder_France))
```
Does that mean that the life expectency at "year 0" was equal to -397.7646?!
We are interested in the modeling results around the modeling period which starts at year 1952. To get a meaniningful "interpretable" intercept we can use the `I()` function.
```{r}
(my_lm <- lm(lifeExp~I(year-1952),data=gapminder_France))
```
Use the `unclass()` function to take a look at how the `lm()` object actually looks like.
```{r}
unclass(my_lm)
```
To complicate things further, some info is stored in _another_ list after applying the `summary` function:
```{r}
summary(my_lm)
```
```{r}
lm_resid <- augment(my_lm)
lm_resid
ggplot(lm_resid,aes(.resid))+
geom_freqpoly(binwidth=0.5)
```
We can use the `predict()` function to make predictions from the model (default is to use fitting/training data). Here are the predictions:
```{r}
gapminder_France2 <- data.frame(year=seq(2000,2005))
predict(my_lm,newdata = gapminder_France2) %>%
head()
```
Or we can predict on a new dataset:
```{r}
years1 = gapminder_France2
predict(my_lm,years1)
```
```{r}
plot(my_lm)
```
We can plot models (with one predictor/ X variable) using `ggplot2` through the `geom_smooth()` layer. Specifying `method="lm"` gives us the linear regression fit (but only visually!):
```{r}
ggplot(gapminder, aes(gdpPercap, lifeExp)) +
geom_point() +
geom_smooth(method="lm") +
scale_x_log10()
```
Lets consider another country "Zimbabwe", which has a unique behavior in the `lifeExp` and `year` relationship.
```{r}
gapminder_Zimbabwe <- gapminder %>% filter(country=="Zimbabwe")
gapminder_Zimbabwe %>% ggplot(aes(year, lifeExp)) + geom_point()
```
Let's try fitting a linear model to this relationship
```{r}
ggplot(gapminder_Zimbabwe, aes(year,lifeExp)) + geom_point()+geom_smooth(method = "lm", se = F)
```
Now we will try to fit a second degree polynomial and see what would that look like.
```{r}
ggplot(gapminder_Zimbabwe,aes(year,lifeExp)) + geom_point()+geom_smooth(method="lm", formula=y~poly(I(x-1952),degree=2))
```
```{r}
lm_linear <- lm(data = gapminder,formula = FILL_THIS_IN)
lm_poly <- lm(data = gapminder,formula = FILL_THIS_IN))
```
`anova` lets you compare between different models.
```{r}
anova(lm_linear,lm_poly)
```
## Regression with categorical variables
```{r}
(lm_cat <- lm(gdpPercap ~ I(year - 1952) + continent, data = gapminder))
```
How did R know that continent was a categorical variable?
```{r}
class(gapminder$continent)
levels(gapminder$continent)
contrasts(gapminder$continent)
```
How can we change the reference level?
```{r}
gapminder$continent <- relevel(gapminder$continent, ref = "Oceania")
```
Let's build a new model
```{r}
lm_cat2 <- lm(gdpPercap ~ I(year - 1952) + continent, data = gapminder)
```
## Broom
Let's make it easier to extract info, using the `broom` package. There are three crown functions in this package, all of which input a fitted model, and outputs a tidy data frame.
1. `tidy`: extract statistical summaries about each component of the model.
- Useful for _interpretation_ task.
2. `augment`: add columns to the original data frame, giving information corresponding to each row.
- Useful for _prediction_ task.
3. `glance`: extract statistical summaries about the model as a whole (1-row tibble).
- Useful for checking goodness of fit.
Exercise: apply all three functions to our fitted model, `my_lm`. What do you see?
```{r}
```