-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy path09-multipleregression.Rmd
152 lines (106 loc) · 8.05 KB
/
09-multipleregression.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
# Multiple regression
Last chapter, we looked at correlations and linear regression to predict how one element of a game would predict the score. But we know that a single variable, in all but the rarest instances, is not going to be that predictive. We need more than one. Enter multiple regression. Multiple regression lets us add -- wait for it -- multiple predictors to our equation to help us get a better fit to reality.
That presents it's own problems. So let's get set up. The dataset we'll use is all college football games since the 2011 season.
```{r echo=FALSE, class.output="bg-info", results="asis", message=FALSE, warning=FALSE}
library(downloadthis)
library(glue)
dllink <- download_link(
link = "http://mattwaite.github.io/sportsdatafiles/footballlogs1120.csv",
button_label = "Download csv file",
button_type = "danger",
has_icon = TRUE,
icon = "fa fa-save",
self_contained = FALSE
)
glue("<pre><p><strong>For this walkthrough:</strong></p><p>{dllink}</p></pre>")
```
We need the tidyverse.
```{r message=FALSE, warning=FALSE}
library(tidyverse)
```
And the data.
```{r}
logs <- read_csv("data/footballlogs1120.csv")
```
One way to show how successful a footballl team was for a game is to show the differential between the team's score and the opponent's score. Score a lot more than the opponent = good, score a lot less than the opponent = bad. And, relatively speaking, the more the better. So let's create that differential. Let's also get our net yardage stat back. And because we'll need it later, let's add the turnover margin.
```{r}
logs <- logs %>% mutate(
Differential = TeamScore - OpponentScore,
NetYards = OffensiveYards - DefYards,
TurnoverMargin = DefTotalTurnovers - TotalTurnovers)
```
The linear model code we used before is pretty straight forward. Its `field` is predicted by `field`. Here's a simple linear model that looks at predicting a team's point differential by looking at their net yards.
```{r}
yards <- lm(Differential ~ NetYards, data=logs)
summary(yards)
```
Remember: There's a lot here, but only some of it we care about. What is the Adjusted R-squared value? What's the p-value and is it less than .05? In this case, we can predict 67 percent of the difference in differential with the net yardage in the game.
To add more predictors to this mix, we merely add them. But it's not that simple, as you'll see in a moment. So first, let's look at adding how well the other team shot to our prediction model:
```{r}
model1 <- lm(Differential ~ NetYards + TurnoverMargin, data=logs)
summary(model1)
```
First things first: What is the adjusted R-squared?
Second: what is the p-value and is it less than .05?
Third: Compare the residual standard error. We went from 13.12 to 10.55. The meaning of this is both really opaque and also simple -- by adding data, we reduced the amount of error in our model. Residual standard error is the total distance between what our model would predict and what we actually have in the data. So lots of residual error means the distance between reality and our model is wider. So the width of our predictive range in this example shrank while we improved the amount of the difference we could predict. That's good, and not always going to be the case.
One of the more difficult things to understand about multiple regression is the issue of multicollinearity. What that means is that there is significant correlation overlap between two variables -- the two are related to each other as well as to the target output -- and all you are doing by adding both of them is adding error with no real value to the R-squared. In pure statistics, we don't want any multicollinearity at all. Violating that assumption limits the applicability of what you are doing. So if we have some multicollinearity, it limits our scope of application to college football We can't say this will work for every football league and level everywhere. What we need to do is see how correlated each value is to each other and throw out ones that are highly co-correlated.
So to find those, we have to create a correlation matrix that shows us how each value is correlated to our outcome variable, but also with each other. We can do that in the `Hmisc` library. We install that in the console with `install.packages("Hmisc")`
```{r message=FALSE, warning=FALSE}
library(Hmisc)
```
We can pass in every numeric value to the Hmisc library and get a correlation matrix out of it, but since we have a large number of values -- and many of them character values -- we should strip that down and reorder them. So that's what I'm doing here. I'm saying give me differential first, and then columns 9-24, and then 26-41. Why the skip? There's a blank column in the middle of the data -- a remnant of the scraper I used.
```{r}
simplelogs <- logs %>% select_if(is.numeric) %>% select(-Game) %>% select(Differential, NetYards, TurnoverMargin, everything())
```
Before we proceed, what we're looking to do is follow the Differential column down, looking for correlation values near 1 or -1. Correlations go from -1, meaning perfect negative correlation, to 0, meaning no correlation, to 1, meaning perfect positive correlation. So we're looking for numbers near 1 or -1 for their predictive value. BUT: We then need to see if that value is also highly correlated with something else. If it is, we have a decision to make.
We get our correlation matrix like this:
```{r}
cormatrix <- rcorr(as.matrix(simplelogs))
cormatrix$r
```
Notice right away -- NetYards is highly correlated. But NetYards's also highly correlated with RushingYards, OffensiveYards and DefYards. And that makes sense: those things all feed into NetYards. Including all of these measures would be pointless -- they would add error without adding much in the way of predictive power.
> **Your turn**: What else do you see? What other values have predictive power and aren't co-correlated?
We can add more just by simply adding them. Let's add the average yard per play for both offense and defense. They're correlated to NetYards, but not as much as you might expect.
```{r}
model2 <- lm(Differential ~ NetYards + TurnoverMargin + DefAvg + OffenseAvg, data=logs)
summary(model2)
```
Go down the list:
What is the Adjusted R-squared now?
What is the p-value and is it less than .05?
What is the Residual standard error?
The final thing we can do with this is predict things. Look at our coefficients table. See the Estimates? We can build a formula from that, same as we did with linear regressions.
How does this apply in the real world? Let's pretend for a minute that you are Scott Frost, and you have a mess on your hands. Your job is to win conference titles. To do that, we need to know what attributes of a team should we emphasize. We can do that by looking at what previous Big Ten conference champions looked like.
So if our goal is to predict a conference champion team, we need to know what those teams did. Here's the regular season conference champions in this dataset.
```{r}
logs %>%
filter(Team == "Ohio State" & Season == 2020 | Team == "Ohio State" & Season == 2019 | Team == "Ohio State" & Season == 2018) %>%
summarise(
meanNetYards = mean(NetYards),
meanTurnoverMargin = mean(TurnoverMargin),
meanDefAvg = mean(DefAvg),
meanOffenseAvg = mean(OffenseAvg)
)
```
Now it's just plug and chug.
```{r}
(0.0547465*195.8824) + (3.8806793*0.6764706) + (-3.9374905*5.044118 ) + (3.9152803*6.908824) + 0.4960303
```
So a team with those numbers is going to average scoring 21 more points per game than their opponent. Sound like Ohio State in the last three years?
How does that compare to Nebraska this season?
```{r}
logs %>%
filter(
Team == "Nebraska" & Season == 2020
) %>%
summarise(
meanNetYards = mean(NetYards),
meanTurnoverMargin = mean(TurnoverMargin),
meanDefAvg = mean(DefAvg),
meanOffenseAvg = mean(OffenseAvg)
)
```
```{r}
(0.0547465*5) + (3.8806793*-1.375) + (-3.9374905*5.4375) + (3.9152803*5.5375) + 0.4960303
```
By this model, it predicted we would average being outscored by our opponents by 4.3 points over the season. Reality? We were outscored by 6.25 on average.