-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathmodel.Rmd
198 lines (163 loc) · 6.08 KB
/
model.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
---
title: "NHL Stanley Cup Playoff Predictions"
author: "Ilari Scheinin"
output: html_document
---
Previous: [2. Process data](process.html)
Next: [4. Make predictions](predict.html)
3. Train models
---------------
The summary statistics generated in the previous step are now combined with
outcomes of the playoffs series, and then used to train models. To facilitate
use of multiple types of statistical models, I am using the
[caret](https://github.com/topepo/caret)
[package](http://cran.r-project.org/web/packages/caret/), which provides a
unified interface for this purpose. And in order to use both CPU cores on my
laptop, I am using
[package doMC](http://cran.r-project.org/web/packages/doMC/).
```{r}
suppressMessages({
# caret uses package plyr, but whenever dplyr and plyr are both loaded,
# dplyr should be loaded after plyr. Hence plyr is loaded explicitly.
library(plyr)
library(dplyr)
library(caret)
library(doMC)
})
registerDoMC(cores=2)
```
Define a function to detect the winner of a series. It takes a vector of
winners of individual games and returns the overall winner.
```{r}
series_winner <- function(x) {
counts <- sort(table(x), decreasing=TRUE)
if ((length(counts) == 1) || (counts[1] > counts[2]))
return(names(counts)[1])
stop("I'm sorry, but I couldn't figure out the winner: ",
paste(x, collapse=", "))
}
```
Define a function to append game statistics to the outcome of a playoff series.
Its behavior can be controlled with the argument `which` between these three
options:
* `overall`: `away team's overall performance - home team's overall performance`
* `single`: `away team's away performance - home team's home performance`
* `both`: `away team's away performance - home team's home performance` and
`away team's home performance - home team's away performance`
(default, and the one I am using here)
```{r}
add_stats <- function(games, gamestats, which=c("both", "single", "overall")) {
which <- match.arg(which)
if (which == "overall") {
away <- left_join(games, gamestats[["overall"]],
by=c("season", awayteam="team"))
home <- left_join(games, gamestats[["overall"]],
by=c("season", hometeam="team"))
} else {
away <- left_join(games, gamestats[["away"]],
by=c("season", awayteam="team"))
home <- left_join(games, gamestats[["home"]],
by=c("season", hometeam="team"))
}
if (which == "both") {
away2 <- left_join(games, gamestats[["home"]],
by=c("season", awayteam="team"))
home2 <- left_join(games, gamestats[["away"]],
by=c("season", hometeam="team"))
}
games$goals <- away$goals - home$goals
games$shots <- away$shots - home$shots
games$faceoffs <- away$faceoffs - home$faceoffs
games$penalties <- away$penalties - home$penalties
games$pp <- away$pp - home$pk
games$pk <- away$pk - home$pp
if (which == "both") {
games$goals2 <- away2$goals - home2$goals
games$shots2 <- away2$shots - home2$shots
games$faceoffs2 <- away2$faceoffs - home2$faceoffs
games$penalties2 <- away2$penalties - home2$penalties
games$pp2 <- away2$pp - home2$pk
games$pk2 <- away2$pk - home2$pp
}
games
}
```
Load data and append summary statistics.
```{r}
load(file.path("source-data", "nhlscrapr-core.RData"))
rm(list=c("roster.master", "roster.unique"))
gamestats <- readRDS("processed.rds")
games <- tbl_df(games)
games <- games %>%
filter(status != 0, session == "Playoffs", season != "20142015") %>%
mutate(awayscore=as.integer(awayscore), homescore=as.integer(homescore))
playoffs <- games %>%
mutate(winner=ifelse(awayscore > homescore, awayteam, hometeam)) %>%
group_by(season, round=substring(gcode, 3, 3),
series=substring(gcode, 4, 4)) %>%
summarise(awayteam=first(awayteam), hometeam=first(hometeam),
winner=series_winner(winner)) %>%
ungroup() %>%
select(season, awayteam, hometeam, winner) %>%
mutate(winner=as.factor(ifelse(awayteam == winner, "away", "home")))
playoffs <- add_stats(playoffs, gamestats)
playoffs <- playoffs %>%
select(-season, -awayteam, -hometeam)
```
The training data looks like this:
```{r, eval=FALSE}
head(playoffs)
```
```{r, echo=FALSE, results='asis'}
knitr::kable(head(playoffs), digits=3)
```
Next, define parameters for the model training. I am preprocessing the data
with centering and scaling, and then using a 10-fold cross-validation repeated
10 times for parameter tuning. For each parameter, five different values are
evaluated via the cross-validation, and the combination with the best overall
accuracy chosen. A final model is then fitted with all of the training data and
the chosen parameter values.
In order to make the analysis reproducible, sets of random seeds are generated
to be used at each point of training.
```{r}
method <- "repeatedcv"
number <- 10
repeats <- 10
preProcess <- c("center", "scale")
tuneLength <- 5
metric <- "Accuracy"
maxParameters <- 5
seeds <- vector(mode="list", length=repeats*number+1)
for (i in seq_along(seeds))
seeds[[i]] <- (1000*i+1):(1000*i+1+tuneLength^maxParameters)
fitControl <- trainControl(method=method, number=number, repeats=repeats,
seeds=seeds)
```
Define a function that trains the models.
```{r}
train_model <- function(method) {
message("Training model: ", method, "...", appendLF=FALSE)
set.seed(7474505)
suppressMessages({ captured <- capture.output({
fit <- train(winner ~ ., data=playoffs,
method=method, trControl=fitControl, preProcess=preProcess,
metric=metric, tuneLength=tuneLength)
})})
message()
fit
}
```
Define which models we want to include, and then train them. I am including
generalized linear model, linear discriminant analysis, neural network, random
forests, and a support vector machine with a linear kernel. Each one of them
undergoes the cross-validation for parameter tuning, and a final model is built
with all of the training data.
Afterwards, save the resulting five final models.
```{r}
methods <- c("glm", "lda", "nnet", "rf", "svmLinear")
models <- lapply(methods, train_model)
names(models) <- methods
saveRDS(models, "models.rds")
```
Next: [4. Make predictions](predict.html)
Previous: [2. Process data](process.html)