model.Rmd

---
title: "NHL Stanley Cup Playoff Predictions"
author: "Ilari Scheinin"
output: html_document
---

Previous: [2. Process data](process.html)  
Next: [4. Make predictions](predict.html)

3. Train models
---------------

The summary statistics generated in the previous step are now combined with
outcomes of the playoffs series, and then used to train models. To facilitate
use of multiple types of statistical models, I am using the
[caret](https://github.com/topepo/caret)
[package](http://cran.r-project.org/web/packages/caret/), which provides a
unified interface for this purpose. And in order to use both CPU cores on my
laptop, I am using
[package doMC](http://cran.r-project.org/web/packages/doMC/).

```{r}
suppressMessages({
  # caret uses package plyr, but whenever dplyr and plyr are both loaded,
  # dplyr should be loaded after plyr. Hence plyr is loaded explicitly.
  library(plyr)
  library(dplyr)
  library(caret)
  library(doMC)
})
registerDoMC(cores=2)
```

Define a function to detect the winner of a series. It takes a vector of
winners of individual games and returns the overall winner.

```{r}
series_winner <- function(x) {
  counts <- sort(table(x), decreasing=TRUE)
  if ((length(counts) == 1) || (counts[1] > counts[2]))
    return(names(counts)[1])
  stop("I'm sorry, but I couldn't figure out the winner: ",
    paste(x, collapse=", "))
}
```

Define a function to append game statistics to the outcome of a playoff series.
Its behavior can be controlled with the argument `which` between these three
options:

* `overall`: `away team's overall performance - home team's overall performance`
* `single`: `away team's away performance - home team's home performance`
* `both`: `away team's away performance - home team's home performance` and
  `away team's home performance - home team's away performance`
   (default, and the one I am using here)

```{r}
add_stats <- function(games, gamestats, which=c("both", "single", "overall")) {
  which <- match.arg(which)
  if (which == "overall") {
    away <- left_join(games, gamestats[["overall"]],
      by=c("season", awayteam="team"))
    home <- left_join(games, gamestats[["overall"]],
      by=c("season", hometeam="team"))
  } else {
    away <- left_join(games, gamestats[["away"]],
      by=c("season", awayteam="team"))
    home <- left_join(games, gamestats[["home"]],
      by=c("season", hometeam="team"))
  }

  if (which == "both") {
    away2 <- left_join(games, gamestats[["home"]],
      by=c("season", awayteam="team"))
    home2 <- left_join(games, gamestats[["away"]],
      by=c("season", hometeam="team"))
  }

  games$goals <- away$goals - home$goals
  games$shots <- away$shots - home$shots
  games$faceoffs <- away$faceoffs - home$faceoffs
  games$penalties <- away$penalties - home$penalties
  games$pp <- away$pp - home$pk
  games$pk <- away$pk - home$pp

  if (which == "both") {
    games$goals2 <- away2$goals - home2$goals
    games$shots2 <- away2$shots - home2$shots
    games$faceoffs2 <- away2$faceoffs - home2$faceoffs
    games$penalties2 <- away2$penalties - home2$penalties
    games$pp2 <- away2$pp - home2$pk
    games$pk2 <- away2$pk - home2$pp
  }

  games
}
```

Load data and append summary statistics.

```{r}
load(file.path("source-data", "nhlscrapr-core.RData"))
rm(list=c("roster.master", "roster.unique"))

gamestats <- readRDS("processed.rds")

games <- tbl_df(games)
games <- games %>%
  filter(status != 0, session == "Playoffs", season != "20142015") %>%
  mutate(awayscore=as.integer(awayscore), homescore=as.integer(homescore))

playoffs <- games %>%
  mutate(winner=ifelse(awayscore > homescore, awayteam, hometeam)) %>%
  group_by(season, round=substring(gcode, 3, 3),
    series=substring(gcode, 4, 4)) %>%
  summarise(awayteam=first(awayteam), hometeam=first(hometeam),
    winner=series_winner(winner)) %>%
  ungroup() %>%
  select(season, awayteam, hometeam, winner) %>%
  mutate(winner=as.factor(ifelse(awayteam == winner, "away", "home")))

playoffs <- add_stats(playoffs, gamestats)

playoffs <- playoffs %>%
  select(-season, -awayteam, -hometeam)

```

The training data looks like this:

```{r, eval=FALSE}
head(playoffs)
```

```{r, echo=FALSE, results='asis'}
knitr::kable(head(playoffs), digits=3)
```

Next, define parameters for the model training. I am preprocessing the data
with centering and scaling, and then using a 10-fold cross-validation repeated
10 times for parameter tuning. For each parameter, five different values are
evaluated via the cross-validation, and the combination with the best overall
accuracy chosen. A final model is then fitted with all of the training data and
the chosen parameter values.

In order to make the analysis reproducible, sets of random seeds are generated
to be used at each point of training.

```{r}
method <- "repeatedcv"
number <- 10
repeats <- 10
preProcess <- c("center", "scale")
tuneLength <- 5
metric <- "Accuracy"
maxParameters <- 5
seeds <- vector(mode="list", length=repeats*number+1)
for (i in seq_along(seeds))
  seeds[[i]] <- (1000*i+1):(1000*i+1+tuneLength^maxParameters)

fitControl <- trainControl(method=method, number=number, repeats=repeats,
  seeds=seeds)
```

Define a function that trains the models.

```{r}
train_model <- function(method) {
  message("Training model: ", method, "...", appendLF=FALSE)
  set.seed(7474505)
  suppressMessages({ captured <- capture.output({
    fit <- train(winner ~ ., data=playoffs,
      method=method, trControl=fitControl, preProcess=preProcess,
      metric=metric, tuneLength=tuneLength)
  })})
  message()
  fit
}
```

Define which models we want to include, and then train them. I am including
generalized linear model, linear discriminant analysis, neural network, random
forests, and a support vector machine with a linear kernel. Each one of them
undergoes the cross-validation for parameter tuning, and a final model is built
with all of the training data.

Afterwards, save the resulting five final models.

```{r}
methods <- c("glm", "lda", "nnet", "rf", "svmLinear")
models <- lapply(methods, train_model)
names(models) <- methods

saveRDS(models, "models.rds")
```

Next: [4. Make predictions](predict.html)  
Previous: [2. Process data](process.html)