Location of our Fantasy Hockey draft code to teach data science.
Stay tuned! We will be updating this with our methodologies, as well as the following planned updates, as we get to them:
- Implementation of "live" team trading and maintenance so the robots can drop/add players as they see fit with a stream of data
- Implementation of a version of this where we take subject matter expert opinion into account as well
- Implementation of a version using machine learning rather than the current classical approach of constrained optimization
- Implementation of our own player performance estimations
- Add more human players
In this repository we make use of Markowitz Portfolio Theory in order to select the best "portfolio" of players at a given time. In the sections below we use the following notational conventions:
- Vectors will be shown in lower-case bold latin characters
- Matrices will be shown in upper-case latin characters
- Constants will be shown with greek characters.
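Markowitz portfolio theory defines the optimal portfolio to be the solution to
$$\max_{\mathbf{x}} \;\; \mathbf{r}^T \mathbf{x} - \gamma \mathbf{x}^T \mathbf{Q} \mathbf{x} \;\;\;\;\;\;\; (1)$$
where $\mathbf{x}$ is the player vector, $\mathbf{r}$ is the vector of expected returns, $\mathbf{Q}$ is the covariance matrix of those returns, and $\gamma$ is the risk tolerance parameter.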
In the scope of maximization problems like in equation 1, we also have to consider that the set of players
In this case, the choices may change from fantasy league to fantasy league - but in this example the table below lays out the constraints in how we may choose players for our fantasy teams
Position | Minimum | Maximum |
---|---|---|
Centers | 2 | N/A |
Left/Right Wingers | 2/2 | N/A / N/A |
Defence | 4 | N/A |
Goalies | 2 | 3 |
Team Size | 17 | 17 |
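From the table above, we notice that each team consists of 17 players, we need between 2 and 3 goalies, and we are required to have at least 2 centers, 2 each of left and right wingers, and 4 defence players. Mathematically, combining these with equation 1, we have
$$\begin{aligned}
\max_{\mathbf{x}} & \;\; \mathbf{r}^T \mathbf{x} - \gamma \mathbf{x}^T \mathbf{Q} \mathbf{x} \\
\text{subject to} & \;\; 2 \leq \sum_{i \in \text{G}} \mathbf{x}_i \leq 3 \\
& \;\; \sum_{i \in \text{LW}} \mathbf{x}_i \geq 2 \\
& \;\; \sum_{i \in \text{RW}} \mathbf{x}_i \geq 2 \\
& \;\; \sum_{i \in \text{C}} \mathbf{x}_i \geq 2 \\
& \;\; \sum_{i \in \text{D}} \mathbf{x}_i \geq 4 \\
& \;\; \sum_{i \in \text{T}} \mathbf{x}_i = 17 \\
& \;\; \mathbf{x} \in \left\{0 , 1 \right\}^n
\end{aligned} \;\;\;\;\;\;\; (2)$$
where $\text{G}$, $\text{LW}$, $\text{RW}$, $\text{C}$ and $\text{D}$ are the sets of goalies, left wingers, right wingers, centers and defence players, $\text{T}$ is the set of all players, and $n$ is the number of players to choose from.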
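Here it is important to note that our final constraint in equation 2 denotes that the elements of our player vector $\mathbf{x}$ are binary: $\mathbf{x}_i = 1$ if player $i$ is on our team, and $\mathbf{x}_i = 0$ otherwise.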
What equation 2 represents is a formal statement that the solution to the maximization problem should yield an "optimal" team, up to some risk parameter, which should see the maximum returns with respect to fantasy points.
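To make the structure of equation 2 concrete, here is a minimal sketch that solves a toy version by exhaustive search. It is not how a full league would be solved (a real roster needs a proper mixed-integer solver), and the function names, the six-player league and all of its numbers are hypothetical and chosen only for illustration.

```python
from itertools import combinations

def objective(team, r, Q, gamma):
    """Markowitz objective r^T x - gamma * x^T Q x for the players in `team`."""
    expected = sum(r[i] for i in team)
    risk = sum(Q[i][j] for i in team for j in team)
    return expected - gamma * risk

def best_team(r, Q, positions, size, minimums, maximums, gamma):
    """Try every team of `size` players and keep the best one that
    satisfies the per-position minimums and maximums."""
    best, best_value = None, float("-inf")
    for team in combinations(range(len(r)), size):
        counts = {}
        for i in team:
            counts[positions[i]] = counts.get(positions[i], 0) + 1
        if any(counts.get(p, 0) < m for p, m in minimums.items()):
            continue
        if any(counts.get(p, 0) > m for p, m in maximums.items()):
            continue
        value = objective(team, r, Q, gamma)
        if value > best_value:
            best, best_value = team, value
    return best, best_value

# Hypothetical toy league: 6 players, teams of 3,
# exactly 1 goalie and at least 1 defence player.
r = [3.0, 2.5, 2.0, 1.5, 1.0, 0.8]          # mean returns
positions = ["C", "C", "D", "D", "G", "G"]
variances = [1.0, 0.2, 0.3, 0.1, 0.6, 0.1]  # diagonal covariance for simplicity
Q = [[variances[i] if i == j else 0.0 for j in range(6)] for i in range(6)]

team, value = best_team(r, Q, positions, 3, {"G": 1, "D": 1}, {"G": 1}, gamma=0.5)
print(team, round(value, 2))
```

In this toy league the lower-scoring goalie is chosen because the risk term penalizes the other goalie's larger variance, which is the trade-off that $\gamma$ controls.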
You may have noticed that we have mentioned that the solution to equations 1 and 2 both rely on us to be able to have some metric
This is far from the best solution, and a more sports-minded group would invest a great deal of time into finding the "perfect" estimates of the returns and covariance, and would also incorporate subject matter expertise. From an exploration and education point of view, however, this approach is at least reasonable (well... to us... sports laymen), and has the advantage that we don't have to invest too much time in order to calculate it. Certainly, we can always revisit the returns vectors and covariance once we have a working solution.
However, to define these quantities formally, we have for the
where
where
What is important to note in this instance, however, is that order is important. Here we have chosen to index by game number - but an important improvement could be made by indexing this matrix by each game played. This would allow us to view correlations between players across teams more accurately, and is a future step in this project.
One subtlety of the above approach is that in order to calculate covariance each player needs to play the same number of games. In other words, we need to have 82 data points (the number of games in an NHL season) for each player we wish to include, for each season that we're including them. This is a bit of a problem, as it is incredibly rare for a player to play every game in a season. So this leaves us with an important question:
How do we deal with missing data?
Much like when choosing how to measure returns, we also need to choose how to deal with this missing data. In a perfect world, we would "impute" these values - essentially sampling a player's distribution and filling in missing values with data that should be statistically relevant. However, this approach has a few problems, one of them being very technical, and one being very practical.
The technical problem is, of course, that we have no idea how to properly sample this distribution. Players certainly do not act independently - a proper sampling would require a multidimensional distribution that treats the relationships between players, particularly those on the same line, and perhaps rivals, appropriately for a realistic sample. On the practical side of things, we're looking for a set of players which gives us maximum returns - and a player that did not play gave us exactly zero returns. So, if a player doesn't play often - that's important to know. As we're interested in maximizing returns, we have chosen to zero fill the returns of every player for each game they have missed.
In this case, each player will have exactly 82 returns "measurements" throughout the season, set by us to zero for games they did not play. These measurements will allow us to both calculate our returns vectors
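As an illustration of the zero-filling step, the sketch below pads each player's record to a fixed number of games and then computes a mean returns vector and a covariance matrix from the padded data. The function names and the four-game "season" are hypothetical, and the use of the sample covariance (dividing by the number of games minus one) is our assumption for this sketch rather than something fixed by the text above.

```python
def zero_fill(points_by_game, n_games):
    """Return one value per game, using 0.0 for every game the player missed."""
    return [float(points_by_game.get(g, 0.0)) for g in range(1, n_games + 1)]

def mean_returns(returns):
    """Mean fantasy points per game for each player (one row per player)."""
    return [sum(row) / len(row) for row in returns]

def covariance_matrix(returns):
    """Sample covariance between every pair of players, indexed by game number."""
    means = mean_returns(returns)
    n = len(returns[0])
    return [
        [
            sum((a - ma) * (b - mb) for a, b in zip(row_a, row_b)) / (n - 1)
            for row_b, mb in zip(returns, means)
        ]
        for row_a, ma in zip(returns, means)
    ]

# Hypothetical 4-game season: game number -> fantasy points.
# player_a missed game 3; player_b missed games 1 and 4.
raw = {
    "player_a": {1: 2, 2: 4, 4: 2},
    "player_b": {2: 1, 3: 3},
}
returns = [zero_fill(raw[name], 4) for name in ("player_a", "player_b")]
r = mean_returns(returns)
Q = covariance_matrix(returns)
print(returns)
print(r)
print(Q)
```

For a real season `n_games` would be 82, so every row has the same length and the covariance is well defined for any pair of players.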
Equation 2 represents how to find the optimal team with no competition - you will choose the best possible team because you can pick who you want, when you want. Unfortunately, this is not the case in a draft scenario. You get to pick one person at a time, and if someone else takes a player you're interested in? Hope you have a backup plan. How do we change equation 2 so that we can account for this?
Well, it turns out that this is more difficult to write in a closed form, and we now have to create an algorithm that we can use to adjust and adapt to other people taking from our discrete set of players
$$\begin{aligned}
\max_{\mathbf{x}} & \;\; \mathbf{r}^T \mathbf{x} - \gamma \mathbf{x}^T \mathbf{Q} \mathbf{x} \\
\text{subject to} & \;\; 2 \leq \sum_{i \in \text{G}} \mathbf{x}_i \leq 3 \\
& \;\; \sum_{i \in \text{LW}} \mathbf{x}_i \geq 2 \\
& \;\; \sum_{i \in \text{RW}} \mathbf{x}_i \geq 2 \\
& \;\; \sum_{i \in \text{C}} \mathbf{x}_i \geq 2 \\
& \;\; \sum_{i \in \text{D}} \mathbf{x}_i \geq 4 \\
& \;\; \sum_{i \in \text{T}} \mathbf{x}_i = 17 \\
& \;\; \mathbf{x} \in \left\{0 , 1 \right\}^n \\
& \;\; \sum_{i \in T_c} \mathbf{x}_i = \alpha \\
& \;\; \sum_{i \in O_c} \mathbf{x}_i = 0.
\end{aligned} \;\;\;\;\;\;\; (3)$$
Where here
$$ \sum_{i \in T_c} \mathbf{x}_i = \alpha \;\;\;\;\;\;\; (4)$$
$$ \sum_{i \in O_c} \mathbf{x}_i = 0 \;\;\;\;\;\;\; (5)$$
so we may reference them directly in algorithm (1) below.
Algorithm(draft): (1)
- Choose a value for risk tolerance $\gamma$
- Decide on team size MAX
- WHILE $\alpha \neq$ MAX DO:
Algorithm 1 requires us to solve the optimization problem (3) MAX times. This may seem wasteful: as we're only choosing a single player each time, why solve the entire problem and generate what is essentially a new team every time?
The answer to that is not as straightforward as you may think, but primarily it is because a Markowitz style portfolio optimization is optimizing the entire set. We're looking for the players that will have the highest returns on average when they're "working together" - that is, our optimal solution depends on our entire team, not just a single player.
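The loop can be sketched roughly as below. This is only an illustration under several assumptions of ours: `solve` is a brute-force stand-in for problem (3) that omits the position constraints for brevity, we read $T_c$ as the players already on our team and $O_c$ as the players taken by opponents, a single opponent's picks are supplied as a list, and the single player kept each round is simply the one with the highest mean return. All names and numbers are hypothetical.

```python
from itertools import combinations

def solve(r, Q, gamma, size, mine, taken):
    """Brute-force stand-in for problem (3): the best team of `size` players
    that keeps everyone in `mine` and uses nobody in `taken`."""
    free = [i for i in range(len(r)) if i not in mine and i not in taken]
    best, best_value = None, float("-inf")
    for extra in combinations(free, size - len(mine)):
        team = sorted(mine) + list(extra)
        value = sum(r[i] for i in team) - gamma * sum(
            Q[i][j] for i in team for j in team
        )
        if value > best_value:
            best, best_value = team, value
    return best

def draft(r, Q, gamma, size, opponent_picks):
    """Alternate picks with one opponent until our team has `size` players."""
    mine, taken = set(), set()
    opponent_picks = iter(opponent_picks)
    while len(mine) != size:  # WHILE alpha != MAX
        team = solve(r, Q, gamma, size, mine, taken)
        candidates = [i for i in team if i not in mine]
        # Keep a single player from the optimal team (here: highest mean return).
        mine.add(max(candidates, key=lambda i: r[i]))
        # The opponent then removes a player from the available pool.
        rival = next(opponent_picks, None)
        if rival is not None and rival not in mine:
            taken.add(rival)
    return sorted(mine)

# Hypothetical data: player 0 scores the most but is very volatile.
r = [5.0, 4.0, 3.0, 2.0, 1.0]
variances = [4.0, 0.1, 0.1, 0.1, 0.1]
Q = [[variances[i] if i == j else 0.0 for j in range(5)] for i in range(5)]

my_team = draft(r, Q, gamma=1.0, size=2, opponent_picks=[2, 0])
print(my_team)
```

Note how the second round re-solves the whole problem after the opponent takes player 2, so the replacement is chosen with the rest of our team in mind rather than in isolation.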
Of course, in step 2 of the main loop of algorithm (1) we have left unspecified how to choose the single player that we're keeping in this round of the draft. This is another one of those stages where we need to decide for ourselves how to choose a single player to include in our team out of the entire optimal team we have found. That is discussed in the next section.
There are many metrics by which we can choose a single player and add them to our teams set of players
- Choose an available player $x_i$ whose average returns $r_i$ are highest from the solution of equation (3).
- Choose an available player $x_i$ whose RMS of returns is highest from the solution of equation (3).
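The two metrics above can behave quite differently, as the following sketch shows. The player names and per-game returns are made up for illustration, and `choose_player` is a hypothetical helper rather than a function from this repository.

```python
from math import sqrt

def mean(values):
    """Average return per game."""
    return sum(values) / len(values)

def rms(values):
    """Root mean square of the per-game returns."""
    return sqrt(sum(v * v for v in values) / len(values))

def choose_player(candidates, returns, metric):
    """Return the candidate whose per-game returns score highest under `metric`."""
    return max(candidates, key=lambda name: metric(returns[name]))

# Hypothetical per-game returns for two players in the optimal team.
returns = {
    "steady": [2.0, 2.0, 2.0, 2.0],   # mean 2.0, RMS 2.0
    "streaky": [0.0, 0.0, 0.0, 6.0],  # mean 1.5, RMS 3.0
}
candidates = ["steady", "streaky"]

by_mean = choose_player(candidates, returns, mean)
by_rms = choose_player(candidates, returns, rms)
print(by_mean, by_rms)
```

Because the RMS weights large single-game returns more heavily, it tends to favour streaky players, while the mean favours consistent ones.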
These two options are the easiest to implement and worth exploring, however, with all the computational power at hand - we may also be interested in trying to optimize once again in order to find the best player to choose. For example, if we again use a Markowitz style optimization using only the players from our specific team
$$\underset{\mathbf{x}}{\text{argmax}} \left\{\; \max_\mathbf{x} \; \; \mathbf{r}^T \mathbf{x} -\beta \mathbf{x}^T \mathbf{Q} \mathbf{x} \right\}_{\mathbf{x,r, Q} \in T_{C_k}}. \;\;\;\;\;\; (6)$$
Where in equation (6) above, the player vector
Coming soon
Coming soon.
In this case, we have a slightly different problem to solve. Rather than solving several instances of an optimization problem using limited resources between players, here everyone is free to choose whatever player they want - subject to division, position, and a points value cap. Formally,
where
For the Sportsnet contest, each player has a value of 1-4, and we have to assemble a team using 30 points or less. We also need to make sure we have players from each conference, which is represented in the additional constraints. Besides those new additions, this is actually an "easier" problem than the draft, as we only have to choose one team (and then hope for the best). One thing that should be noted, however, is that in this contest the point value system is different. Here, the points are described in the table below
Table 2: here is the point value system used to score each player in the Sportsnet fantasy draft
Position | Goals | Assists | Wins | Shutout |
---|---|---|---|---|
Centers | 1 | 1 | N/A | N/A |
Left/Right Wingers | 1 | 1 | N/A | N/A |
Defence | 1 | 1 | N/A | N/A |
Goalies | N/A | N/A | 2 | 1 |
Here we see that goalies are really only rewarded for winning a game, with a little extra if they record a shutout, as compared to our previous example.
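Table 2 translates directly into a small scoring function. The sketch below is illustrative only: the dictionary layout, the stat names and the `fantasy_points` helper are our own hypothetical choices, and stats marked N/A in the table are simply ignored.

```python
# Point values from Table 2. Skaters are centers, wingers and defence.
SCORING = {
    "skater": {"goals": 1, "assists": 1},
    "goalie": {"wins": 2, "shutouts": 1},
}

def fantasy_points(position, stats):
    """Score one stat line; stats that are N/A for the position are ignored."""
    kind = "goalie" if position == "G" else "skater"
    return sum(weight * stats.get(stat, 0) for stat, weight in SCORING[kind].items())

# Hypothetical stat lines.
center_points = fantasy_points("C", {"goals": 2, "assists": 1})
goalie_points = fantasy_points("G", {"wins": 1, "shutouts": 1, "assists": 1})
print(center_points, goalie_points)
```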
As playoffs are elimination based, it is not favorable for us to choose players from teams that we expect to lose. We therefore introduced an artificial win bias into the points scoring system during the optimization by awarding each player an additional 1.5 points for each game they have won. The idea here is that our optimization will now be biased towards teams that tend to win more games, which is ideal for playoff hockey. What should be noted is that formally, we're updating our points vector for each player
where
where
Our calculations of the mean returns vector and covariance terms are otherwise identical; we simply use the win-biased points instead.
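A minimal sketch of the win bias is shown below. We assume here that the bonus is added for every game the player's team won, and the `win_biased_points` helper and the three-game example are hypothetical.

```python
WIN_BONUS = 1.5  # extra points per game won, as described above

def win_biased_points(points, team_won):
    """Add the win bonus to a player's per-game points for every game won."""
    return [p + WIN_BONUS * int(won) for p, won in zip(points, team_won)]

# Hypothetical three-game stretch: points scored and whether the game was won.
points = [2.0, 0.0, 1.0]
team_won = [True, False, True]

biased = win_biased_points(points, team_won)
mean_biased = sum(biased) / len(biased)
print(biased, mean_biased)
```

The biased per-game values then replace the raw points when building the returns vector and covariance matrix.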