
Linköping Hockey Analytics Conference - LINHAC 2022 | Given the event data, generate findings/patterns related to sequences of events leading up to a particular outcome.


chayansraj/LINHAC-2022-Data-Science-Student-Competition


Linköping Hockey Analytics Conference - LINHAC 2022

Source: google

Competition Link - https://www.ida.liu.se/research/sportsanalytics/LINHAC/LINHAC22/studentcompetition.html

Task

Given the event data, generate findings/patterns related to sequences of events leading up to a particular outcome. You can choose the kind of outcome. For instance, find characteristics of sequences of events leading to, e.g., a goal or a successful zone entry or a shot.


My idea:

Temporal activity outcome prediction of players in Ice Hockey

Source: google

Temporal activity prediction is a challenging task in sports, and especially in ice hockey, because of the game's fast dynamics and the interactions among players. In this paper, I try to predict whether a player's next action will be successful, given its location and other parameters describing the event.

Data

The data was provided by Sportlogiq with permission of the SHL, the Swedish Hockey League, and represents event data from the 2020-2021 SHL season. It consists of 76041 rows and 22 features. Each game is identified by a unique game id, and each row is a time-stamped event: an action performed by a player of one of the two teams from a certain location on the rink, together with whether it was successful. Given this structure, I split the data into training, validation and test sets by match, with matches uniquely identified by their ‘gameid’: if one match is used as validation data and another as test data, all the other matches are used as training data. This lets the model see almost the whole data space and extract complex patterns within each match. A summary dictionary is as below:

source: Author
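The game-wise split described above can be sketched as follows. This is a minimal illustration, not the repository's code: only the ‘gameid’ column and the id 66445 come from the text, while `split_by_game`, the other column names and the other game ids are hypothetical.

```python
def split_by_game(rows, val_game, test_game, key="gameid"):
    """Partition event rows so that every game ends up in exactly one split."""
    train, val, test = [], [], []
    for row in rows:
        if row[key] == val_game:
            val.append(row)
        elif row[key] == test_game:
            test.append(row)
        else:
            train.append(row)
    return train, val, test


# Toy events: "eventname", "outcome" and ids 70001-70003 are made up.
events = [
    {"gameid": 66445, "eventname": "pass", "outcome": 1},
    {"gameid": 66445, "eventname": "shot", "outcome": 0},
    {"gameid": 70001, "eventname": "pass", "outcome": 1},
    {"gameid": 70002, "eventname": "carry", "outcome": 1},
    {"gameid": 70003, "eventname": "pass", "outcome": 0},
]

train, val, test = split_by_game(events, val_game=66445, test_game=70001)
train_games = {row["gameid"] for row in train}
```

Splitting by game rather than by row keeps events of the same match from leaking between the training and evaluation sets.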

Methodology

Exploratory data analysis

First, an exhaustive exploratory data analysis was performed on a single game, ‘gameid’ 66445. It contains the time-stamped events of a match between two teams encoded as 742 and 916, where every time stamp describes an event performed by a player of one of the teams from a certain location on the rink and whether the action was successful or not.

Algorithms

As an initial step, I trained an ensemble of four residual neural networks in a loop over all ‘gameid’ values with a stride of 2: the first ensemble uses games 0 and 1 as test and validation sets respectively, the next uses games 2 and 3, and so forth. I then average the accuracy of the four networks in each ensemble and choose the games that produced the highest accuracy. Neural networks can learn complex patterns given enough data, so the data is fed to this ensemble first to identify the most explainable data blocks for the next ensemble step.

source: google
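The stride-2 loop over games could look roughly like the sketch below. It assumes that "0 and 1" refer to positions in the sorted list of game ids; `game_pairs`, `best_pair` and the accuracy numbers are hypothetical placeholders, not results from the competition data.

```python
from statistics import mean


def game_pairs(game_ids, stride=2):
    """Yield (test_game, val_game) pairs: positions 0 and 1, then 2 and 3, ..."""
    ordered = sorted(set(game_ids))
    for i in range(0, len(ordered) - 1, stride):
        yield ordered[i], ordered[i + 1]


def best_pair(accuracies_by_pair):
    """Pick the pair whose four networks have the highest mean accuracy."""
    return max(accuracies_by_pair, key=lambda pair: mean(accuracies_by_pair[pair]))


# Hypothetical game ids; with an odd count the last game stays unpaired.
pairs = list(game_pairs([66445, 70001, 70002, 70003, 70004]))

# Placeholder accuracies of the four networks per pair (illustrative only).
accuracies_by_pair = {
    pairs[0]: [0.80, 0.82, 0.79, 0.81],
    pairs[1]: [0.84, 0.83, 0.85, 0.82],
}
chosen = best_pair(accuracies_by_pair)
```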

Ensemble learning often outperforms any single machine learning algorithm, so the final model chosen for this paper is again an ensemble, this time of four classification algorithms: K-Nearest Neighbors, Logistic Regression, Random Forests and Support Vector Machines. The structure of the setup looks like this:

source: Author
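One way to combine the four classifiers is scikit-learn's `VotingClassifier`. The sketch below assumes soft voting, default hyperparameters, synthetic data and a random split; the text does not state how the original ensemble combines its members, so none of these choices should be read as the author's setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the event features; the real data is not reproduced here.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Distance- and margin-based models get a scaler; the forest does not need one.
ensemble = VotingClassifier(
    estimators=[
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
        ("logreg", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("svm", make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))),
    ],
    voting="soft",  # assumption: average predicted class probabilities
)
ensemble.fit(X_train, y_train)

predictions = ensemble.predict(X_test)
accuracy = ensemble.score(X_test, y_test)
```

Soft voting requires every member to expose class probabilities, which is why `SVC` is created with `probability=True`.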


Tools

  1. TensorFlow (An end-to-end open source machine learning platform - https://www.tensorflow.org/)
  2. Keras (Python deep learning API - https://keras.io/)
  3. Optuna (Hyperparameter optimization framework to automate hyperparameter search - https://optuna.org/)
  4. SHAP (Interpretable Machine Learning Framework - https://github.com/slundberg/shap)
  5. sklearnex (Intel® Extension for Scikit-learn - https://intel.github.io/scikit-learn-intelex/)
  6. Graphviz (Open source graph visualization software - https://graphviz.org/)

Result

The models are evaluated with accuracy, precision, recall and F1-score. This is a binary classification task: accuracy is the share of correct classifications among all samples, while precision and recall compensate for the limitations of accuracy when class labels are imbalanced. The F1-score, the harmonic mean of precision and recall, combines the two into a single metric that ranges from 0 to 1, where a higher value means a better prediction.
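To make the relationship between these metrics concrete, here is a small self-contained sketch that computes them from confusion counts. The function name `binary_metrics` and the toy labels are illustrative; the example shows how a classifier that always predicts the majority class still reaches high accuracy while its F1-score for the minority class is 0.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Undefined ratios (no predicted or no actual positives) are reported as 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Imbalanced toy labels: 2 positives, 8 negatives.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

always_negative = binary_metrics(y_true, [0] * 10)
one_hit = binary_metrics(y_true, [1, 0, 0, 0, 0, 0, 0, 0, 0, 1])
```

Here `always_negative` has an accuracy of 0.8 but an F1-score of 0, whereas `one_hit` has the same accuracy with precision, recall and F1 all equal to 0.5.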

source: Author

The final F1-score is 0.85.


After optimizing the hyperparameters, we evaluated the model on the test data and achieved a weighted average F1-score of 0.85, which is ~3% higher than using residual neural networks alone. Interpreting the influence of each feature on the model output with SHAP gives insight into which features are important for further analysis.

Importance of each feature: To get an overview of which features are important for a model we can plot the SHAP values of every feature for every sample.

source: Author


Contribution of each feature: The above explanation shows how each feature contributes to pushing the model output from the base value (the average model output over the training dataset we passed) to the final model output. Features pushing the prediction higher are shown in red, those pushing the prediction lower are shown in blue.

source: Author
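The additive reading of SHAP values (base value plus per-feature contributions equals the model output) can be checked on a toy case. The sketch below does not use the `shap` library or the model from this project: it assumes a linear model with independent features, for which the SHAP value of feature i has the closed form w_i * (x_i - mean(x_i)). The weights, bias and data are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # toy data: 200 samples, 3 features
weights = np.array([1.5, -2.0, 0.0])   # hypothetical linear model
bias = 0.3


def model(inputs):
    return inputs @ weights + bias


# Base value: the average model output over the dataset.
base_value = model(X).mean()

# Closed-form SHAP values for a linear model with independent features.
# Positive entries push the output above the base value, negative ones below.
shap_values = (X - X.mean(axis=0)) * weights   # shape (n_samples, n_features)

# Additivity: base value + sum of contributions recovers each prediction.
reconstructed = base_value + shap_values.sum(axis=1)

# Global importance: mean absolute SHAP value per feature.
importance = np.abs(shap_values).mean(axis=0)
```

The third feature has weight 0, so its contribution and importance are exactly zero, which is the kind of signal a SHAP summary plot makes visible.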

Thanks!

Source: Author
