Final project for TC3041 - Advanced Databases, where I try to forecast the winner of the Liga MX championship by using a soccer feed and analyzing sports experts' tweets.
Juan Carlos Sanchez Cruz - A01631462
This project seeks to answer a simple question: who will win the Liga MX 2018 season? In Mexico, football is discussed almost in political language, and it is no coincidence that there are so many sports analysis programs. The sport is a business that revolves around the media. Live broadcasts of the matches are very expensive because of the size of the audience they draw. After every game, the conversation turns to interpreting the events and plays. What draws my attention most, however, is the speculation created beforehand, and the fact that the Mexican league is very difficult to predict. The betting odds never fully favor one side, and many factors can affect a team's performance.
What I'm looking for in this project is a prediction that uses historical data about the teams in the playoffs and complements it with the positive or negative opinions expressed in the tweets of a selected group of sports analysts.
The data flows in the system as follows:
The numbers indicate the chronological order in which the processes run. The two data-mining sources can be processed in parallel.
There are two main sources for the data needed.
After looking through many sports feeds that could give me the information I needed as JSON, I found Fantasy Data, a RESTful web API that provides updated historical data for free.
It took me 3 steps to gather the information.
- Search for the roundId so I can query the right information from the API.
- Pull a JSON from the API that has the relevant information about the teams in the league.
- Insert the information into the database, including the following attributes:
  - Table position
  - Games played, wins, losses, and draws.
  - Goals for and against.
  - Streak over the last games.
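The third step can be sketched as a small mapping function. This is only an illustration: the input keys (`Name`, `Order`, `Games`, and so on) are hypothetical field names, not the documented Fantasy Data schema, and the output keys are names chosen for this example.

```python
from typing import Any, Dict


def standing_to_document(standing: Dict[str, Any]) -> Dict[str, Any]:
    """Map one raw standings entry to the attributes stored per team."""
    # ASSUMPTION: every key read from `standing` is a hypothetical field
    # name; adjust them to whatever the API response actually contains.
    goals_for = standing["GoalsScored"]
    goals_against = standing["GoalsAgainst"]
    return {
        "team": standing["Name"],
        "position": standing["Order"],
        "played": standing["Games"],
        "wins": standing["Wins"],
        "losses": standing["Losses"],
        "draws": standing["Draws"],
        "goals_for": goals_for,
        "goals_against": goals_against,
        "goal_difference": goals_for - goals_against,
        # The streak format is not specified in this README, so it is
        # stored as received and defaults to an empty string.
        "streak": standing.get("Streak", ""),
    }
```

A list of such dictionaries could then be handed to a pymongo collection with `insert_many`, since each one is already a plain JSON-like document.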
To mine the data on Twitter, I used the Python library Tweepy, which allows you to pull up to 200 tweets per user.
The information extracted was:
- The tweet id
- The creation date
- Username
- The full tweet text
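A minimal sketch of that extraction, assuming an already authenticated `tweepy.API` object is passed in as `api`. The function names `tweet_to_record` and `fetch_recent` and the output keys are made up for this example; they are not taken from the project's code.

```python
from typing import Any, Dict, List


def tweet_to_record(status: Any) -> Dict[str, Any]:
    """Keep only the four fields listed above from a Tweepy Status-like object."""
    return {
        "id": status.id,
        "created_at": status.created_at.isoformat(),
        "username": status.user.screen_name,
        "full_text": status.full_text,
    }


def fetch_recent(api: Any, screen_name: str, count: int = 200) -> List[Dict[str, Any]]:
    """Pull one page of a user's timeline and reduce it to plain records."""
    # tweet_mode="extended" asks for the untruncated text, exposed as
    # `full_text` on each returned status.
    statuses = api.user_timeline(
        screen_name=screen_name, count=count, tweet_mode="extended"
    )
    return [tweet_to_record(status) for status in statuses]
```

Passing the API object in as a parameter keeps the credentials out of the function and makes it easy to try with a stub object.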
As each tweet is pulled, and before it is converted to JSON, the full_text attribute is cleaned by removing unnecessary line breaks and by replacing " " with '' to avoid format conflicts.
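The cleaning step might look like the sketch below. It rests on an assumption about the replacement described above: that double quotes are swapped for single quotes so the text cannot clash with JSON string delimiters. The function name `clean_full_text` is hypothetical.

```python
import re


def clean_full_text(text: str) -> str:
    """Collapse line breaks and extra whitespace, then swap double quotes."""
    # Line breaks, tabs, and repeated spaces all become a single space.
    text = re.sub(r"\s+", " ", text).strip()
    # ASSUMPTION: double quotes are replaced with single quotes; this is
    # one reading of the replacement described in the README.
    return text.replace('"', "'")
```

If the original replacement was something else, only the last line needs to change.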
Combining two types of data sources makes the inference and prediction more robust. The tangible data are simply variables that measure the success or failure of the football teams under examination.
This data belongs to the same domain as the question, since it is information that directly affects the teams being examined.
One of the main inputs that betting houses use to calculate the probability of victory or defeat in a sporting event is the historical statistics of the teams involved. The algorithm in this program uses the same method, with the particularity that the opinions of expert analysts are also taken into account.
The available data will never be fully sufficient, since other factors, however small, can completely alter the outcome of a football match. Still, the way the information is used here is reasonable enough to support a justified inference.
Although I tried to keep the data as consistent as possible with reality, I cannot guarantee a high level of consistency, especially for the information obtained through Twitter, since the sentiment analysis was not very robust and may not fully reflect the user's real opinion.
The data is uniform, since the same algorithm is applied with the same measures to every team in the playoffs.
MongoDB is a free, open-source, document-oriented database. I used it because it stores information as JSON-like documents, which is the format the system receives from the data mining. It also offers the advantages of a non-relational database, such as faster access to the values of some attributes when there are large amounts of data.
To determine each team's odds of winning the championship, several variables were used. The following formula was applied:
[(19 - position in league) + games won + goal difference + (games won in the last 5 played * 10)] * [(percentage of positive tweets * 0.1) + 1]
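The formula can be written as a short function. Two things here are assumptions, not facts from the project: the share of positive tweets is treated as a fraction between 0 and 1, and the conversion from scores to percentages (`to_percentages`) is one plausible normalization. All names are chosen for this example.

```python
from typing import Dict


def team_score(position: int, games_won: int, goal_difference: int,
               wins_last_five: int, positive_tweet_share: float) -> float:
    """Apply the scoring formula to one team."""
    # Statistical part: table position, wins, goal difference, recent form.
    stats = (19 - position) + games_won + goal_difference + wins_last_five * 10
    # Opinion part. ASSUMPTION: positive_tweet_share is a fraction in [0, 1],
    # so analyst opinion can raise the score by at most 10%.
    opinion = positive_tweet_share * 0.1 + 1
    return stats * opinion


def to_percentages(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores so they add up to 100 (hypothetical normalization)."""
    total = sum(scores.values())
    if total == 0:
        raise ValueError("scores must not sum to zero")
    return {team: 100 * score / total for team, score in scores.items()}
```

For example, `to_percentages({"Team A": team_score(1, 10, 15, 3, 0.5), "Team B": team_score(8, 7, 2, 1, 0.2)})` gives each team's share of the combined score. Note that a team with a strongly negative goal difference could get a negative score, which this simple normalization does not handle.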
To display the prediction result, I created a simple interface using elements from the Bootstrap Material Design library. The first thing the user sees is the question that this project is about, followed by a colorful graph of the teams. The percentages used in this graph are obtained from the scores that the algorithm assigned to each team and are stored as a separate attribute.
There is also a tweets section that shows the most relevant tweets from the competition's official hashtag.
You can see the output of ligamx-forecaster hosted on GitHub Pages. This version has no backend connection, and it uses the data sample extracted on May 2nd, 2018.
This project only works for the Liga MX - Torneo Clausura 2018. It only works for sports analysts who tweet in Spanish, and it depends on a custom-versioned environment.
This project definitely has several opportunities for improvement. More historical data could be mined to improve the prediction, and the feed could be connected to update automatically in real time. The sentiment analysis could certainly be improved to better interpret the analysts' opinions.
In this project:
- Data: The numbers that together make up the teams' statistics, as well as the text of each tweet.
- Information: What the data really means and how it reflects reality. For example, if a team is doing badly, this is likely to show in its statistics.
- Knowledge: The inference we make from the data, that is, the values with which we weigh the probability of success or failure of each team.
- Wisdom: The adjustments we make to the variables to judge more accurately. The opinions of a sports analyst come from experience in their field.
Without a doubt, this project required all the skills and tools covered in the course: framing a problem that can be solved with information from social networks, extracting, filtering, and using the data, and using a database that scales when storing large amounts of data.
You have to get your own keys first.
For the Fantasy Data API, you need to register here to obtain a free key. Then create a file named feedKey.py in the data-mining/api/ folder and add the variable api_key = '{api_key}'.
For the Twitter API, you need to register your application here to get your credentials. Then create a file named connection.py in the data-mining/tweets/ folder and add the variables:
ckey = "{consumer_key}"
csecret = "{consumer_secret}"
atoken = "{access_token}"
asecret = "{access_token_secret}"
After pulling the data, you need to place the app folder on a php5 apache2 server to run the interface. You also need to install the mongodb extension for php5. On macOS 10.12, the required driver is php56-mongodb, which I installed using brew install php56-mongodb or sudo pecl install mongodb.
The data pull can be tested by passing the appropriate argument to make.
- MongoDB
- PHP5
- Python 3.6
- Tweepy
- Bootstrap Material Design
- To all the sports analysts whose Twitter accounts were used for this project.
- Rodolfo Rubén Alvarez, teacher of the TC3041 course at ITESM.