Skip to content

Streaming data pipeline to extract and analyze real time football data from Twitter

License

Notifications You must be signed in to change notification settings

AlessandroMessori/Football-Twitter-Streaming

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

30 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Football Twitter Streaming

This is a project for the Unstructured and Streaming Data Engineering Course from the M.Sc. of Computer Science and Engineering of Politecnico di Milano

It's a Data pipeline to analyze real time football data extracted from Twitter posts.

Architecture

The technology stack used for this project is the following:

  • Kafka and Confluent, as Streaming frameworks
  • Twitter Developer API and Confluent Connector, as data source
  • PySpark, to make processing queries on the Kafka topics
  • Elasticsearch and Kibana, for data storage and visualization respectively

alt text

Setup

This tutorial assumes you have Docker and Docker-Compose installed on your system, if you don't you can get them from https://www.docker.com/

Once you have Docker set up,you can clone the repository:

git clone https://github.com/AlessandroMessori/Football-Twitter-Streaming

Inside the cloned the repo, you need to replace the asterisks in the "twitter_connector_config" file in the config folder with your Twitter Developer credentials.

If you don't have them you can get a set here: https://developer.twitter.com/

To start the docker cluster cd into the cloned directory run the following command:

docker-compose up -d

Depending on your Docker installation you might need to use sudo.

After the command have been executed, you should now have a working Kafla cluster with a single Broker.

You can access the Confluent control center at http://localhost:9021/

For a guide on how to use the control center refer to https://docs.confluent.io/platform/current/control-center/index.html

You should now upload the configurations for the Twitter Source connector and the Elasticsearch Sink connector from the config file,you can do it in the Connectors section of the Control Center

alt text

You can access a Jupyter Notebook with PySpark installed and run all stream analysis queries at http://localhost:8888/ (access token: usde)

To visualize the results of the computation you can connect to Kibana at http://localhost:5601/.

In order to visualize the results you need to connect Kibana to Elastichsearch,by creating an index pattern for each of the Kafka topics. You can do so in Kibana in the Managment --> Stack Managment --> Index Patterns section.

alt text

You can create your own visualization or use the one I already configured,you can load it by uploading the file "config/dashboard.ndjson" in the Managment --> Stack Managment --> Saved Objects section of Kibana.

alt text

Limitations and Future Improvements

This project has the objective of counting the most popular football topics on Twitter; It does its job nicely but there are still a few problems that should be addressed in the future:

  • Multiword topics aren't considered (e.g. "Real Madrid","Manchester United")
  • Only topics corresponding to the most popular teams and players in the Fifa dataset are considered, other relevant topics such as coaches,staff members or competitions are at the moment not considered in the word count.
  • The number of hastags monitored it's limited, for now I only considered the word "football" translated in the languages of the most important european leagues (english,spanish,italian,french and german). It would be nice to find a way to monitor more hashtags without adding any bias to the data collection (for example if I started monitoring "#Juventus" the data distribution would be skewed towards players of this particular team).

About

Streaming data pipeline to extract and analyze real time football data from Twitter

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published