This project showcases how to build a cloud data platform for the reporting and analysis of Formula 1 motorsport data using Azure Databricks.
Azure Databricks is a unified set of tools for building, deploying, sharing and maintaining enterprise-grade data solutions at scale.
The Databricks Lakehouse Platform integrates with various cloud storage services and manages and deploys cloud infrastructure on your behalf. The following Azure services are used in this project:
- Azure Data Lake Storage
- Azure Data Factory
- Azure Databricks
- Azure Key Vault
The requirements for this project are broken down into six parts:
- Ingest all 8 files into the Azure Data Lake.
- Ingested data must have the same schema applied.
- Ingested data must have audit columns.
- Ingested data must be stored in a columnar format (i.e. Parquet).
- We must be able to analyze the ingested data via SQL.
- Ingestion Logic must be able to handle incremental load.
- Join the key information required for reporting to create a new table.
- Join the key information required for analysis to create a new table.
- Transformed tables must have audit columns.
- We must be able to analyze the transformed data via SQL.
- Transformed data must be stored in a columnar format (i.e. Parquet).
- Transformation logic must be able to handle incremental load.
- We want to know the Driver Standings.
- We want to know the Constructor Standings.
- We want to know the dominant drivers.
- We want to know the dominant teams.
- Visualize the Outputs.
- Create Databricks dashboards.
- Scheduled to run every Sunday at 10 PM.
- Ability to monitor pipelines.
- Ability to rerun failed pipelines.
- Ability to set up alerts on failures.
- Ability to delete individual records.
- Ability to see history and time travel.
- Ability to roll back to a previous version.
The first step was to ingest the data from the Ergast API into the ADLS raw layer using Azure Data Factory. This was followed by ingestion and transformation from the raw layer to the processed layer using Databricks.
The data in this layer has the schema applied and is stored in a columnar format, i.e. Parquet. Partitions are also created where applicable.
This data is also converted from Parquet to Delta Lake to meet some of the non-functional requirements.
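As a hedged illustration of how Delta Lake covers those non-functional requirements (record-level deletes, history, time travel and rollback), the operations look roughly like this; the table name f1_processed.results is assumed for the example.

```python
# Illustrative only: Delta Lake operations behind the non-functional requirements.
# The table name f1_processed.results is an assumed example.

# Delete individual records
spark.sql("DELETE FROM f1_processed.results WHERE driver_id = 1")

# See the table history
spark.sql("DESCRIBE HISTORY f1_processed.results").show(truncate=False)

# Time travel: query an earlier version of the table
spark.sql("SELECT * FROM f1_processed.results VERSION AS OF 1").show()

# Roll back to a previous version
spark.sql("RESTORE TABLE f1_processed.results TO VERSION AS OF 1")
```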
The data in the processed layer is then transformed further into the presentation layer to meet the business requirements. It is also stored in Parquet and Delta Lake format.
We then use Databricks notebooks to analyze the data and create dashboards. We also connected Power BI for more sophisticated dashboards before finally scheduling the pipeline with Azure Data Factory pipelines.
The first thing we did was to create three containers in our ADLS account named raw, processed and presentation. These containers were then mounted to Databricks using an Azure service principal and Databricks secret scopes, as shown below.
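A minimal sketch of the mount logic is shown here; the secret scope name, secret key names and storage account name are illustrative placeholders rather than the exact values used in the project.

```python
# Sketch: mount an ADLS Gen2 container with a service principal.
# Scope name, key names and storage account name are placeholders.
def mount_adls(storage_account_name, container_name):
    # Fetch service principal credentials from the Databricks secret scope
    client_id = dbutils.secrets.get(scope="f1-scope", key="databricks-app-client-id")
    tenant_id = dbutils.secrets.get(scope="f1-scope", key="databricks-app-tenant-id")
    client_secret = dbutils.secrets.get(scope="f1-scope", key="databricks-app-client-secret")

    # OAuth configuration for the service principal
    configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint": f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }

    mount_point = f"/mnt/{storage_account_name}/{container_name}"

    # Mount the container only if it is not already mounted
    if not any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
        dbutils.fs.mount(
            source=f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/",
            mount_point=mount_point,
            extra_configs=configs,
        )

# Mount the three containers
for container in ["raw", "processed", "presentation"]:
    mount_adls("formula1dl", container)
```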
After mounting, we ingested the datasets into our raw container.
Next, we created an f1_raw database so that we could create external tables on top of the raw datasets, as seen below.
Note that we created external tables for all the raw datasets so that we could utilize the Delta Lakehouse architecture of Databricks.
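A minimal sketch of one such external table definition is given here; the path assumes the mount point created above and the circuits column list is only an example.

```python
# Sketch: raw database plus one external table over a raw CSV file.
# The path assumes the /mnt/formula1dl/raw mount point created earlier.
spark.sql("CREATE DATABASE IF NOT EXISTS f1_raw")

spark.sql("""
    CREATE TABLE IF NOT EXISTS f1_raw.circuits (
        circuitId INT,
        circuitRef STRING,
        name STRING,
        location STRING,
        country STRING,
        lat DOUBLE,
        lng DOUBLE,
        alt INT,
        url STRING
    )
    USING csv
    OPTIONS (path "/mnt/formula1dl/raw/circuits.csv", header "true")
""")
```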
We also created the f1_processed and f1_presentation databases, which house the managed tables created from the files transformed later.
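A short sketch of those database definitions, assuming the mount points above as their locations:

```python
# Databases for managed tables; LOCATION controls where managed table data
# is stored (paths assume the mount points created earlier).
spark.sql("CREATE DATABASE IF NOT EXISTS f1_processed LOCATION '/mnt/formula1dl/processed'")
spark.sql("CREATE DATABASE IF NOT EXISTS f1_presentation LOCATION '/mnt/formula1dl/presentation'")
```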
Each file was then ingested and transformed from the raw layer to the processed layer. The files come in different formats: some are CSV, some are JSON, and some are split across folders. This means they are ingested and transformed according to their file format, as shown below.
First, we ingest all the datasets based on their file formats and requirements.
We ingest the CSV files, followed by the JSON files.
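As a rough sketch, the two read patterns look like this; the circuits and constructors files and their schemas are used as examples, and the exact schemas differ per dataset.

```python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

# Explicit schema so the ingested data always has a known, consistent schema
circuits_schema = StructType([
    StructField("circuitId", IntegerType(), False),
    StructField("circuitRef", StringType(), True),
    StructField("name", StringType(), True),
    StructField("location", StringType(), True),
    StructField("country", StringType(), True),
    StructField("lat", DoubleType(), True),
    StructField("lng", DoubleType(), True),
    StructField("alt", IntegerType(), True),
    StructField("url", StringType(), True),
])

# CSV ingestion (circuits.csv used as an example file)
circuits_df = (spark.read
               .option("header", True)
               .schema(circuits_schema)
               .csv("/mnt/formula1dl/raw/circuits.csv"))

# JSON ingestion (constructors.json used as an example file, schema as a DDL string)
constructors_df = (spark.read
                   .schema("constructorId INT, constructorRef STRING, name STRING, nationality STRING, url STRING")
                   .json("/mnt/formula1dl/raw/constructors.json"))
```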
After ingesting all 8 files, we then transformed the datasets as required.
Transformation of all the datasets follows the same approach as shown above.
It is worth noting that of the 8 files ingested and transformed, four are implemented using full loads while the other four use incremental loads. Full (bulk) loads are handled differently from incremental loads.
After transformation, we write the full-load data to the processed container in Parquet format using the saveAsTable syntax, which also registers the data as a table in our f1_processed database.
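A minimal sketch of the full-load write, including an audit column; the circuits dataset is used as the example here.

```python
from pyspark.sql.functions import current_timestamp

# circuits_df is the DataFrame read from the raw layer in the ingestion sketch above.
# Add an audit column, then write the full load as a managed Parquet table.
final_df = circuits_df.withColumn("ingestion_date", current_timestamp())

(final_df.write
 .mode("overwrite")
 .format("parquet")
 .saveAsTable("f1_processed.circuits"))
```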
For the incremental loads, we take advantage of the Databricks Lakehouse architecture, which allows us to upsert data incrementally using the merge syntax.
We first define the merge statement as a function, as shown below:
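A hedged sketch of such a merge function is given here; the function name, its parameters and the use of the Delta Lake Python API are assumptions consistent with the approach described, not the exact project code.

```python
from delta.tables import DeltaTable

def merge_delta_data(input_df, db_name, table_name, merge_condition, partition_column):
    """Upsert input_df into a managed Delta table, creating it on the first run."""
    if spark.catalog.tableExists(f"{db_name}.{table_name}"):
        # Table already exists: merge (update matched rows, insert new ones)
        delta_table = DeltaTable.forName(spark, f"{db_name}.{table_name}")
        (delta_table.alias("tgt")
         .merge(input_df.alias("src"), merge_condition)
         .whenMatchedUpdateAll()
         .whenNotMatchedInsertAll()
         .execute())
    else:
        # First run: create a partitioned managed Delta table
        (input_df.write
         .mode("overwrite")
         .partitionBy(partition_column)
         .format("delta")
         .saveAsTable(f"{db_name}.{table_name}"))
```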
Then we load all four incremental datasets by calling the function, as shown below:
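For example, a call for one of the incremental datasets might look like this; the merge keys depend on each dataset's natural key and are illustrative.

```python
# results_df is the transformed results DataFrame; the merge keys are illustrative.
merge_condition = "tgt.result_id = src.result_id AND tgt.race_id = src.race_id"
merge_delta_data(results_df, "f1_processed", "results", merge_condition, "race_id")
```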
Now we have all our data in the processed layer, as well as managed tables in our f1_processed database. The next step is to transform the data further inside the presentation layer to meet the BI reporting requirements, while also creating tables in our f1_presentation database. We created the race_results, driver standings and constructor standings tables, which allow us to meet the reporting requirements as shown below.
To create the presentation data for our BI reporting, we read the various datasets needed for the transformation, as shown below:
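A sketch of the reads, assuming the processed table names used earlier:

```python
# Read the processed tables needed for race_results (table names assumed)
drivers_df = spark.table("f1_processed.drivers")
constructors_df = spark.table("f1_processed.constructors")
circuits_df = spark.table("f1_processed.circuits")
races_df = spark.table("f1_processed.races")
results_df = spark.table("f1_processed.results")
```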
Next, we join them and select the required columns, as seen below:
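A hedged sketch of the join; the column names are assumptions based on the Ergast dataset and may differ from the exact processed-layer names.

```python
# Join the processed tables (read above) and keep only the reporting columns
race_results_df = (results_df
    .join(races_df, results_df.race_id == races_df.race_id)
    .join(drivers_df, results_df.driver_id == drivers_df.driver_id)
    .join(constructors_df, results_df.constructor_id == constructors_df.constructor_id)
    .join(circuits_df, races_df.circuit_id == circuits_df.circuit_id)
    .select(
        races_df.race_year,
        races_df.name.alias("race_name"),
        circuits_df.location.alias("circuit_location"),
        drivers_df.name.alias("driver_name"),
        constructors_df.name.alias("team"),
        results_df.grid,
        results_df.fastest_lap,
        results_df.points,
        results_df.position,
    ))
```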
After that, we write the data to the presentation layer using the same incremental load approach, which also creates managed tables in the f1_presentation database created earlier.
The other presentation files followed the same approach as above.
We then analyzed the dataset to find out the dominant drivers and dominant teams over the years.
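For instance, a dominant-drivers query over the presentation layer might look like the sketch below; the f1_presentation.race_results table and its columns are assumed names.

```python
# Rank drivers by average points across all seasons (names are assumed)
dominant_drivers_df = spark.sql("""
    SELECT driver_name,
           COUNT(1) AS total_races,
           SUM(points) AS total_points,
           AVG(points) AS avg_points
      FROM f1_presentation.race_results
     GROUP BY driver_name
    HAVING COUNT(1) >= 50
     ORDER BY avg_points DESC
""")

# Render the result in the notebook (Databricks display helper)
display(dominant_drivers_df)
```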
For more sophisticated reporting, we could also connect our Databricks workspace to Power BI, as shown below:
To meet our scheduling and orchestration requirements, we use ADF pipelines.
We created pipelines for the ingestion files and transformation files, as shown below:
We then tied both pipelines together in a master pipeline with a trigger to automate the whole process, as shown below:
In conclusion, we were able to utilize the power of the Azure Databricks Lakehouse architecture to extract data in different file formats, transform it, and load it using both full and incremental loads.
- LinkedIn: https://www.linkedin.com/in/joshua-gbegudu/
- GitHub: github.com/jaykay04
- Instagram: @Official_jaykay04