This project showcases how to build a cloud data platform for the reporting and analysis of Formula 1 motorsport data using Azure Databricks.
Azure Databricks is a unified set of tools for building, deploying, sharing and maintaining enterprise-grade data solutions at scale.
The Databricks Lakehouse Platform integrates with various cloud storage services and manages and deploys cloud infrastructure on your behalf. The following Azure services are used in this project:
- Azure Data Lake Storage
- Azure Data Factory
- Azure Databricks
- Azure Key Vault
The requirements for this project are broken down into six parts:
- Ingest all 8 files into the Azure Data Lake.
- Ingested data must have the same schema applied.
- Ingested data must have audit columns.
- Ingested data must be stored in a columnar format (i.e. Parquet).
- We must be able to analyze the ingested data via SQL.
- Ingestion Logic must be able to handle incremental load.
- Join the key information required for reporting to create a new table.
- Join the key information required for analysis to create a new table.
- Transformed tables must have audit columns.
- We must be able to analyze the transformed data via SQL.
- Transformed data must be stored in a columnar format (i.e. Parquet).
- Transformation logic must be able to handle incremental load.
- We want to know the Driver Standings.
- We want to know the Constructor Standings.
- We want to know the dominant drivers.
- We want to know the dominant teams.
- Visualize the Outputs.
- Create Databricks dashboards.
- Scheduled to run every Sunday at 10 PM.
- Ability to monitor pipelines.
- Ability to rerun failed pipelines.
- Ability to set up alerts on failures.
- Ability to delete individual records.
- Ability to see history and time travel.
- Ability to roll back to a previous version.
The first step was to ingest the data from the Ergast API into the ADLS raw layer using Azure Data Factory. This was followed by ingestion and transformation from the raw layer to the processed layer using Databricks.
The data in this layer has the schema applied and is stored in a columnar format, i.e. Parquet. Partitions are also created where applicable.
This data is also converted from Parquet to Delta Lake to meet some of the non-functional requirements.
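As a hedged illustration of how Delta Lake covers those non-functional requirements (record-level deletes, history, time travel and rollback), the operations look roughly like this; the table name f1_processed.results is assumed for the example.

```python
# Illustrative only: Delta Lake operations behind the non-functional requirements.
# The table name f1_processed.results is an assumed example.

# Delete individual records
spark.sql("DELETE FROM f1_processed.results WHERE driver_id = 1")

# See the table history
spark.sql("DESCRIBE HISTORY f1_processed.results").show(truncate=False)

# Time travel: query an earlier version of the table
spark.sql("SELECT * FROM f1_processed.results VERSION AS OF 1").show()

# Roll back to a previous version
spark.sql("RESTORE TABLE f1_processed.results TO VERSION AS OF 1")
```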
The data in the processed layer is then transformed further into the presentation layer to meet the business requirements. It is also stored in Parquet and Delta Lake format.
We then use Databricks notebooks to analyze the data and create dashboards. We also connected Power BI for more sophisticated dashboards before finally scheduling the pipeline with Azure Data Factory pipelines.
The first thing we did was to create three containers in our ADLS account named raw, processed and presentation. These containers were then mounted to Databricks using an Azure service principal and Databricks secret scopes, as shown below.
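A minimal sketch of the mount logic is shown here; the secret scope name, secret key names and storage account name are illustrative placeholders rather than the exact values used in the project.

```python
# Sketch: mount an ADLS Gen2 container with a service principal.
# Scope name, key names and storage account name are placeholders.
def mount_adls(storage_account_name, container_name):
    # Fetch service principal credentials from the Databricks secret scope
    client_id = dbutils.secrets.get(scope="f1-scope", key="databricks-app-client-id")
    tenant_id = dbutils.secrets.get(scope="f1-scope", key="databricks-app-tenant-id")
    client_secret = dbutils.secrets.get(scope="f1-scope", key="databricks-app-client-secret")

    # OAuth configuration for the service principal
    configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint": f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }

    mount_point = f"/mnt/{storage_account_name}/{container_name}"

    # Mount the container only if it is not already mounted
    if not any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
        dbutils.fs.mount(
            source=f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/",
            mount_point=mount_point,
            extra_configs=configs,
        )

# Mount the three containers
for container in ["raw", "processed", "presentation"]:
    mount_adls("formula1dl", container)
```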
After mounting, we ingested the datasets into our raw container.
Next, we created an f1_raw database so that we could create external tables on top of the raw datasets, as seen below.
Note that we created external tables for all the raw datasets so that we could utilize the Delta Lakehouse architecture of Databricks.
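A minimal sketch of one such external table definition is given here; the path assumes the mount point created above and the circuits column list is only an example.

```python
# Sketch: raw database plus one external table over a raw CSV file.
# The path assumes the /mnt/formula1dl/raw mount point created earlier.
spark.sql("CREATE DATABASE IF NOT EXISTS f1_raw")

spark.sql("""
    CREATE TABLE IF NOT EXISTS f1_raw.circuits (
        circuitId INT,
        circuitRef STRING,
        name STRING,
        location STRING,
        country STRING,
        lat DOUBLE,
        lng DOUBLE,
        alt INT,
        url STRING
    )
    USING csv
    OPTIONS (path "/mnt/formula1dl/raw/circuits.csv", header "true")
""")
```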
We also created the f1_processed and f1_presentation databases, which house the managed tables created from the files transformed later.
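A short sketch of those database definitions, assuming the mount points above as their locations:

```python
# Databases for managed tables; LOCATION controls where managed table data
# is stored (paths assume the mount points created earlier).
spark.sql("CREATE DATABASE IF NOT EXISTS f1_processed LOCATION '/mnt/formula1dl/processed'")
spark.sql("CREATE DATABASE IF NOT EXISTS f1_presentation LOCATION '/mnt/formula1dl/presentation'")
```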
Each file was then ingested and transformed from the raw layer to the processed layer. The files come in different formats: some are CSV, some are JSON, and some are split across folders. This means they are ingested and transformed according to their file format, as shown below.
First, we ingest all the datasets based on their file formats and requirements.
We ingest the CSV files, followed by the JSON files.
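As a rough sketch, the two read patterns look like this; the circuits and constructors files and their schemas are used as examples, and the exact schemas differ per dataset.

```python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

# Explicit schema so the ingested data always has a known, consistent schema
circuits_schema = StructType([
    StructField("circuitId", IntegerType(), False),
    StructField("circuitRef", StringType(), True),
    StructField("name", StringType(), True),
    StructField("location", StringType(), True),
    StructField("country", StringType(), True),
    StructField("lat", DoubleType(), True),
    StructField("lng", DoubleType(), True),
    StructField("alt", IntegerType(), True),
    StructField("url", StringType(), True),
])

# CSV ingestion (circuits.csv used as an example file)
circuits_df = (spark.read
               .option("header", True)
               .schema(circuits_schema)
               .csv("/mnt/formula1dl/raw/circuits.csv"))

# JSON ingestion (constructors.json used as an example file, schema as a DDL string)
constructors_df = (spark.read
                   .schema("constructorId INT, constructorRef STRING, name STRING, nationality STRING, url STRING")
                   .json("/mnt/formula1dl/raw/constructors.json"))
```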
After ingesting all 8 files, we then transformed the datasets as required.
Transformation of all the datasets follows the same approach as shown above.
It is worth noting that of the 8 files ingested and transformed, four are implemented using full loads while the other four use incremental loads. Full (bulk) loads are handled differently from incremental loads.
After transformation, we write the full-load data to the processed container in Parquet format using the saveAsTable syntax, which also registers the data as a table in our f1_processed database.
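A minimal sketch of the full-load write, including an audit column; the circuits dataset is used as the example here.

```python
from pyspark.sql.functions import current_timestamp

# circuits_df is the DataFrame read from the raw layer in the ingestion sketch above.
# Add an audit column, then write the full load as a managed Parquet table.
final_df = circuits_df.withColumn("ingestion_date", current_timestamp())

(final_df.write
 .mode("overwrite")
 .format("parquet")
 .saveAsTable("f1_processed.circuits"))
```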
For the incremental loads, we take advantage of the Databricks Lakehouse architecture, which allows us to upsert data incrementally using the merge syntax.
We first define the merge statement as a function, as shown below:
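A hedged sketch of such a merge function is given here; the function name, its parameters and the use of the Delta Lake Python API are assumptions consistent with the approach described, not the exact project code.

```python
from delta.tables import DeltaTable

def merge_delta_data(input_df, db_name, table_name, merge_condition, partition_column):
    """Upsert input_df into a managed Delta table, creating it on the first run."""
    if spark.catalog.tableExists(f"{db_name}.{table_name}"):
        # Table already exists: merge (update matched rows, insert new ones)
        delta_table = DeltaTable.forName(spark, f"{db_name}.{table_name}")
        (delta_table.alias("tgt")
         .merge(input_df.alias("src"), merge_condition)
         .whenMatchedUpdateAll()
         .whenNotMatchedInsertAll()
         .execute())
    else:
        # First run: create a partitioned managed Delta table
        (input_df.write
         .mode("overwrite")
         .partitionBy(partition_column)
         .format("delta")
         .saveAsTable(f"{db_name}.{table_name}"))
```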
Then we load all four incremental datasets by calling the function, as shown below:
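For example, a call for one of the incremental datasets might look like this; the merge keys depend on each dataset's natural key and are illustrative.

```python
# results_df is the transformed results DataFrame; the merge keys are illustrative.
merge_condition = "tgt.result_id = src.result_id AND tgt.race_id = src.race_id"
merge_delta_data(results_df, "f1_processed", "results", merge_condition, "race_id")
```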
Now we have all our data in the processed layer, as well as managed tables in our f1_processed database. The next step is to transform the data further inside the presentation layer to meet the BI reporting requirements, while also creating tables in our f1_presentation database. We created the race_results, driver standings and constructor standings tables, which allow us to meet the reporting requirements as shown below.
To create the presentation data for our BI reporting, we read the various datasets needed for the transformation, as shown below:
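A sketch of the reads, assuming the processed table names used earlier:

```python
# Read the processed tables needed for race_results (table names assumed)
drivers_df = spark.table("f1_processed.drivers")
constructors_df = spark.table("f1_processed.constructors")
circuits_df = spark.table("f1_processed.circuits")
races_df = spark.table("f1_processed.races")
results_df = spark.table("f1_processed.results")
```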
Next, we join them and select the required columns, as seen below:
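A hedged sketch of the join; the column names are assumptions based on the Ergast dataset and may differ from the exact processed-layer names.

```python
# Join the processed tables (read above) and keep only the reporting columns
race_results_df = (results_df
    .join(races_df, results_df.race_id == races_df.race_id)
    .join(drivers_df, results_df.driver_id == drivers_df.driver_id)
    .join(constructors_df, results_df.constructor_id == constructors_df.constructor_id)
    .join(circuits_df, races_df.circuit_id == circuits_df.circuit_id)
    .select(
        races_df.race_year,
        races_df.name.alias("race_name"),
        circuits_df.location.alias("circuit_location"),
        drivers_df.name.alias("driver_name"),
        constructors_df.name.alias("team"),
        results_df.grid,
        results_df.fastest_lap,
        results_df.points,
        results_df.position,
    ))
```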
After that, we write the data to the presentation layer using the same incremental load approach, which also creates managed tables in the f1_presentation database created earlier.
The other presentation files followed the same approach as above.
We then analyzed the dataset to find out the dominant drivers and dominant teams over the years.
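For instance, a dominant-drivers query over the presentation layer might look like the sketch below; the f1_presentation.race_results table and its columns are assumed names.

```python
# Rank drivers by average points across all seasons (names are assumed)
dominant_drivers_df = spark.sql("""
    SELECT driver_name,
           COUNT(1) AS total_races,
           SUM(points) AS total_points,
           AVG(points) AS avg_points
      FROM f1_presentation.race_results
     GROUP BY driver_name
    HAVING COUNT(1) >= 50
     ORDER BY avg_points DESC
""")

# Render the result in the notebook (Databricks display helper)
display(dominant_drivers_df)
```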
For more sophisticated reporting, we could also connect our Databricks workspace to Power BI, as shown below:
To meet our scheduling and orchestration requirements, we use ADF pipelines.
We created pipelines for the ingestion files and transformation files, as shown below:
We then tied both pipelines together in a master pipeline with a trigger to automate the whole process, as shown below:
In conclusion, we were able to utilize the power of the Azure Databricks Lakehouse architecture to extract data in different file formats, transform it, and load it using both full and incremental loads.
- LinkedIn: https://www.linkedin.com/in/joshua-gbegudu/
- GitHub: github.com/jaykay04
- Instagram: @Official_jaykay04