- COVID WORKSHOP
- 1. Create the target bucket
- 2. ETL Process
- 3. Query data from our dataset
- 4. Visualization with Power BI
Name: Yesid Leonardo López Sierra
Course: Cloud Computing
The following workshop teaches you how to build an ETL process, connect it to Databricks, and create visualizations using Power BI.
First, let's create our S3 bucket in AWS with the following name: covid-workshop
Leave the rest of the configuration as default and click on the Create bucket button.
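If you prefer to create the bucket from code instead of the AWS console, a minimal sketch using boto3 looks like this (it assumes your AWS credentials are already configured locally; the region is an example):
import boto3
# Create the target bucket (us-east-1 needs no LocationConstraint;
# for other regions pass CreateBucketConfiguration={"LocationConstraint": region})
s3 = boto3.client("s3", region_name="us-east-1")
s3.create_bucket(Bucket="covid-workshop")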
Go to https://community.cloud.databricks.com/
Then go to the Clusters section and click on Create Cluster, using covid-cluster as the name.
Go to Workspace and create a new notebook with:
- Name: covid-notebook
- Language: Python
- Cluster: covid-cluster (the created cluster)
In the created notebook add the following code to read the data source.
from pyspark import SparkFiles
# Download the public COVID-19 open dataset and load it into a Spark DataFrame
url = "https://storage.googleapis.com/covid19-open-data/v2/main.csv"
spark.sparkContext.addFile(url)
df = spark.read.csv("file://" + SparkFiles.get("main.csv"), header=True, inferSchema=True)
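To make sure the dataset loaded correctly, you can optionally inspect the schema and a few rows before continuing:
# Optional sanity check on the loaded DataFrame
df.printSchema()
df.show(5)
print(df.count())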
Then create a mount for the covid-workshop bucket. For that, you need two variables, ACCESS_KEY and SECRET_KEY (to obtain them, create an IAM user with full permissions for S3 and generate its access keys).
access_key = "[ACCESS_KEY]"
secret_key = "[SECRET_KEY]"
# URL-encode the secret key in case it contains "/" characters
encoded_secret_key = secret_key.replace("/", "%2F")
aws_bucket_name = "covid-workshop"
mount_name = "yelopezcovid"
# Mount the S3 bucket under /mnt/yelopezcovid
dbutils.fs.mount(
  source = "s3a://%s:%s@%s" % (access_key, encoded_secret_key, aws_bucket_name),
  mount_point = "/mnt/%s" % mount_name)
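You can optionally confirm that the mount works by listing its contents from the notebook (it will be empty until we write the dataset):
# List the contents of the mounted bucket
display(dbutils.fs.ls("/mnt/yelopezcovid"))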
Finally, write the source dataset to the covid-workshop bucket (the target dataset).
output = '/mnt/yelopezcovid/covid.csv'
df.write.csv(output)
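Spark writes one part file per partition, so the output path is a folder that contains several files rather than a single covid.csv. If you prefer a single CSV file, an optional variant is to coalesce the DataFrame before writing (this is only a sketch; it funnels all data through one worker, so it suits small datasets):
# Optional: write a single CSV part file instead of one per partition
df.coalesce(1).write.mode("overwrite").csv(output, header=True)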
When you check the target bucket, there should be a folder with the following files:
Click here to check the Notebook
Before creating a query, we need to create a table to run the queries against.
First, let's create the Delta table:
archivo = "/mnt/yelopezcovid/delta/yelopezcovid/"
df.write.format("delta").mode("overwrite").option("overwriteSchema","true").save(archivo)
and create the external table:
tabla = "CREATE TABLE covid USING DELTA LOCATION '/mnt/yelopezcovid/delta/yelopezcovid/'"
spark.sql(tabla)
Then check that the table was created:
%sql
show tables;
The output should look something like this:
Once the Delta table is created, let's execute the following queries:
- How many cases have been reported around the world?
- Which are the most affected countries?
- Identify the most critical points in the United States
- Death rate by country
%sql
SELECT count(1) as CONTAGIOS_NIVEL_GLOBAL FROM covid
%sql
SELECT country_name, count(1) as CASES
FROM covid
GROUP BY country_name
ORDER BY CASES DESC
%sql
SELECT subregion1_name, count(1) as cases
FROM covid
WHERE country_code='US'
GROUP BY subregion1_name
ORDER BY cases DESC
%sql
SELECT country_name,
(SUM(new_deceased)/ IF(SUM(new_confirmed)=0,1,SUM(new_confirmed))) as MORTALIDAD
FROM covid
WHERE aggregation_level = 0
GROUP BY country_name
ORDER BY MORTALIDAD DESC
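If you prefer the DataFrame API over %sql cells, the death-rate query above can also be expressed in PySpark; this is an equivalent sketch, not part of the original notebook:
from pyspark.sql import functions as F
# Same logic as the SQL above: deaths divided by confirmed cases per country
mortality = (
    spark.table("covid")
    .where(F.col("aggregation_level") == 0)
    .groupBy("country_name")
    .agg((F.sum("new_deceased") /
          F.when(F.sum("new_confirmed") == 0, 1).otherwise(F.sum("new_confirmed"))
         ).alias("MORTALIDAD"))
    .orderBy(F.col("MORTALIDAD").desc())
)
display(mortality)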
Click here to check the Notebook
First, let's set up the connection between Power BI and Databricks. In Power BI, go to Get Data and type Spark in the search bar.
Then, back in Databricks, open the cluster configuration and click on the JDBC/ODBC tab.
From the JDBC URL:
jdbc:spark://community.cloud.databricks.com:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/AAAAA/BBBBB-jays123
Create the server URL from the JDBC URL:
https://community.cloud.databricks.com:443/sql/protocolv1/o/AAAAA/BBBBB-jays123
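In other words, the server URL is the host part of the JDBC URL followed by its httpPath value. The snippet below is only an illustration of that transformation (you can just assemble the pieces by hand):
# Illustrative only: build the Power BI server URL from the JDBC URL
jdbc_url = "jdbc:spark://community.cloud.databricks.com:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/AAAAA/BBBBB-jays123"
params = dict(p.split("=", 1) for p in jdbc_url.split(";")[1:])
host = jdbc_url.split("://", 1)[1].split("/", 1)[0]
server_url = "https://%s/%s" % (host, params["httpPath"])
print(server_url)  # https://community.cloud.databricks.com:443/sql/protocolv1/o/AAAAA/BBBBB-jays123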
Select HTTP as the protocol and click on Accept.
Then Power BI will ask you for the credentials:
Add them and click on Connect. Finally, import the covid table.
You can drag and drop the fields to create different charts, such as the following:
If you want to take a closer look at the chart, click on this link