- COVID WORKSHOP
- 1. Create the target bucket
- 2. ETL Process
- 3. Query data from our dataset
- 4. Visualization with Power BI
Name: Yesid Leonardo López Sierra
Course: Cloud Computing
The following workshop teaches you how to build an ETL process, connect it to Databricks, and create visualizations using Power BI.
First, let's create our S3 bucket in AWS with the following name: covid-workshop
Leave the rest of the configuration as default and click on the Create bucket button.
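If you prefer to create the bucket from code instead of the AWS console, a minimal sketch using boto3 looks like this (it assumes your AWS credentials are already configured locally; the region is an example):
import boto3
# Create the target bucket (us-east-1 needs no LocationConstraint;
# for other regions pass CreateBucketConfiguration={"LocationConstraint": region})
s3 = boto3.client("s3", region_name="us-east-1")
s3.create_bucket(Bucket="covid-workshop")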
Go to https://community.cloud.databricks.com/
Then go to the Clusters section and click on Create Cluster, using covid-cluster as the name.
Go to Workspace and create a new notebook with:
- Name: covid-notebook
- Language: Python
- Cluster: covid-cluster (the created cluster)
In the created notebook add the following code to read the data source.
from pyspark import SparkFiles
# Download the public COVID-19 open dataset and load it into a Spark DataFrame
url = "https://storage.googleapis.com/covid19-open-data/v2/main.csv"
spark.sparkContext.addFile(url)
df = spark.read.csv("file://" + SparkFiles.get("main.csv"), header=True, inferSchema=True)
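To make sure the dataset loaded correctly, you can optionally inspect the schema and a few rows before continuing:
# Optional sanity check on the loaded DataFrame
df.printSchema()
df.show(5)
print(df.count())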
Then create a mount for the covid-workshop bucket. For that, you need two variables, ACCESS_KEY and SECRET_KEY (to obtain them, create an IAM user with full permissions for S3 and generate its access keys).
access_key = "[ACCESS_KEY]"
secret_key = "[SECRET_KEY]"
# URL-encode the secret key in case it contains "/" characters
encoded_secret_key = secret_key.replace("/", "%2F")
aws_bucket_name = "covid-workshop"
mount_name = "yelopezcovid"
# Mount the S3 bucket under /mnt/yelopezcovid
dbutils.fs.mount(
  source = "s3a://%s:%s@%s" % (access_key, encoded_secret_key, aws_bucket_name),
  mount_point = "/mnt/%s" % mount_name)
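You can optionally confirm that the mount works by listing its contents from the notebook (it will be empty until we write the dataset):
# List the contents of the mounted bucket
display(dbutils.fs.ls("/mnt/yelopezcovid"))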
Finally, write the source dataset to the covid-workshop bucket (the target dataset).
output = '/mnt/yelopezcovid/covid.csv'
df.write.csv(output)
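Spark writes one part file per partition, so the output path is a folder that contains several files rather than a single covid.csv. If you prefer a single CSV file, an optional variant is to coalesce the DataFrame before writing (this is only a sketch; it funnels all data through one worker, so it suits small datasets):
# Optional: write a single CSV part file instead of one per partition
df.coalesce(1).write.mode("overwrite").csv(output, header=True)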
When you check the target bucket, there should be a folder with the following files:
Click here to check the Notebook
Before creating a query, we need to create a table to run the queries against.
First, let's create the Delta table:
archivo = "/mnt/yelopezcovid/delta/yelopezcovid/"
df.write.format("delta").mode("overwrite").option("overwriteSchema","true").save(archivo)
and create the external table:
tabla = "CREATE TABLE covid USING DELTA LOCATION '/mnt/yelopezcovid/delta/yelopezcovid/'"
spark.sql(tabla)
Then check that the table was created:
%sql
show tables;
The output should look something like this:
Once the Delta table is created, let's execute the following queries:
- How many cases have been reported around the world?
- Which are the most affected countries?
- Identify the most critical points in the United States
- Death rate by country
%sql
SELECT count(1) as CONTAGIOS_NIVEL_GLOBAL FROM covid
%sql
SELECT country_name, count(1) as CASES
FROM covid
GROUP BY country_name
ORDER BY CASES DESC
%sql
SELECT subregion1_name, count(1) as cases
FROM covid
WHERE country_code='US'
GROUP BY subregion1_name
ORDER BY cases DESC
%sql
SELECT country_name,
(SUM(new_deceased)/ IF(SUM(new_confirmed)=0,1,SUM(new_confirmed))) as MORTALIDAD
FROM covid
WHERE aggregation_level = 0
GROUP BY country_name
ORDER BY MORTALIDAD DESC
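If you prefer the DataFrame API over %sql cells, the death-rate query above can also be expressed in PySpark; this is an equivalent sketch, not part of the original notebook:
from pyspark.sql import functions as F
# Same logic as the SQL above: deaths divided by confirmed cases per country
mortality = (
    spark.table("covid")
    .where(F.col("aggregation_level") == 0)
    .groupBy("country_name")
    .agg((F.sum("new_deceased") /
          F.when(F.sum("new_confirmed") == 0, 1).otherwise(F.sum("new_confirmed"))
         ).alias("MORTALIDAD"))
    .orderBy(F.col("MORTALIDAD").desc())
)
display(mortality)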
Click here to check the Notebook
First, let's set up the connection between Power BI and Databricks. In Power BI, go to Get Data and type Spark in the search bar.
Then, back in Databricks, open the cluster configuration and click on the JDBC/ODBC tab.
From the JDBC URL:
jdbc:spark://community.cloud.databricks.com:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/AAAAA/BBBBB-jays123
Create the server URL from the JDBC URL:
https://community.cloud.databricks.com:443/sql/protocolv1/o/AAAAA/BBBBB-jays123
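In other words, the server URL is the host part of the JDBC URL followed by its httpPath value. The snippet below is only an illustration of that transformation (you can just assemble the pieces by hand):
# Illustrative only: build the Power BI server URL from the JDBC URL
jdbc_url = "jdbc:spark://community.cloud.databricks.com:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/AAAAA/BBBBB-jays123"
params = dict(p.split("=", 1) for p in jdbc_url.split(";")[1:])
host = jdbc_url.split("://", 1)[1].split("/", 1)[0]
server_url = "https://%s/%s" % (host, params["httpPath"])
print(server_url)  # https://community.cloud.databricks.com:443/sql/protocolv1/o/AAAAA/BBBBB-jays123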
Select HTTP as the protocol and click on Accept.
Then Power BI will ask you for the credentials:
Add them and click on Connect. Finally, import the covid table.
You can drag and drop the fields to create different charts, such as the following:
If you want to take a closer look at the chart, click on this link