Data & AI Tech Immersion Workshop – Product Review Guide and Lab Instructions
A modern data warehouse lets you easily bring together all your data at any scale and get insights through analytical dashboards, operational reports, or advanced analytics for all your users.
- Combine all your structured, unstructured, and semi-structured data (logs, files, and media) by using Azure Data Factory to move it into Azure Blob Storage.
- Use the data in Azure Blob Storage to perform scalable analytics with Azure Databricks, producing cleansed and transformed data.
- Cleansed and transformed data can be moved to Azure SQL Data Warehouse to combine with existing structured data, creating one hub for all your data. Leverage native connectors between Azure Databricks and Azure SQL Data Warehouse to access and move data at scale.
- Build operational reports and analytical dashboards on top of Azure SQL Data Warehouse to derive insights from the data, and use Azure Analysis Services to serve thousands of end users.
- Run ad hoc queries directly on data within Azure Databricks.
Mapping Data Flows in Azure Data Factory provide a way to transform data at scale without writing any code. You design a data transformation job in the data flow designer by constructing a series of transformations: start with any number of source transformations, follow them with data transformation steps, and then complete your data flow with a sink to land your results in a destination.
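To make the source, transformation, and sink pattern concrete, here is a minimal sketch of the same shape in plain pandas. The file name and the derived column are illustrative assumptions; an actual mapping data flow is translated into Spark jobs that ADF manages for you rather than local Python.

```python
# Illustrative only: the source -> transform -> sink shape of a data flow,
# expressed as a tiny pandas script. File name and derived column are placeholders.
import pandas as pd

trips = pd.read_csv("trip_data_1.csv")                        # source
trips["trip_time_in_mins"] = trips["trip_time_in_secs"] / 60  # transformation step
trips.to_csv("transformed_trips.csv", index=False)            # sink
```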
Like many organizations, ContosoAuto generates data from numerous systems, each of which has its own location and format, including structured, unstructured, and semi-structured data. They would like the ability to combine and analyze these disparate datasets in order to gain actionable insights that can help them operate their business more efficiently.
In this experience, you will see how you can use the capabilities of Azure Data Factory (ADF) to visually create data flows that read, transform and write data.
- Browse to the Azure Portal and navigate to the `tech-immersion-XXXXX` resource group and select the Azure Data Factory instance `tech-immersion-df-XXXXX`.
- On the Overview blade, select Author & Monitor.
- Select the Author tab from the left-hand side.
- On the Factory Resources panel, select + and then choose Pipeline.
- On the Activities panel, in the Search Activities text box, type `data flow`.
- Drag the Data Flow entry that appears onto the pipeline design surface to the right and drop it.
- In the Adding Data Flow panel, choose Create new data flow, select Mapping Data Flow, and select OK.
- Select Finish in the tooltip that appears.
- Populate the Data Flow Name field on the General tab with a value of `FleetCalculations`.
- In the tooltip that appears, click Finish.
- Select the Add Source area on the data flow design surface.
- In the tooltip that appears, read the guide text and select Next until it is dismissed.
- Near the top of the window, toggle Data Flow Debug on and then select OK in the dialog that appears. This will provision a small Databricks cluster behind the scenes that you will use to inspect your data as you are building the data flow. It takes about 5 minutes to be ready, but you can continue with the next steps while it starts up.
- Select the new source dataset item, and then in the property panel, select the Source Settings tab.
- Select + New next to Source dataset.
- On the New Dataset panel, select Azure Blob Storage and then select Continue.
- On the Select Format panel, select DelimitedText and select Continue.
- On Set Properties, set the Name to `trip_data_input` and, under Linked service, select `AzureBlobStorage`, which points to a storage account that has been added for you.
- On Set Properties, for the File Path, configure the following values:
  - Container: `data`
  - File: `trip_data_1.csv`
  - First row as header: checked
- Select Finish.
- You now have your source CSV file configured in the data flow.
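If you would like to confirm from code that the file this dataset points to is present, a quick check with the azure-storage-blob Python package looks like the sketch below. The connection string is a placeholder you would copy from the storage account's Access keys blade; treat this as an optional verification, not part of the lab steps.

```python
# Sketch: verify that trip_data_1.csv exists in the "data" container of the
# storage account behind the AzureBlobStorage linked service.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")  # placeholder
blob = service.get_blob_client(container="data", blob="trip_data_1.csv")
print("trip_data_1.csv found:", blob.exists())
```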
- Check whether the Data Flow Debug toggle has turned solid blue and shows a green dot. If it has, your cluster is ready and you can continue. If not, please wait at this step until it is ready before continuing.
- With `source1` still selected in the data flow designer, select Source Settings in the property panel and set the Output stream name to `tripdata`.
- In the property panel, select Projection and then select Detect data type. This will sample your data and automatically set the data type for each of the columns.
- The process will run for a few moments. When it completes, review the data type assignments in the table.
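Detect data type works much like any dataframe library that samples a file and infers a type per column. The pandas sketch below shows the idea against a local copy of the trip data file (the local path is an assumption; the designer does this for you against the blob).

```python
# Sketch: sample rows from the CSV and infer a data type for each column,
# similar in spirit to Projection > Detect data type.
import pandas as pd

sample = pd.read_csv("trip_data_1.csv", nrows=1000)  # sample only, like the designer does
print(sample.dtypes)
```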
- You can preview the data output from any data flow node by selecting that node in the designer, selecting the Data Preview tab in the property panel, and then selecting Refresh.
- In a few moments, the data preview will load with the output data for that node.
- In the Data Flow design surface, below tripdata, select Add Source.
- For the Output stream name, provide `tripfare`, and then select + New to the right of Source dataset.
- On New Dataset, select Azure Blob Storage and select Continue.
- On Select Format, choose DelimitedText and select Continue.
- For Set Properties, set the Name to `trip_fare_input`, for the Linked service select `AzureBlobStorage`, and for the File Path provide the following, then select Finish:
  - Container: `data`
  - File: `trip_fare_1.csv`
  - First row as header: checked
- With the tripfare node selected, in the property panel select Projection and then select Detect data type to set the schema for this data source.
- At this point, you should have both the tripdata and tripfare data sources on the design surface.
- Select the + on the bottom-right of the tripdata source and then select Join.
- In the property panel, under Join Settings, for Right stream select tripfare. Leave the Join type as `Inner`. For the Join conditions, match up the fields as follows (click the + that appears when hovering over a condition row to add a new condition); a conceptually equivalent join in pandas is sketched after this list:
  - hack_license : hack_license
  - medallion : medallion
  - vendor_id : vendor_id
  - pickup_datetime : pickup_datetime
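If it helps to see the join in code, the pandas sketch below performs the same inner join on the four key columns. The local file paths are assumptions, and skipinitialspace=True is there because the trip fare file's header appears to contain spaces after the delimiter (which is also why the aggregate step later escapes a column name with braces).

```python
# Sketch: inner-join the trip data and trip fare files on the same four key
# columns used in the Join transformation. Local file paths are placeholders.
import pandas as pd

trips = pd.read_csv("trip_data_1.csv", skipinitialspace=True)
fares = pd.read_csv("trip_fare_1.csv", skipinitialspace=True)  # strips spaces after commas, including in the header

joined = trips.merge(
    fares,
    how="inner",
    on=["hack_license", "medallion", "vendor_id", "pickup_datetime"],
)
print(joined.shape)
```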
- Select the + to the bottom-right of the Join1 node, and select Aggregate.
- On the property panel, under Aggregate Settings, select Group By. For Join1's columns, select `vendor_id`.
- Select the Aggregates tab (to the right of Group By). Add the following aggregates (use the + that appears when hovering over each row to add another column); a conceptually equivalent aggregation in pandas is sketched after the note below:
  - passenger_count : `round(sum(passenger_count), 2)`
  - trip_time_in_secs : `round(sum(trip_time_in_secs)/60, 2)`
  - trip_distance : `round(sum(trip_distance), 2)`
  - TotalTripFare : `round(sum({ total_amount}), 2)`

  NOTE: TotalTripFare is not a source column, so type its name rather than selecting it.
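The aggregate expressions above use the data flow expression language; the `{ total_amount}` form escapes a column name that appears to begin with a space in the source file. A conceptually equivalent group-by in pandas is sketched below, self-contained and using the same assumed local copies of the files as the join sketch.

```python
# Sketch: group by vendor_id and compute the same rounded sums as the
# Aggregate transformation (trip_time_in_secs is converted to minutes).
import pandas as pd

trips = pd.read_csv("trip_data_1.csv", skipinitialspace=True)
fares = pd.read_csv("trip_fare_1.csv", skipinitialspace=True)
joined = trips.merge(fares, on=["hack_license", "medallion", "vendor_id", "pickup_datetime"])

vendor_stats = (
    joined.groupby("vendor_id")
    .agg(
        passenger_count=("passenger_count", "sum"),
        trip_time_in_secs=("trip_time_in_secs", "sum"),
        trip_distance=("trip_distance", "sum"),
        TotalTripFare=("total_amount", "sum"),
    )
    .reset_index()
)
vendor_stats["trip_time_in_secs"] = vendor_stats["trip_time_in_secs"] / 60  # seconds -> minutes
print(vendor_stats.round(2))
```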
- At this point, your data flow should show the tripdata and tripfare sources feeding into Join1, which feeds into Aggregate1.
- Select the + to the bottom-right of the Aggregate1 node, and select Sink.
- Select Finish on the tooltip that appears.
- With sink1 selected, on the property panel, select + New next to Sink dataset.
- On New Dataset, select Azure Blob Storage and select Continue.
- On Select Format, select DelimitedText and select Continue.
- For Set Properties, set the Name to `vendorstats`, for the Linked service select `AzureBlobStorage`, and for the File Path provide the following, then select Finish:
  - Container: `data`
  - File: Leave blank
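The sink lands its CSV output in the same `data` container you used for the sources. Purely as an aside, if you ever need to drop a file into that container yourself (for example, a result produced in a notebook), a hedged azure-storage-blob sketch looks like this; the connection string and the local file are assumptions.

```python
# Sketch: upload a locally produced CSV into the "data" container, the same
# destination the data flow's sink writes to. The connection string is a placeholder.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("data")
with open("vendorstats.csv", "rb") as f:  # assumed local result file
    container.upload_blob(name="vendorstats.csv", data=f, overwrite=True)
```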
- Save your data flow and pipeline by selecting Publish All. It will take a few moments to publish.
- From the Factory Resources panel, select pipeline1.
- From the button bar, select Add trigger, then Trigger Now, and then select Finish to execute your pipeline and the data flow it contains.
- Select Monitor from the menu tabs on the left. Watch the status of your pipeline as it runs. Select Refresh at any point to update the listing. It should take about 7 minutes to completely execute your pipeline.
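Trigger Now and the Monitor tab also have programmatic equivalents. The sketch below uses the azure-identity and azure-mgmt-datafactory packages to start pipeline1 and poll its run status; the subscription ID is a placeholder, the resource group and factory names follow the naming used in this lab, and the exact method names should be treated as assumptions against the SDK version you have installed.

```python
# Sketch: trigger pipeline1 and poll the run status, roughly what Trigger Now
# plus the Monitor tab do in the UI. Subscription ID is a placeholder.
import time
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, factory = "tech-immersion-XXXXX", "tech-immersion-df-XXXXX"

run = adf.pipelines.create_run(rg, factory, "pipeline1")
status = "InProgress"
while status in ("Queued", "InProgress"):
    time.sleep(30)
    status = adf.pipeline_runs.get(rg, factory, run.run_id).status
    print("pipeline1 run status:", status)
```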
- When the pipeline has completed, navigate to the `tech-immersion-XXXXX` resource group, select the `samplefiles` storage account, and select Blobs and then the `data` folder.
- You should see a new file that is the output of your pipeline (one CSV file is output per partition of the data)!
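You can also check for the output from code rather than the portal. The azure-storage-blob sketch below lists everything in the `data` container of the samplefiles account; the connection string is a placeholder, and the exact output file name is generated by the data flow, so it is not predicted here.

```python
# Sketch: list blobs in the "data" container to find the CSV partition file(s)
# written by the data flow's sink. Connection string is a placeholder.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<samplefiles-connection-string>")
for blob in service.get_container_client("data").list_blobs():
    print(blob.name, blob.size)
```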
In this task, you will set up your ADLS Gen2 filesystem, and then review and execute ADF pipelines to copy data from various sources, including Cosmos DB, into your ADLS Gen2 filesystem.
- In a web browser, navigate to the Azure portal, select Resource groups from the left-hand menu, and then select the resource group named `tech-immersion-XXXXX` (where XXXXX is the unique identifier assigned to you for this workshop).
- Prior to using ADF to move data into your ADLS Gen2 instance, you must create a filesystem in ADLS Gen2. Within the resource group, select the storage account whose name is like `adlsstrgXXXXX`.
- On the Overview blade, look under services and select Data Lake Gen 2 file systems.
- Select + File System, and in the dialog that appears, enter `contosoauto` for the name of the file system and select OK.
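Creating the filesystem can also be scripted. The sketch below uses the azure-storage-file-datalake package against the adlsstrgXXXXX account; the account URL and key are placeholders, and the call is an optional alternative to the portal step above.

```python
# Sketch: create the "contosoauto" filesystem in the ADLS Gen2 account,
# equivalent to the + File System step in the portal. Account name and key
# are placeholders.
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<adlsstrgXXXXX>.dfs.core.windows.net",
    credential="<storage-account-key>",
)
service.create_file_system(file_system="contosoauto")
```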
- Switch back to the Azure Data Factory Author page.
- On the ADF Author page, select Pipelines to expand the list, and then select the CopyData pipeline from the list.
The `CopyData` pipeline consists of three copy activities. Two of the activities connect to your Azure SQL Database instance to retrieve vehicle data from tables there. The third connects to Cosmos DB to retrieve batch vehicle telemetry data. Each of the copy activities writes data into files in ADLS Gen2.

- On the pipeline toolbar, select Add Trigger, then select Trigger Now, and then select Finish on the Pipeline Run dialog. You will receive a notification that the `CopyData` pipeline is running.
- To observe the pipeline run, select the Monitor icon from the left-hand menu, which will bring up a list of active and recent pipeline runs.
On the pipeline runs monitor page, you can see all active and recent pipeline runs. The Status field provides an indication of the state of the pipeline run, from In Progress to Failed or Canceled. You also have the option to filter by Status and set custom date ranges to view runs with a specific status within a specific time period.
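The same information can be pulled with the management SDK if you prefer code over the monitor page. The sketch below queries runs updated in the last day and prints their status; subscription, resource group, and factory names are placeholders, and the query call and model names should be treated as assumptions against your installed SDK version.

```python
# Sketch: query recent pipeline runs and print their status, similar to the
# pipeline runs monitor page. Resource names are placeholders.
from datetime import datetime, timedelta
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
runs = adf.pipeline_runs.query_by_factory(
    "tech-immersion-XXXXX",
    "tech-immersion-df-XXXXX",
    RunFilterParameters(
        last_updated_after=datetime.utcnow() - timedelta(days=1),
        last_updated_before=datetime.utcnow(),
    ),
)
for run in runs.value:
    print(run.pipeline_name, run.status, run.run_start)
```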
- Select the Activity Runs icon under Actions for the currently running pipeline to view the status of the individual activities which make up the pipeline.
The Activity Runs view allows you to monitor individual activities within your pipelines. In this view, you can see the amount of time each activity took to execute, as well as select the various icons under Actions to view the inputs, outputs, and details of each activity run. As with pipeline runs, you are provided with the Status of each activity.
In this experience, you used Azure Data Factory (ADF) and the Copy Activity to move data from Azure SQL Database and Cosmos DB into Azure Data Lake Storage Gen2. You also used mapping data flows to create a data processing pipeline using the visual designer.
To continue learning and expand your understanding, use the links below.