Chapter 2

Chapter 2 of the book contains some first Spark examples. This page explains how to run them on your machine.

Reading a text into a Spark DataFrame

This is a very simple first task to test your Spark environment. The task is to read a text file from disk into a Spark DataFrame and print the first 10 lines of the text.

This is implemented in the Python Jupyter notebook Chapter-02.ipynb.

The M&M task

The task of the program is to analyze data describing the consumption of M&Ms of different colors at events in different states. The file mnm_dataset.csv contains 100,000 lines of data points. Each data point describes the consumption of M&Ms of a certain color at an event in a certain state. Here is an example of the data points in the file:

State,Color,Count
TX,Red,20
NV,Blue,66
CO,Blue,79

The first line of the file contains the schema and names the three columns. The file mnm_dataset.csv and the Python file mnmcount.py can be downloaded from the GitHub repo of the book.

The task of the program is to aggregate all data points by state and color and sort the output by the number of M&Ms consumed. In a second output, the program filters the results for the state of California (CA).

Running the M&M Example

Download the Python file mnmcount.py to a local directory on your machine which is running Spark.

Download the data file mnm_dataset.csv into the subdirectory data.

Run the command:

$SPARK_HOME/bin/spark-submit mnmcount.py data/mnm_dataset.csv

To suppress the INFO messages printed to the console, set the log level in log4j.rootCategory to WARN (for example, log4j.rootCategory=WARN, console) in the log4j.properties file in the directory $SPARK_HOME/conf on your machine.

This is the output of the program:

+-----+------+-----+
|State|Color |Total|
+-----+------+-----+
|CA   |Yellow|1807 |
|WA   |Green |1779 |
|OR   |Orange|1743 |
|TX   |Green |1737 |
|TX   |Red   |1725 |
|CA   |Green |1723 |
|CO   |Yellow|1721 |
|CA   |Brown |1718 |
|CO   |Green |1713 |
|NV   |Orange|1712 |
|TX   |Yellow|1703 |
|NV   |Green |1698 |
|AZ   |Brown |1698 |
|WY   |Green |1695 |
|CO   |Blue  |1695 |
|NM   |Red   |1690 |
|AZ   |Orange|1689 |
|NM   |Yellow|1688 |
|NM   |Brown |1687 |
|UT   |Orange|1684 |
|NM   |Green |1682 |
|UT   |Red   |1680 |
|AZ   |Green |1676 |
|NV   |Yellow|1675 |
|NV   |Blue  |1673 |
|WA   |Red   |1671 |
|WY   |Red   |1670 |
|WA   |Brown |1669 |
|NM   |Orange|1665 |
|WY   |Blue  |1664 |
|WA   |Yellow|1663 |
|WA   |Orange|1658 |
|CA   |Orange|1657 |
|NV   |Brown |1657 |
|CA   |Red   |1656 |
|CO   |Brown |1656 |
|UT   |Blue  |1655 |
|AZ   |Yellow|1654 |
|TX   |Orange|1652 |
|AZ   |Red   |1648 |
|OR   |Blue  |1646 |
|UT   |Yellow|1645 |
|OR   |Red   |1645 |
|CO   |Orange|1642 |
|TX   |Brown |1641 |
|NM   |Blue  |1638 |
|AZ   |Blue  |1636 |
|OR   |Green |1634 |
|UT   |Brown |1631 |
|WY   |Yellow|1626 |
|WA   |Blue  |1625 |
|CO   |Red   |1624 |
|OR   |Brown |1621 |
|TX   |Blue  |1614 |
|OR   |Yellow|1614 |
|NV   |Red   |1610 |
|CA   |Blue  |1603 |
|WY   |Orange|1595 |
|UT   |Green |1591 |
|WY   |Brown |1532 |
+-----+------+-----+

Total Rows = 60
+-----+------+-----+
|State|Color |Total|
+-----+------+-----+
|CA   |Yellow|1807 |
|CA   |Green |1723 |
|CA   |Brown |1718 |
|CA   |Orange|1657 |
|CA   |Red   |1656 |
|CA   |Blue  |1603 |
+-----+------+-----+