Loading and Processing Data using Spark

Project Aim

This project mainly aims to provide some samples of how to load data from either a CSV file or a Parquet table, along with code snippets for processing the loaded data. The code in this project is written in Scala. Apache Spark also allows the code to be written in Python or Java; however, this project targets efficiency, which is why Scala was used: Spark itself is implemented in Scala, so Scala generally offers better performance than the other languages mentioned above.
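
All the code snippets below assume an active SparkSession named spark. As a minimal sketch (the application name and master URL are illustrative assumptions, not taken from the project), such a session can be created like this:

    import org.apache.spark.sql.SparkSession

    // Minimal SparkSession setup (assumed; adjust the app name and master to your cluster)
    val spark = SparkSession.builder()
      .appName("HelloSpark")   // illustrative app name
      .master("local[*]")      // local mode; on the Cloudera VM a YARN master may be used instead
      .enableHiveSupport()     // needed later when saving tables to Hive
      .getOrCreate()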

Project Environment

To set up this project successfully, a few prerequisites are required to provide the right environment. Here are the essential tools:

  • Cloudera Quickstart VM

This is the virtual machine where most of our work will be done. Cloudera QuickStart virtual machines (VMs) include everything you need to try CDH, Cloudera Manager, Impala, and Cloudera Search. Hence, we will use it later on to connect to the Hive database.
    You can download Cloudera from this link.

  • Java Development Kit (JDK) 8

This is the software development environment that offers the collection of tools and libraries necessary for developing Java applications. This specific version is required for writing the Scala code in this project.
You can download JDK 8 from this link.

  • Apache Spark

This is the main component of our project; it enables us to apply processing operations to our data.
You can download Apache Spark from this link.

  • IntelliJ IDEA

This is the cross-platform IDE we use to write the Scala code that applies the Spark operations.
You can download IntelliJ IDEA from this link.
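
With these tools installed, the Spark libraries still need to be on the project classpath. Below is a minimal build.sbt sketch; the Scala and Spark versions are assumptions and should be matched to the installation on your VM:

    // build.sbt -- minimal sketch; version numbers are assumptions
    name := "HelloSpark"
    scalaVersion := "2.11.12"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "2.4.8",
      "org.apache.spark" %% "spark-sql"  % "2.4.8",
      "org.apache.spark" %% "spark-hive" % "2.4.8"   // required for saveAsTable against Hive
    )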

Dataset Used

In this project I used a randomly selected dataset from Kaggle: a simple collection of credit card transactions. You can access the dataset from this link. The dataset consists of the following fields:

Field Name              | Field Description
----------------------- | ---------------------------------------------------------
Account Number          | Represents the customer’s bank account number
Customer ID             | Represents a unique number for each bank customer
Credit Limit            | Represents the maximum amount that can be withdrawn
Available Money         | Represents the available debit balance
Transaction Date & Time | Represents the transaction date and time
Transaction Amount      | Represents the total amount of the transaction
Merchant Name           | Represents the name of the account accepting payments
Merchant Country        | Represents the country of the account accepting payments
Merchant Category       | Represents the category of the account accepting payments
Expiry Date             | Represents the expiry date of the credit card
Transaction Type        | Represents the type of the credit card transaction
Is Fraud?               | Represents whether the transaction is fraudulent or not
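
When loading the CSV, it can help to declare the schema explicitly rather than rely on inference. The sketch below shows one possible StructType for these fields; the column names are assumptions inferred from the code snippets later in this README, so verify them against the actual file header:

    import org.apache.spark.sql.types._

    // One possible explicit schema for the transactions CSV.
    // Column names are assumptions; check them against the file header.
    val transactionsSchema = StructType(Seq(
      StructField("accountNumber",        StringType,    nullable = true),
      StructField("customerId",           StringType,    nullable = true),
      StructField("creditLimit",          DoubleType,    nullable = true),
      StructField("availableMoney",       DoubleType,    nullable = true),
      StructField("transactionDate",      TimestampType, nullable = true),
      StructField("transactionAmount",    DoubleType,    nullable = true),
      StructField("merchantName",         StringType,    nullable = true),
      StructField("merchantCountryCode",  StringType,    nullable = true),
      StructField("merchantCategoryCode", StringType,    nullable = true),
      StructField("expiryDate",           StringType,    nullable = true),
      StructField("transactionType",      StringType,    nullable = true),
      StructField("isFraud",              BooleanType,   nullable = true)
    ))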

Sample Operations

In this section, I will provide a walkthrough of some sample operations used in this project. For further operations along with their output results, please check the following Technical Report.

  • Operation #1: Load data from CSV into a Spark Data Frame:

    Code Snippet:

    // Read the CSV with a header row; all columns are loaded as strings by default
    val transactions = spark.read
      .format("csv")
      .option("header", "true")
      .load("/home/transactions.csv")
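
    A hedged variant (not part of the project's code): with the schema sketched in the Dataset section, the same load yields typed columns instead of plain strings.

    // Load with the hypothetical explicit schema from the sketch above
    val typedTransactions = spark.read
      .format("csv")
      .option("header", "true")
      .schema(transactionsSchema)
      .load("/home/transactions.csv")
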
  • Operation #2: Load data from Parquet Table into a Spark Data Frame:

    Code Snippet:

    import org.apache.spark.sql.SaveMode

    // Save the DataFrame as a Parquet-backed Hive table, overwriting it if it exists
    transactions.write.mode(SaveMode.Overwrite).format("parquet").saveAsTable("bdp.hv_parq")

    // Read the table back into a DataFrame through Spark SQL
    val df = spark.sql("SELECT * FROM bdp.hv_parq")
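
    As a hedged alternative, Parquet data can also be written and read directly by path, without going through the Hive metastore (the path below is illustrative):

    // Direct Parquet round-trip without a Hive table (illustrative path)
    transactions.write.mode(SaveMode.Overwrite).parquet("/home/transactions_parquet")
    val dfDirect = spark.read.parquet("/home/transactions_parquet")
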
  • Operation #3: Find out how many fraud transactions there are:

    Code Snippet:

    // Count transactions per value of the IsFraud flag
    transactions.groupBy("IsFraud").count().show()

    Output:

    IsFraud | Count
    true    | 11302
    false   | 630612

  • Operation #4: Find out the top 5 merchants (according to number of transactions):

    Code Snippet:

    import org.apache.spark.sql.functions.desc

    // Count transactions per merchant and keep the five largest counts
    transactions.groupBy("Merchant_Name").count().orderBy(desc("count")).limit(5).show()

    Output:

    Merchant_Name | Count
    Lyft          | 25311
    Uber          | 25263
    gap.com       | 13824
    apple.com     | 13607
    target.com    | 13601

  • Operation #5: Find out which transaction types are available, arranged according to popularity:

    Code Snippet:

    // Count transactions per type, most popular first
    transactions.groupBy("Trans_Type").count().orderBy(desc("count")).show()

    Output:

    Trans_Type           | Count
    PURCHASE             | 608685
    ADDRESS_VERIFICATION | 16478
    REVERSAL             | 16162
    Others               | 589

  • Operation #6: Find out the number of customers that go to the gym and eat fast food:

    Code Snippet:

    // Register the DataFrame as a temporary view so it can be queried with SQL
    transactions.createOrReplaceTempView("TransactionsTable")

    // Customers with at least one gym transaction, intersected with
    // customers with at least one fast-food transaction
    val snippet = spark.sql(
      """SELECT COUNT(*) FROM (
        (SELECT customerId FROM TransactionsTable
         WHERE merchantCategoryCode = 'gym' GROUP BY customerId)
        INTERSECT
        (SELECT customerId FROM TransactionsTable
         WHERE merchantCategoryCode = 'fastfood' GROUP BY customerId))""")
    val results = snippet.collect()
    results.foreach(println)

    Output:

    90
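
    For comparison, here is a sketch of the same count written with the DataFrame API instead of SQL; it should be equivalent under the column names assumed above:

    import org.apache.spark.sql.functions.col

    // Distinct customers per category, intersected, then counted
    val gymCustomers      = transactions.filter(col("merchantCategoryCode") === "gym").select("customerId").distinct()
    val fastfoodCustomers = transactions.filter(col("merchantCategoryCode") === "fastfood").select("customerId").distinct()
    println(gymCustomers.intersect(fastfoodCustomers).count())   // expected to print 90
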
  • Operation #7: Find out the customers who tried McDonalds before Hardee's and never tried Five Guys:

    Code Snippet:

    // Earliest McDonalds transaction date per customer
    val mcdonalds = spark.sql(
      """
      SELECT customerId, MIN(transactionDate) AS startDate
      FROM TransactionsTable
      WHERE merchantName LIKE 'McDonalds%'
      GROUP BY customerId
      ORDER BY MIN(transactionDate)
      """)

    // Earliest Hardee's transaction date per customer
    val hardees = spark.sql(
      """
      SELECT customerId, MIN(transactionDate) AS startDate
      FROM TransactionsTable
      WHERE merchantName LIKE "Hardee's%"
      GROUP BY customerId
      ORDER BY MIN(transactionDate)
      """)

    mcdonalds.createOrReplaceTempView("McDonalds_Table")
    hardees.createOrReplaceTempView("Hardees_Table")

    // Customers whose first McDonalds visit precedes their first Hardee's visit
    val Mc_Hardees = spark.sql(
      """
      SELECT McDonalds_Table.customerId
      FROM McDonalds_Table JOIN Hardees_Table
        ON McDonalds_Table.customerId = Hardees_Table.customerId
      WHERE McDonalds_Table.startDate < Hardees_Table.startDate
      """)

    Mc_Hardees.createOrReplaceTempView("McHardees")

    // Drop every customer who has any Five Guys transaction
    val result = spark.sql(
      """
      (SELECT customerId FROM McHardees)
      EXCEPT
      (SELECT customerId
       FROM TransactionsTable
       WHERE merchantName LIKE 'Five Guys%'
       GROUP BY customerId)
      """)

    result.show()

    Output:

    customerId
    847174168
    667315366
    477081008
  • Operation #8: Find out the top 3 customers by total spending on food/fast food/food delivery in each month:

    Code Snippet:

    // Food-related transactions together with their transaction month
    val foodCustomers = spark.sql(
      """
      SELECT customerId, transactionAmount, MONTH(transactionDate) AS Month
      FROM TransactionsTable
      WHERE merchantCategoryCode = 'fastfood' OR
            merchantCategoryCode = 'food' OR
            merchantCategoryCode = 'food_delivery'
      ORDER BY Month
      """)
    foodCustomers.createOrReplaceTempView("FoodCustomers")

    // Total each customer's monthly spending, rank customers within each month,
    // and keep the top three per month
    val result = spark.sql(
      """
      SELECT Month, customerId, TotalAmount FROM (
        SELECT Month, customerId, TotalAmount,
               ROW_NUMBER() OVER (PARTITION BY Month
                                  ORDER BY Month, TotalAmount DESC) AS rn
        FROM (
          SELECT Month, customerId, ROUND(SUM(transactionAmount), 2) AS TotalAmount
          FROM FoodCustomers
          GROUP BY Month, customerId) AS Rank) AS Result
      WHERE rn <= 3
      """)

    result.show()

    Output:

    Month | customerId | TotalAmount
    1     | 314506271  | 43343.01
    1     | 456044564  | 41831.58
    1     | 772212779  | 38482.51
    2     | 314506271  | 36458.69
    2     | 772212779  | 31741.49
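
    For comparison, the same ranking can be sketched with the DataFrame API and a window specification (an equivalent formulation, not the project's code):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, desc, round, row_number, sum}

    // Total spending per (Month, customerId)
    val monthlyTotals = foodCustomers
      .groupBy("Month", "customerId")
      .agg(round(sum("transactionAmount"), 2).as("TotalAmount"))

    // Rank customers within each month by total spending, keep the top three
    val byMonth = Window.partitionBy("Month").orderBy(desc("TotalAmount"))

    monthlyTotals
      .withColumn("rn", row_number().over(byMonth))
      .filter(col("rn") <= 3)
      .drop("rn")
      .orderBy(col("Month"), desc("TotalAmount"))
      .show()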
