Spark Benchmark suite to evaluate cluster configuration and compare the performance with other big data frameworks.

kwartile/spark-benchmark

Spark Benchmark

Overview

The Spark Benchmark suite helps you evaluate a Spark cluster configuration. It can also be used to compare the speed, throughput, and resource usage of Spark jobs with other big data frameworks such as Impala and Hive. It contains a set of Spark RDD-based jobs that perform map, filter, reduceByKey, and join operations.

Data

The benchmark uses the dataset used for Impala performance measurement (http://docs.aws.amazon.com/emr/latest/DeveloperGuide/query-impala-generate-data.html). The dataset consists of three different files:

  • Books
  • Customers
  • Transactions
> head books
0|5-54687-602-6|FOREIGN-LANGUAGE-STUDY|1989-09-29|Saraiva|83.99
1|7-20527-497-2|PHILOSOPHY|1999-09-14|Kyowon|40.99
2|8-98211-350-2|JUVENILE-NONFICTION|1975-01-11|Wolters Kluwer|173.99
3|6-52228-529-3|MATHEMATICS|2010-06-26|Bungeishunju|24.99
4|8-98702-825-4|HUMOR|1990-07-15|China Publishing Group Corporate|64.99
5|3-11023-371-2|LITERARY-CRITICISM|1971-06-04|AST|137.99
> head customers
0|Sophia PERKINS|1975-11-18|F|OK|[email protected]|963-341-4876
1|Brianna MURRAY|2001-11-02|F|MT|[email protected]|260-164-6277
2|James SCOTT|1997-09-17|M|UT|[email protected]|920-899-8587
3|Samuel GREEN|2013-05-22|F|CO|[email protected]|263-707-8321
4|Logan COLEMAN|1997-12-10|F|NE|[email protected]|333-318-5685
5|Matthew BENNETT|1975-12-26|M|CO|[email protected]|717-808-3733
6|Jace SPENCER|2013-10-30|M|KS|[email protected]|448-105-3939
> head transactions
0|29948726|124004825|21|2000-10-03 12:08:37
1|76896577|10225228|17|2001-04-23 15:21:18
2|77394742|62037151|23|2008-02-22 11:52:36
3|23558280|21960491|29|2000-06-22 10:14:48
4|5742930|73207419|15|2004-11-26 00:46:53
5|101531051|122609274|13|2008-01-14 05:26:46
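
The sample rows above are pipe-delimited with no header line. As a minimal sketch of how such rows can be split into typed records, the plain-Python snippet below parses one book row and one transaction row. The field names (isbn, publish_date, publisher, customer_id, and so on) are assumptions inferred from the sample values and from the column names used in the Hive/Impala queries; they are not taken from the benchmark's source code.

```python
from typing import NamedTuple


class Book(NamedTuple):
    # Field names are assumed from the sample rows, not from the benchmark code.
    id: int
    isbn: str
    category: str
    publish_date: str
    publisher: str
    price: float


class Transaction(NamedTuple):
    # The order of customer_id and book_id is an assumption.
    id: int
    customer_id: int
    book_id: int
    quantity: int
    transaction_date: str


def parse_book(line: str) -> Book:
    """Split one pipe-delimited books row into a typed record."""
    f = line.rstrip("\n").split("|")
    return Book(int(f[0]), f[1], f[2], f[3], f[4], float(f[5]))


def parse_transaction(line: str) -> Transaction:
    """Split one pipe-delimited transactions row into a typed record."""
    f = line.rstrip("\n").split("|")
    return Transaction(int(f[0]), int(f[1]), int(f[2]), int(f[3]), f[4])


book = parse_book("1|7-20527-497-2|PHILOSOPHY|1999-09-14|Kyowon|40.99")
txn = parse_transaction("2|77394742|62037151|23|2008-02-22 11:52:36")
print(book.category, book.price)
print(txn.quantity, txn.transaction_date[:4])
```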

Benchmark

The benchmark contains four tests:

  • RDDScan: reads the customers file and performs a filter operation. It is equivalent to a select-where statement in SQL.
  • RDDAggregate: scans the books file and performs a reduceByKey to count the books in each category. It then sorts the results by book count.
  • RDDTwoWayJoin: joins books with transactions made between 2008 and 2010, aggregates the results by book category, and returns the results sorted by total transaction amount.
  • RDDThreeWayJoin: similar to RDDTwoWayJoin, except that it performs an additional join with the customers table and filters the results on three states.
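
To make the scan and aggregate logic concrete, here is a rough plain-Python sketch of what those two tests compute. It does not use Spark and is not the benchmark's implementation: a generator filter stands in for the RDD filter, and a Counter stands in for reduceByKey. The customer rows are trimmed to their first five fields, the name being searched for is just an example, and the last book row is a hypothetical addition so that one category appears twice.

```python
from collections import Counter

# Customer rows from the sample data, trimmed to the first five fields.
customers = [
    "0|Sophia PERKINS|1975-11-18|F|OK",
    "1|Brianna MURRAY|2001-11-02|F|MT",
    "2|James SCOTT|1997-09-17|M|UT",
]

# Book rows from the sample data, plus one hypothetical row (id 99).
books = [
    "0|5-54687-602-6|FOREIGN-LANGUAGE-STUDY|1989-09-29|Saraiva|83.99",
    "1|7-20527-497-2|PHILOSOPHY|1999-09-14|Kyowon|40.99",
    "4|8-98702-825-4|HUMOR|1990-07-15|China Publishing Group Corporate|64.99",
    "99|0-00000-000-0|HUMOR|2000-01-01|Hypothetical Press|9.99",  # hypothetical
]

# Scan: count the rows whose name field (index 1) matches, like a filter + count.
scan_count = sum(1 for line in customers if line.split("|")[1] == "James SCOTT")

# Aggregate: count books per category (index 2), then sort by count, descending.
counts = Counter(line.split("|")[2] for line in books)
top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:10]

print(scan_count)
print(top)
```

In Spark the same shape would typically be a filter followed by count, and a map to (category, 1) pairs followed by reduceByKey and a sort.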

Supported Platform

The pom file currently includes support for Spark 1.6 on CDH 5.8. It can easily be modified to run on Spark 2.x and on other versions of the Cloudera, Hortonworks, and Apache distributions.

Build

Use the standard Maven command mvn package to build.

Run

spark-submit --class com.kwartile.benchmark.spark.<RDDScan | RDDAggregate | RDDTwoWayJoin | RDDThreeWayJoin> --master yarn --executor-memory <mem> --executor-cores <num> --num-executors <num> --conf spark.yarn.executor.memoryOverhead=<mem_in_mb> perf-benchmark-1.0-SNAPSHOT-jar-with-dependencies.jar --input-path <hdfs location>
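
If you run the tests repeatedly with different settings, it can help to assemble the command programmatically. The following is a small sketch of a hypothetical helper, spark_submit_args, that fills in the placeholders of the command above. The helper name, the HDFS path, and the memory, core, and executor values are illustrative assumptions, not recommended settings.

```python
import shlex

JAR = "perf-benchmark-1.0-SNAPSHOT-jar-with-dependencies.jar"
PACKAGE = "com.kwartile.benchmark.spark"


def spark_submit_args(test, input_path, executor_memory,
                      executor_cores, num_executors, memory_overhead_mb):
    """Build the spark-submit argument list for one benchmark class."""
    return [
        "spark-submit",
        "--class", f"{PACKAGE}.{test}",
        "--master", "yarn",
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
        "--num-executors", str(num_executors),
        "--conf", f"spark.yarn.executor.memoryOverhead={memory_overhead_mb}",
        JAR,
        "--input-path", input_path,
    ]


# Hypothetical values; adjust them to your cluster.
args = spark_submit_args("RDDScan", "hdfs:///benchmark/data", "8g", 4, 10, 1024)
print(shlex.join(args))
```

The list form can be passed directly to subprocess.run, which avoids shell quoting issues.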

Hive and Impala Query

You can use the following equivalent Hive/Impala queries to compare the performance.

-- Scan Query
SELECT COUNT(*)
FROM customers256gb
WHERE name = 'Scarlett STEVENS';

-- Aggregation
SELECT category, count(*) cnt
FROM books256gb
GROUP BY category
ORDER BY cnt DESC LIMIT 10;

-- Two Way Join
SELECT tmp.book_category, ROUND(tmp.revenue, 2) AS revenue
FROM (
  SELECT books256gb.category AS book_category, SUM(books256gb.price * transactions256gb.quantity) AS revenue
  FROM books256gb
  JOIN transactions256gb ON (
    transactions256gb.book_id = books256gb.id
    AND YEAR(transactions256gb.transaction_date) BETWEEN 2008 AND 2010
  )
  GROUP BY books256gb.category
) tmp
ORDER BY revenue DESC LIMIT 10;

-- Three Way Join
SELECT tmp.book_category, ROUND(tmp.revenue, 2) AS revenue
FROM (
  SELECT books256gb.category AS book_category, SUM(books256gb.price * transactions256gb.quantity) AS revenue
  FROM books256gb
  JOIN transactions256gb ON (
    transactions256gb.book_id = books256gb.id
  )
  JOIN customers256gb ON (
    transactions256gb.customer_id = customers256gb.id
    AND customers256gb.state IN ('WA', 'CA', 'NY')
  )
  GROUP BY books256gb.category
) tmp
ORDER BY revenue DESC LIMIT 10;
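
For readers who want to check the join semantics on a few rows, the two-way join query can be approximated in plain Python as below. This is only a sketch of the query logic, not the Spark job: the book entries reuse ids, categories, and prices from the books sample, while the transaction tuples are hypothetical, because none of the sample transactions reference the sample book ids.

```python
from collections import defaultdict

# book id -> (category, price), taken from the books sample.
books = {
    1: ("PHILOSOPHY", 40.99),
    4: ("HUMOR", 64.99),
}

# Hypothetical transactions: (book_id, quantity, transaction_date).
transactions = [
    (1, 2, "2008-02-22 11:52:36"),
    (4, 1, "2009-07-01 09:00:00"),
    (1, 3, "2004-11-26 00:46:53"),  # outside 2008-2010, so it is dropped
    (4, 1, "2010-12-31 23:59:59"),
]

# Join on book id, keep years 2008 through 2010, sum price * quantity per category.
revenue = defaultdict(float)
for book_id, quantity, ts in transactions:
    year = int(ts[:4])
    if 2008 <= year <= 2010 and book_id in books:
        category, price = books[book_id]
        revenue[category] += price * quantity

# Order by revenue, descending, and keep at most ten rows.
top = sorted(
    ((category, round(total, 2)) for category, total in revenue.items()),
    key=lambda kv: kv[1],
    reverse=True,
)[:10]
print(top)
```

The three-way variant would add a second lookup from customer id to state and keep only rows whose state is one of the three listed in the query.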

Questions & Feedback

Please contact us at [email protected] with any questions or enhancement requests.
