feat: create benchmarks for merge (#1857)

# Description
Implements benchmarks similar to Delta Spark's merge benchmarks. This gives us a standard benchmark for measuring improvements to merge, and some pieces can be factored out to build a framework for benchmarking Delta workflows.
delta-io · Nov 20, 2023 · 2c8c0ec
1 parent 8a66343
commit 2c8c0ec

Showing 3 changed files with 748 additions and 0 deletions.
diff --git a/crates/benchmarks/Cargo.toml b/crates/benchmarks/Cargo.toml
@@ -0,0 +1,46 @@
+[package]
+name = "delta-benchmarks"
+version = "0.0.1"
+authors = ["David Blajda <[email protected]>"]
+homepage = "https://github.com/delta-io/delta.rs"
+license = "Apache-2.0"
+keywords = ["deltalake", "delta", "datalake"]
+description = "Delta-rs Benchmarks"
+edition = "2021"
+
+[dependencies]
+clap = { version = "4", features = [ "derive" ] }
+chrono = { version = "0.4.31", default-features = false, features = ["clock"] }
+tokio = { version = "1", features = ["fs", "macros", "rt", "io-util"] }
+env_logger = "0"
+
+# arrow
+arrow = { workspace = true }
+arrow-array = { workspace = true }
+arrow-buffer = { workspace = true }
+arrow-cast = { workspace = true }
+arrow-ord = { workspace = true }
+arrow-row = { workspace = true }
+arrow-schema = { workspace = true, features = ["serde"] }
+arrow-select = { workspace = true }
+parquet = { workspace = true, features = [
+    "async",
+    "object_store",
+] }
+
+# serde
+serde = { workspace = true, features = ["derive"] }
+serde_json = { workspace = true }
+
+# datafusion
+datafusion = { workspace = true }
+datafusion-expr = { workspace = true }
+datafusion-common = { workspace = true }
+datafusion-proto = { workspace = true }
+datafusion-sql = { workspace = true }
+datafusion-physical-expr = { workspace = true }
+
+[dependencies.deltalake-core]
+path = "../deltalake-core"
+version = "0"
+features = ["datafusion"]
diff --git a/crates/benchmarks/README.md b/crates/benchmarks/README.md
@@ -0,0 +1,55 @@
+# Merge
+The merge benchmarks are similar to the ones used by [Delta Spark](https://github.com/delta-io/delta/pull/1835).
+
+
+## Dataset
+
+Databricks maintains a public S3 bucket of the TPC-DS dataset at various scale factors. It is a requester-pays bucket, so requesters must pay to download the data. Below is an example of how to list the `web_returns` files for the 1 GB scale factor:
+
+```
+aws s3api list-objects --bucket devrel-delta-datasets --request-payer requester --prefix tpcds-2.13/tpcds_sf1_parquet/web_returns/
+```
+
+You can generate the TPC-DS dataset yourself by downloading and compiling [the generator](https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp).
+You may need to update the CFLAGS to include `-fcommon` to compile on newer versions of GCC.
+
+## Commands
+These commands can be executed from the root of the benchmark crate. Some commands depend on the TPC-DS dataset being present.
+
+### Convert
+Converts a TPC-DS `web_returns` CSV file into a Delta table.
+Assumes the dataset is pipe-delimited and that records do not have a trailing delimiter.
+
+```
+ cargo run --release --bin merge -- convert data/tpcds/web_returns.dat data/web_returns
+```
+
+### Standard
+Executes the standard merge benchmark suite.
+Results can be saved to a Delta table for further analysis.
+This table has the following schema:
+
+- `group_id`: Groups all tests executed as part of this call. Defaults to the timestamp of execution.
+- `name`: The name of the benchmark that was executed.
+- `sample`: The iteration number for a given benchmark name.
+- `duration_ms`: How long the benchmark took, in milliseconds.
+- `data`: A free-form field for packing any additional data.
+
+```
+ cargo run --release --bin merge -- standard data/web_returns 1 data/merge_results 
+```
+
+### Compare
+Compares the results of two different runs.
+Provide the Delta table path and the `group_id` of each run to obtain the speedup for each test case.
+
+```
+ cargo run --release --bin merge -- compare data/benchmarks/ 1698636172801 data/benchmarks/ 1699759539902
+```
+
+### Show
+Shows all benchmark results from a Delta table.
+
+```
+ cargo run --release --bin merge -- show data/benchmark
+```
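
The results schema and the per-test speedup described in the README's Standard and Compare sections can be sketched in plain Rust. This is only an illustration and is not taken from the commit: the field names come from the README, but the Rust types, the `BenchResult` struct, the `mean_duration_ms` and `speedups` helpers, and the definition of speedup as "mean duration before divided by mean duration after" are all assumptions. The actual `compare` command reads two Delta tables, whereas this sketch works on an in-memory slice of rows.

```rust
use std::collections::BTreeMap;

/// One row of the results table described in the README.
/// Field names follow the README; the Rust types are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct BenchResult {
    pub group_id: String,
    pub name: String,
    pub sample: u32,
    pub duration_ms: u64,
    pub data: Option<String>,
}

/// Mean duration in milliseconds per benchmark name for one `group_id`.
pub fn mean_duration_ms(rows: &[BenchResult], group_id: &str) -> BTreeMap<String, f64> {
    // name -> (sum of durations, number of samples)
    let mut totals: BTreeMap<String, (u64, u64)> = BTreeMap::new();
    for row in rows.iter().filter(|r| r.group_id == group_id) {
        let entry = totals.entry(row.name.clone()).or_insert((0, 0));
        entry.0 += row.duration_ms;
        entry.1 += 1;
    }
    totals
        .into_iter()
        .map(|(name, (sum, count))| (name, sum as f64 / count as f64))
        .collect()
}

/// Speedup per benchmark name, assumed here to be
/// mean(before) / mean(after). Values above 1.0 mean the later run was faster.
/// Benchmarks missing from either run, or with a zero mean in the later run,
/// are skipped.
pub fn speedups(rows: &[BenchResult], before: &str, after: &str) -> BTreeMap<String, f64> {
    let before_means = mean_duration_ms(rows, before);
    let after_means = mean_duration_ms(rows, after);
    before_means
        .into_iter()
        .filter_map(|(name, b)| {
            let a = *after_means.get(&name)?;
            if a > 0.0 {
                Some((name, b / a))
            } else {
                None
            }
        })
        .collect()
}
```

Calling `speedups(&rows, before_id, after_id)` with the two `group_id` values passed to `compare` would return one ratio per benchmark name present in both runs. Averaging over samples first keeps a single slow iteration from dominating the comparison.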