Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: create benchmarks for merge #1857

Merged
merged 19 commits into from
Nov 20, 2023
Merged
Show file tree
Hide file tree
Changes from 17 commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
46 changes: 46 additions & 0 deletions crates/benchmarks/Cargo.toml
Original file line number Diff line number Diff line change
@@ -0,0 +1,46 @@
[package]
name = "delta-benchmarks"
version = "0.0.1"
authors = ["Qingping Hou <[email protected]>"]
Blajda marked this conversation as resolved.
Show resolved Hide resolved
homepage = "https://github.com/delta-io/delta.rs"
license = "Apache-2.0"
keywords = ["deltalake", "delta", "datalake"]
description = "Delta-rs Benchmarks"
edition = "2021"

[dependencies]
clap = { version = "4", features = [ "derive" ] }
chrono = { version = "0.4.31", default-features = false, features = ["clock"] }
tokio = { version = "1", features = ["fs", "macros", "rt", "io-util"] }
env_logger = "0"

# arrow
arrow = { workspace = true }
arrow-array = { workspace = true }
arrow-buffer = { workspace = true }
arrow-cast = { workspace = true }
arrow-ord = { workspace = true }
arrow-row = { workspace = true }
arrow-schema = { workspace = true, features = ["serde"] }
arrow-select = { workspace = true }
parquet = { workspace = true, features = [
"async",
"object_store",
] }

# serde
serde = { workspace = true, features = ["derive"] }
serde_json = { workspace = true }

# datafusion
datafusion = { workspace = true }
datafusion-expr = { workspace = true }
datafusion-common = { workspace = true }
datafusion-proto = { workspace = true }
datafusion-sql = { workspace = true }
datafusion-physical-expr = { workspace = true }

[dependencies.deltalake-core]
path = "../deltalake-core"
version = "0"
features = ["datafusion"]
55 changes: 55 additions & 0 deletions crates/benchmarks/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,55 @@
# Merge
The merge benchmarks are similar to the ones used by [Delta Spark](https://github.com/delta-io/delta/pull/1835).


## Dataset

Databricks maintains a public S3 bucket of the TPC-DS dataset with various factor where requesters must pay to download this dataset. Below is an example of how to list the 1gb scale factor

```
aws s3api list-objects --bucket devrel-delta-datasets --request-payer requester --prefix tpcds-2.13/tpcds_sf1_parquet/web_returns/
```

You can generate the TPC-DS dataset yourself by downloading and compiling [the generator](https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp)
You may need to update the CFLAGS to include `-fcommon` to compile on newer versions of GCC.

## Commands
These commands can be executed from the root of the benchmark crate. Some commands depend on the existance of the TPC-DS Dataset existing.

### Convert
Converts a TPC-DS web_returns csv into a Delta table
Assumes the dataset is pipe delimited and records do not have a trailing delimiter

```
cargo run --release --bin merge -- convert data/tpcds/web_returns.dat data/web_returns
```

### Standard
Execute the standard merge bench suite.
Results can be saved to a delta table for further analysis.
This table has the following schema:

group_id: Used to group all tests that executed as a part of this call. Default value is the timestamp of execution
name: The benchmark name that was executed
sample: The iteration number for a given benchmark name
duration_ms: How long the benchmark took in ms
data: Free field to pack any additonal data

```
cargo run --release --bin merge -- standard data/web_returns 1 data/merge_results
```

### Compare
Compare the results of two different runs.
The a Delta table paths and the `group_id` of each run and obtain the speedup for each test case

```
cargo run --release --bin merge -- compare data/benchmarks/ 1698636172801 data/benchmarks/ 1699759539902
```

### Show
Show all benchmarks results from a delta table

```
cargo run --release --bin merge -- show data/benchmark
```
Loading
Loading