Add parquet SQL benchmarks #1738

tustvold · 2022-02-03T15:42:19Z

Which issue does this PR close?

Closes #TBD.

Rationale for this change

Benchmarks good, more benchmarks more good 😄

What changes are included in this PR?

This adds a benchmark that optionally generates a large-ish parquet file, or uses a file specified by an environment variable, and then runs through a list of queries against this file.

My hope is that this will supplement the TPCH benchmark with one that is perhaps easier for people to set up and run, and that can be more easily adapted to test different data shapes and queries.
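As a rough illustration of the "list of queries" idea, here is a minimal std-only sketch of turning the text of a SQL file into individual queries. The `split_queries` helper, the `;` separator, the `--` comment handling, and the table name `t` are assumptions made for this example and are not taken from the PR; only the column names come from the benchmark output in this conversation.

```rust
/// Split the text of a SQL file into individual queries.
/// Assumption: queries are separated by `;` and a line starting with `--`
/// is a comment. A `;` inside a string literal would be split incorrectly.
fn split_queries(sql: &str) -> Vec<String> {
    // Drop whole-line comments first.
    let without_comments: String = sql
        .lines()
        .filter(|line| !line.trim_start().starts_with("--"))
        .collect::<Vec<_>>()
        .join("\n");

    without_comments
        .split(';')
        // Collapse all runs of whitespace so each query is a single line.
        .map(|q| q.split_whitespace().collect::<Vec<_>>().join(" "))
        .filter(|q| !q.is_empty())
        .collect()
}

fn main() {
    // Hypothetical file contents; `t` is a placeholder table name.
    let sql = "
-- aggregate grouped by a dictionary column
select dict_10_required, MIN(f64_optional)
from t group by dict_10_required;

select count(*) from t where f64_optional > 0;
";
    for (i, query) in split_queries(sql).iter().enumerate() {
        println!("query {}: {}", i, query);
    }
}
```

Keeping the queries in a plain text file means a new data shape or query can be tried without touching the benchmark harness itself.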

In particular as currently configured this will test:

Dictionary arrays
Nullable arrays
Large-ish parquet files (~200 MB)
Basic table scans with filters and aggregates
...Suggestions welcome 😄

It could theoretically be extended to incorporate joins. However, as I don't currently have a real-world use case that produces these, I'd rather leave it to someone with such a workload to model a representative benchmark.

Unfortunately the generation portion needs apache/arrow-rs#1214, but arrow 9 should be out soon and will contain it. I will keep this as a draft until then.

Are there any user-facing changes?

No

alamb · 2022-02-13T13:23:33Z

Once #1775 merges, we can probably clean up this PR and get it merged

tustvold · 2022-02-15T10:12:24Z

There are definitely tweaks that would be cool to make to this, e.g. testing different column encodings, but I think this is a decent starting point and is now ready for review

tustvold · 2022-02-15T10:17:42Z

dev/release/rat_exclude_files.txt

@@ -116,6 +116,7 @@ ci/*
 **/*.svg
 **/*.csv
 **/*.json
+**/*.sql


I think this is the correct thing to do, but someone should probably verify if RAT is needed for SQL files

This is fine in my opinion

alamb

I also ran this locally 👌 very nice:

cargo bench --bench parquet_query_sql
...
Generating parquet file - /var/folders/s3/h5hgj43j0bv83shtmz_t_w400000gn/T/parquet_query_sqlr7Ymzm.parquet
Generated parquet file in 6.7890725 seconds
Using parquet file /var/folders/s3/h5hgj43j0bv83shtmz_t_w400000gn/T/parquet_query_sqlr7Ymzm.parquet


...
ng select dict_10_required, dict_100_required, MIN(f64_optional), MAX(f64_optional), AVG(f64_optional) ...: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 12.9s, or reduce sample count to 30.
Benchmarking select dict_10_required, dict_100_required, MIN(f64_optional), MAX(f64_optional), AVG(f64_optional) ...: Collecting 100 samples in estima
select dict_10_required, dict_100_required, MIN(f64_optional), MAX(f64_optional), AVG(f64_optional) ...
                        time:   [128.35 ms 128.65 ms 128.98 ms]
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

@Igosuki this might be a cool thing to run on the arrow2 branch to see how the performance compares

alamb · 2022-02-15T16:37:57Z

datafusion/benches/parquet_query_sql.rs

+}
+
+fn criterion_benchmark(c: &mut Criterion) {
+    let (file_path, temp_file) = match std::env::var("PARQUET_FILE") {


This is a neat feature (being able to override the file being tested using the PARQUET_FILE environment variable).

I wonder if it would be possible to add a note about this in https://github.com/apache/arrow-datafusion/blob/master/DEVELOPERS.md somewhere? Perhaps in a "how to run benchmarks" section?
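For readers who have not opened the diff, the override can be pictured roughly as follows. This is a std-only sketch, not the code from the PR: the `BenchInput` enum, the `resolve_input` helper, the placeholder file name, and the choice to treat an empty value as unset are all assumptions for illustration. Only the `PARQUET_FILE` variable name and the two log messages come from this conversation.

```rust
use std::env;
use std::path::PathBuf;

/// Where the benchmark reads its data from (hypothetical type).
#[derive(Debug, PartialEq)]
enum BenchInput {
    /// Path supplied by the user; the benchmark should never delete it.
    Provided(PathBuf),
    /// No override given; a file has to be generated at this location.
    Generate(PathBuf),
}

/// Decide which file to use from the (optional) override value.
/// Assumption: an empty value is treated the same as an unset variable.
fn resolve_input(override_path: Option<String>) -> BenchInput {
    match override_path {
        Some(p) if !p.is_empty() => BenchInput::Provided(PathBuf::from(p)),
        // Placeholder name; the real benchmark uses a uniquely named temp file.
        _ => BenchInput::Generate(env::temp_dir().join("parquet_query_sql_example.parquet")),
    }
}

fn main() {
    let input = resolve_input(env::var("PARQUET_FILE").ok());
    match &input {
        BenchInput::Provided(p) => println!("Using parquet file {}", p.display()),
        BenchInput::Generate(p) => println!("Generating parquet file - {}", p.display()),
    }
}
```

With something like this in place, running `PARQUET_FILE=/path/to/data.parquet cargo bench --bench parquet_query_sql` would skip generation and benchmark the supplied file.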

alamb · 2022-02-15T16:38:58Z

datafusion/benches/parquet_query_sql.rs

+    }
+
+    // Clean up temporary file if any
+    std::mem::drop(temp_file);


Why do we need to drop the temp file explicitly? Won't it automatically happen when the variable goes out of scope?

It was intended as a hint that the lifetime of temp_file matters, i.e. it must live to the end of the benchmark block. In the past I've accidentally refactored tests that use NamedTempFile and they have broken in odd ways that boiled down to the temporary file getting cleaned up too early.

I'll clarify the comment
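The failure mode described above can be reproduced with a small std-only sketch. `TempGuard` below is a hypothetical stand-in for `NamedTempFile` written for this example (it is not the `tempfile` crate's type); like the real one, it deletes its file when dropped.

```rust
use std::fs;
use std::io;
use std::path::PathBuf;

/// Hypothetical stand-in for `NamedTempFile`: removes its file on drop.
struct TempGuard {
    path: PathBuf,
}

impl TempGuard {
    fn create(name: &str) -> io::Result<TempGuard> {
        // Prefix with the process id to avoid clashing with other runs.
        let path = std::env::temp_dir().join(format!("{}-{}", std::process::id(), name));
        fs::write(&path, b"placeholder")?;
        Ok(TempGuard { path })
    }
}

impl Drop for TempGuard {
    fn drop(&mut self) {
        // Best-effort cleanup; a failure to delete is ignored.
        let _ = fs::remove_file(&self.path);
    }
}

fn main() -> io::Result<()> {
    // Pitfall: only the path is kept, so the guard is a temporary that is
    // dropped at the end of this statement and the file is deleted with it.
    let early = TempGuard::create("early.parquet")?.path.clone();
    println!("early file exists: {}", early.exists());

    // Holding the guard in a named binding keeps the file alive.
    let guard = TempGuard::create("kept.parquet")?;
    println!("kept file exists: {}", guard.path.exists());

    // An explicit drop documents where the file is allowed to disappear.
    drop(guard);
    Ok(())
}
```

The explicit `drop` at the end is redundant for the compiler but tells the next person refactoring the function that the guard has to outlive everything that reads the file.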

alamb · 2022-02-15T18:25:07Z

Looks like there is also a clippy complaint here

Igosuki · 2022-02-15T18:41:41Z

@Igosuki this might be a cool thing to run on the arrow2 branch to see how the performance compares

I will rebase once it is merged

alamb · 2022-02-15T20:38:36Z

I will sort out the clippy complaint

Add parquet SQL benchmarks

2f2de37

github-actions bot added the datafusion Changes in the datafusion crate label Feb 3, 2022

tustvold added 2 commits February 15, 2022 09:46

Merge remote-tracking branch 'upstream/master' into parquet-benchmarks

b167e26

Restrict benchmark value ranges

d505e87

tustvold marked this pull request as ready for review February 15, 2022 10:11

Add RAT exclude

c66e8c9

tustvold commented Feb 15, 2022

tustvold mentioned this pull request Feb 15, 2022

Async ParquetExec #1617

Closed

alamb added the development-process Related to development process of DataFusion label Feb 15, 2022

alamb approved these changes Feb 15, 2022

Merge remote-tracking branch 'apache/master' into parquet-benchmarks

2ce4677

appease clippy

5b7f635

alamb merged commit 217fa99 into apache:master Feb 15, 2022

tustvold mentioned this pull request Feb 15, 2022

Add benchmarks section to DEVELOPERS.md #1838

Merged

tustvold mentioned this pull request Mar 5, 2022

ARROW2: Performance benchmark #1652

Closed
