Add parquet SQL benchmarks #1738

Merged
merged 6 commits into from
Feb 15, 2022

Conversation

tustvold
Contributor

@tustvold tustvold commented Feb 3, 2022

Which issue does this PR close?

Closes #TBD.

Rationale for this change

Benchmarks good, more benchmarks more good 😄

What changes are included in this PR?

This adds a benchmark that optionally generates a large-ish parquet file, or uses a file specified by an environment variable, and then runs through a list of queries against this file.
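As a rough illustration of the "list of queries" part, here is a minimal sketch of turning a SQL script into individual statements. The file format is an assumption (statements separated by `;`, whole-line comments starting with `--`), and `parse_queries` and the table name `t` are hypothetical names, not taken from the PR.

```rust
/// Split a SQL script into individual statements.
/// Assumed format (not confirmed by the PR): statements are separated by `;`
/// and whole-line comments start with `--`.
fn parse_queries(script: &str) -> Vec<String> {
    script
        .split(';')
        .map(|stmt| {
            stmt.lines()
                .map(str::trim)
                // Drop blank lines and whole-line comments.
                .filter(|line| !line.is_empty() && !line.starts_with("--"))
                .collect::<Vec<_>>()
                .join(" ")
        })
        // A trailing `;` leaves an empty final chunk; skip it.
        .filter(|stmt| !stmt.is_empty())
        .collect()
}

fn main() {
    // Column names follow the benchmark output in this thread; `t` is a placeholder table.
    let script = "-- aggregate over a dictionary column\n\
                  select dict_10_required, count(*) from t group by dict_10_required;\n\
                  \n\
                  select max(f64_optional)\n  from t;\n";
    for (i, query) in parse_queries(script).iter().enumerate() {
        // A real benchmark would hand each query to the query engine here.
        println!("query {}: {}", i, query);
    }
}
```

Splitting on `;` is deliberately naive: a semicolon inside a string literal would break a statement in two, which is acceptable only if the query file avoids that.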

My hope is that this will supplement the TPCH benchmark with one that is perhaps easier for people to set up and run, and that can be more easily adapted to test different data shapes and queries.

In particular as currently configured this will test:

  • Dictionary arrays
  • Nullable arrays
  • Large-ish parquet files (~200 MB)
  • Basic table scans with filters and aggregates
  • ...Suggestions welcome 😄

It could theoretically be extended to incorporate joins, however, as I don't currently have a real-world use-case that produces these, I'd rather leave this to someone with such a workload to model a representative benchmark for.

Unfortunately the generation portion needs apache/arrow-rs#1214 but arrow 9 should be out soon which will contain this. Will keep this as a draft until then.

Are there any user-facing changes?

No

@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Feb 3, 2022
@alamb
Contributor

alamb commented Feb 13, 2022

Once #1775 merges, we can probably clean up this PR and get it merged

@tustvold tustvold marked this pull request as ready for review February 15, 2022 10:11
@tustvold
Contributor Author

tustvold commented Feb 15, 2022

There are definitely tweaks that would be cool to make to this, e.g. testing different column encodings, but I think this is a decent starting point and is now ready for review

@@ -116,6 +116,7 @@ ci/*
**/*.svg
**/*.csv
**/*.json
**/*.sql
Contributor Author

I think this is the correct thing to do, but someone should probably verify if RAT is needed for SQL files

Contributor

This is fine in my opinion

@tustvold tustvold mentioned this pull request Feb 15, 2022
@alamb alamb added the development-process Related to development process of DataFusion label Feb 15, 2022
Contributor

@alamb alamb left a comment

I also ran this locally 👌 very nice:

cargo bench --bench parquet_query_sql
...
Generating parquet file - /var/folders/s3/h5hgj43j0bv83shtmz_t_w400000gn/T/parquet_query_sqlr7Ymzm.parquet
Generated parquet file in 6.7890725 seconds
Using parquet file /var/folders/s3/h5hgj43j0bv83shtmz_t_w400000gn/T/parquet_query_sqlr7Ymzm.parquet


...
Benchmarking select dict_10_required, dict_100_required, MIN(f64_optional), MAX(f64_optional), AVG(f64_optional) ...: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 12.9s, or reduce sample count to 30.
Benchmarking select dict_10_required, dict_100_required, MIN(f64_optional), MAX(f64_optional), AVG(f64_optional) ...: Collecting 100 samples in estima
select dict_10_required, dict_100_required, MIN(f64_optional), MAX(f64_optional), AVG(f64_optional) ...
                        time:   [128.35 ms 128.65 ms 128.98 ms]
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

@Igosuki this might be a cool thing to run on the arrow2 branch to see how the performance compares
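
The numbers in the Criterion warning are consistent with simple arithmetic on the measured time. The sketch below assumes each sample runs the query once (an assumption about the sampling mode, not something stated in the log); the helper names are hypothetical.

```rust
/// Total wall time needed to collect `samples` samples at one iteration each.
fn seconds_needed(mean_secs: f64, samples: u32) -> f64 {
    mean_secs * f64::from(samples)
}

/// How many single-iteration samples fit into a time budget.
fn samples_that_fit(mean_secs: f64, budget_secs: f64) -> u32 {
    (budget_secs / mean_secs).floor() as u32
}

fn main() {
    // 128.65 ms is the middle estimate reported in the log above.
    let mean_secs = 0.12865;
    println!("time for 100 samples: {:.1} s", seconds_needed(mean_secs, 100));
    println!("samples fitting in 5.0 s: {}", samples_that_fit(mean_secs, 5.0));
}
```

Under that assumption, 100 samples need roughly 12.9 s, which matches the suggested target time, and fewer than 40 samples fit into the default 5.0 s, so the suggested sample count of 30 stays within the budget.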

}

fn criterion_benchmark(c: &mut Criterion) {
    let (file_path, temp_file) = match std::env::var("PARQUET_FILE") {
Contributor

This is a neat feature (being able to override the file being tested using the PARQUET_FILE environment variable).

I wonder if it would be possible to add a note about this in https://github.com/apache/arrow-datafusion/blob/master/DEVELOPERS.md somewhere? Perhaps "how to run benchmarks" section?
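
For readers unfamiliar with the pattern, here is a minimal sketch of the override decision using only the standard library. Only the variable name `PARQUET_FILE` comes from the PR; the `Input` enum, `resolve_input`, and the choice to treat an empty value as unset are assumptions made for illustration.

```rust
use std::env;
use std::path::PathBuf;

/// Where the benchmark input comes from (hypothetical type).
#[derive(Debug, PartialEq)]
enum Input {
    /// Path supplied by the user; the benchmark should not delete it.
    Provided(PathBuf),
    /// No override given; the caller should generate a temporary file.
    Generate,
}

/// Decide the input from the optional value of the environment variable.
/// Taking the value as a parameter keeps the decision easy to test.
fn resolve_input(env_value: Option<String>) -> Input {
    match env_value {
        // Assumption: an empty value is treated the same as an unset variable.
        Some(path) if !path.is_empty() => Input::Provided(PathBuf::from(path)),
        _ => Input::Generate,
    }
}

fn main() {
    let input = resolve_input(env::var("PARQUET_FILE").ok());
    match input {
        Input::Provided(path) => println!("Using parquet file {}", path.display()),
        Input::Generate => println!("PARQUET_FILE not set, would generate a temporary file"),
    }
}
```

With logic along these lines, a run against an existing file would look like setting `PARQUET_FILE` to its path in the shell before invoking `cargo bench --bench parquet_query_sql`.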

}

// Clean up temporary file if any
std::mem::drop(temp_file);
Contributor

Why do we need to drop the temp file explicitly? Won't it automatically happen when the variable goes out of scope?

Contributor Author

@tustvold tustvold Feb 15, 2022

It was intended as a hint that the lifetime of temp_file matters, i.e. it must live to the end of the benchmark block. In the past I've accidentally refactored tests that use NamedTempFile and they have broken in odd ways that boiled down to the temporary file getting cleaned up too early.

I'll clarify the comment
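
The failure mode described here can be reproduced with a small standard-library stand-in for a self-deleting temporary file. `TempGuard` is a hypothetical type written for this sketch, not the `NamedTempFile` API itself; it only mimics the delete-on-drop behaviour.

```rust
use std::fs;
use std::path::PathBuf;

/// Minimal stand-in for a self-deleting temp file (hypothetical type).
struct TempGuard {
    path: PathBuf,
}

impl TempGuard {
    fn create(name: &str) -> std::io::Result<Self> {
        let path = std::env::temp_dir().join(name);
        fs::write(&path, b"placeholder")?;
        Ok(TempGuard { path })
    }
}

impl Drop for TempGuard {
    fn drop(&mut self) {
        // Best-effort cleanup; a failure to delete is ignored.
        let _ = fs::remove_file(&self.path);
    }
}

fn main() -> std::io::Result<()> {
    // Include the process id so concurrent runs do not collide.
    let name = format!("temp_guard_demo_{}.bin", std::process::id());
    let guard = TempGuard::create(&name)?;
    let path = guard.path.clone();

    // While the guard is alive, the file is available to readers.
    assert!(path.exists());

    // Dropping the guard deletes the file; any later read of `path` would fail.
    drop(guard);
    assert!(!path.exists());
    Ok(())
}
```

The explicit `drop` at the end of a benchmark therefore acts as documentation: it pins the guard's lifetime past the last use of the path. A refactor that binds the guard with `let _ = ...` instead of a named variable would drop it immediately and delete the file before the benchmark reads it.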

@alamb
Contributor

alamb commented Feb 15, 2022

Looks like there is also a clippy complaint here

@Igosuki
Contributor

Igosuki commented Feb 15, 2022

@Igosuki this might be a cool thing to run on the arrow2 branch to see how the performance compares

I will rebase once it is merged

@alamb
Contributor

alamb commented Feb 15, 2022

I will sort out the clippy complaint

@alamb alamb merged commit 217fa99 into apache:master Feb 15, 2022
Labels
datafusion Changes in the datafusion crate development-process Related to development process of DataFusion
3 participants