Add parquet SQL benchmarks #1738
Conversation
Once #1775 merges, we can probably clean up this PR and get it merged.
There are definitely tweaks that would be cool to make to this, e.g. testing different column encodings, but I think this is a decent starting point and is now ready for review.
@@ -116,6 +116,7 @@ ci/*
 **/*.svg
 **/*.csv
 **/*.json
+**/*.sql
I think this is the correct thing to do, but someone should probably verify whether the RAT license check needs to cover SQL files.
This is fine in my opinion
I also ran this locally 👌 very nice:
cargo bench --bench parquet_query_sql
...
Generating parquet file - /var/folders/s3/h5hgj43j0bv83shtmz_t_w400000gn/T/parquet_query_sqlr7Ymzm.parquet
Generated parquet file in 6.7890725 seconds
Using parquet file /var/folders/s3/h5hgj43j0bv83shtmz_t_w400000gn/T/parquet_query_sqlr7Ymzm.parquet
...
Benchmarking select dict_10_required, dict_100_required, MIN(f64_optional), MAX(f64_optional), AVG(f64_optional) ...: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 12.9s, or reduce sample count to 30.
Benchmarking select dict_10_required, dict_100_required, MIN(f64_optional), MAX(f64_optional), AVG(f64_optional) ...: Collecting 100 samples
select dict_10_required, dict_100_required, MIN(f64_optional), MAX(f64_optional), AVG(f64_optional) ...
                        time:   [128.35 ms 128.65 ms 128.98 ms]
Found 5 outliers among 100 measurements (5.00%)
3 (3.00%) high mild
2 (2.00%) high severe
@Igosuki this might be a cool thing to run on the arrow2 branch to see how the performance compares
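The Criterion warning in the run above can also be silenced in code; a minimal sketch, assuming the benchmark otherwise keeps Criterion's defaults (the function name is illustrative):

use std::time::Duration;
use criterion::Criterion;

// Sketch only: both knobs are shown, but either one alone would satisfy
// the warning (raise the target time past 12.9 s, or cut the samples).
fn custom_config() -> Criterion {
    Criterion::default()
        .measurement_time(Duration::from_secs(13))
        .sample_size(30)
}

This could then be wired in through criterion_group!'s config parameter.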
}

fn criterion_benchmark(c: &mut Criterion) {
    let (file_path, temp_file) = match std::env::var("PARQUET_FILE") {
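For context, a sketch of how a match like this can resolve to a path plus an optional temp-file guard; everything beyond the visible lines is an assumption, not the PR's actual code:

use std::env;
use tempfile::NamedTempFile;

// Sketch: use PARQUET_FILE if set, otherwise create a temporary file
// and return its guard so the file lives for the whole benchmark run.
fn resolve_file() -> (String, Option<NamedTempFile>) {
    match env::var("PARQUET_FILE") {
        Ok(path) => (path, None),
        Err(_) => {
            let file = NamedTempFile::new().expect("create temp file");
            let path = file.path().to_string_lossy().into_owned();
            // ... parquet data generation into `path` elided ...
            (path, Some(file))
        }
    }
}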
This is a neat feature (being able to override the file being tested using the PARQUET_FILE environment variable).
I wonder if it would be possible to add a note about this in https://github.com/apache/arrow-datafusion/blob/master/DEVELOPERS.md somewhere? Perhaps a "how to run benchmarks" section?
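If such a note is added, a hypothetical invocation to document might be (the path is illustrative; the benchmark name is from this PR):

PARQUET_FILE=/path/to/existing.parquet cargo bench --bench parquet_query_sql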
}

// Clean up temporary file if any
std::mem::drop(temp_file);
Why do we need to drop the temp file explicitly? Won't it automatically happen when the variable goes out of scope?
It was intended as a hint that the lifetime of temp_file matters, i.e. it must live to the end of the benchmark block. In the past I've accidentally refactored tests with NamedTempFile, and it's broken in odd ways that boiled down to the temporary file getting cleaned up too early.
I'll clarify the comment.
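For illustration, a minimal sketch of that pitfall (not code from this PR): tempfile deletes the file when the NamedTempFile guard is dropped, so binding only the path loses the file immediately.

use tempfile::NamedTempFile;

fn main() -> std::io::Result<()> {
    // Broken: the guard is a temporary that is dropped at the end of
    // this statement, so the file is already deleted here.
    let path = NamedTempFile::new()?.path().to_path_buf();
    assert!(!path.exists());

    // Correct: keep the guard bound until the work is finished.
    let file = NamedTempFile::new()?;
    std::fs::write(file.path(), b"data")?;
    assert_eq!(std::fs::read(file.path())?, b"data");
    drop(file); // the temporary file is removed here
    Ok(())
}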
Looks like there is also a clippy complaint here.
I will rebase once it is merged.
I will sort out the clippy complaint.
Which issue does this PR close?
Closes #TBD.
Rationale for this change
Benchmarks good, more benchmarks more good 😄
What changes are included in this PR?
This adds a benchmark that optionally generates a large-ish parquet file, or uses a file specified by an environment variable, and then runs through a list of queries against this file.
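As a rough sketch of the shape of such a benchmark, using DataFusion's current public API rather than this PR's exact code (run_queries, the table name t, and the query list are assumptions):

use datafusion::prelude::*;

// Sketch: register the parquet file, then execute each query and
// collect the results so the full plan actually runs.
async fn run_queries(path: &str, queries: &[&str]) -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    ctx.register_parquet("t", path, ParquetReadOptions::default()).await?;
    for sql in queries {
        ctx.sql(sql).await?.collect().await?;
    }
    Ok(())
}

Each query's wall-clock time would then be measured inside a Criterion closure.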
My hope is that this will supplement the TPC-H benchmark with one that is perhaps easier for people to set up and run, and that can be more easily adapted to test different data shapes and queries.
In particular, as currently configured, this will test queries like those in the sample output above: aggregations over a mix of dictionary-encoded and primitive columns, with both required and optional (nullable) variants.
It could theoretically be extended to incorporate joins; however, as I don't currently have a real-world use case that produces them, I'd rather leave that to someone with such a workload to model a representative benchmark.
Unfortunately, the generation portion needs apache/arrow-rs#1214, but arrow 9, which will contain it, should be out soon. I'll keep this as a draft until then.
Are there any user-facing changes?
No