Add -o option to all e2e benches #5658
39db9cc to 7b3c07e
let elapsed = start.elapsed().as_millis();
let elapsed = start.elapsed().as_secs_f64() * 1000.0;
let numrows = batches.iter().map(|b| b.num_rows()).sum::<usize>();
Thanks @jaylmiller for looking into this.
I noticed that in other test cases you calculate numrows before elapsed, perhaps to keep the numrows computation from being part of the benchmark runtime.
Yes! Good catch, thank you... that was a mistake on my part.
Another thing I'm wondering is whether calculating num rows could trigger some system cache and make the benchmark run faster, although that would be unexpected.
I think num_rows is pretty fast (it doesn't actually do any work, it just returns a field's value): https://docs.rs/arrow-array/35.0.0/src/arrow_array/record_batch.rs.html#278
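As a rough illustration of the ordering discussed in this thread, the sketch below stops the clock before doing any post-processing such as counting rows. It is not code from the PR: `time_it`, `total_rows`, and `elapsed_ms` are hypothetical helpers, and plain `Vec<i64>` values stand in for Arrow record batches.

```rust
use std::time::{Duration, Instant};

/// Runs `work` and returns its result together with the wall-clock time it took.
/// The clock is stopped before the caller does any post-processing.
/// (Hypothetical helper, not part of the PR.)
fn time_it<T>(work: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let result = work();
    let elapsed = start.elapsed(); // stop timing first
    (result, elapsed)
}

/// Stand-in for summing `num_rows()` over record batches.
/// Assumption: each inner Vec plays the role of one batch.
fn total_rows(batches: &[Vec<i64>]) -> usize {
    batches.iter().map(|b| b.len()).sum::<usize>()
}

/// Converts a Duration to fractional milliseconds, as in the diff above.
fn elapsed_ms(d: Duration) -> f64 {
    d.as_secs_f64() * 1000.0
}
```

With this shape, `let (batches, elapsed) = time_it(|| run_query());` followed by `total_rows(&batches)` keeps the row count out of the measured interval, where `run_query` is whatever closure produces the batches.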
Looks like a definite improvement to me 🚀 -- thank you @jaylmiller
I had some suggestions about improving code ergonomics but I don't think they are required to merge this PR if you would prefer not to do them.
disjunction([
    ("Selective-ish filter", col("request_method").eq(lit("GET"))),
    (
        "Non-selective filter",
this is nice to add the details into the output file
benchmarks/src/lib.rs
Outdated
/// A single iteration of a benchmark query
#[derive(Debug, Serialize)]
struct QueryIter {
    elapsed: f64,
Can you please add some documentation about what unit this is in (I think it is milliseconds?)
Relatedly, I wonder if we could make this API easier to use by storing a Duration (https://doc.rust-lang.org/std/time/struct.Duration.html), calculated with SystemTime::now() - start
I've changed elapsed to be a Duration object and am using a custom serializer to make it appear as unix secs in the output json
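A minimal sketch of the idea in this thread: keep `elapsed` as a `Duration` in the struct and pick the numeric unit only when writing the output. The PR uses a serde custom serializer; to stay dependency-free this sketch substitutes a hand-written `to_json` method. The `row_count` field and the `elapsed_ms` key are assumptions made for illustration, not the PR's actual schema.

```rust
use std::time::Duration;

/// Illustrative copy of a per-iteration record; not the PR's actual type.
#[derive(Debug)]
struct QueryIter {
    /// Wall-clock time for this iteration. The unit is chosen at output time.
    elapsed: Duration,
    /// Hypothetical extra field, included only to make the output less trivial.
    row_count: usize,
}

impl QueryIter {
    /// Hand-rolled JSON rendering that emits the duration as milliseconds.
    /// Assumption: this stands in for the serde custom serializer used in the PR.
    fn to_json(&self) -> String {
        format!(
            "{{\"elapsed_ms\":{},\"row_count\":{}}}",
            self.elapsed.as_secs_f64() * 1000.0,
            self.row_count
        )
    }
}
```

Storing the `Duration` means callers never have to guess the unit of a bare `f64`; the conversion lives in one place.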
This looks really great -- thank you @jaylmiller
println!(
    "h2o groupby query {} took {} ms",
    opt.query,
    elapsed.as_secs_f64() * 1000.0
👍
Which issue does this PR close?
Part of #5561
Rationale for this change
For e2e benchmarks, the TPCH bin has an option to output a machine readable file, which can then be consumed by the script from PR #5655. It would be nice to be able to re-use this script for all bins in the e2e benches.
What changes are included in this PR?
This PR pulls out the existing logic from tpch.rs that (optionally) writes the run data to a machine readable json file. That logic is then used in all the other benchmarks, adding a -o option to every bin in the e2e benchmarks dir.
Are these changes tested?
Are there any user-facing changes?
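For readers unfamiliar with the pattern, the optional `-o` output described above could be sketched roughly as follows. This is a dependency-free illustration, not the PR's code: the real bins presumably use an argument-parsing crate, and `output_path` and `maybe_write_json` are hypothetical names.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Returns the path that follows a `-o` flag, if one was given.
/// (Hand-rolled parsing for illustration only.)
fn output_path(args: &[String]) -> Option<PathBuf> {
    args.iter()
        .position(|a| a == "-o")
        .and_then(|i| args.get(i + 1))
        .map(PathBuf::from)
}

/// Writes already-serialized run data only when an output path was requested.
/// Returns Ok(true) if a file was written and Ok(false) if the step was skipped.
fn maybe_write_json(path: Option<&Path>, json: &str) -> io::Result<bool> {
    match path {
        Some(p) => {
            fs::write(p, json)?;
            Ok(true)
        }
        None => Ok(false),
    }
}
```

Because the write step is a no-op without `-o`, every bin can share the same helper and existing invocations keep their current behavior.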