Set up repo where we can push benchmark results #473

Closed
andygrove opened this issue Oct 28, 2022 · 6 comments
Labels
enhancement New feature or request

Comments


andygrove commented Oct 28, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We do not have a formal way of tracking benchmark performance over time. I have been running benchmarks occasionally and sharing some spreadsheets of results, but this is not ideal.

Describe the solution you'd like
Create a new GitHub repo datafusion-contrib/benchmark-automation and define a structure so that anyone can create a PR to submit results from a benchmark run for DataFusion or Ballista.

Later, we can add scripts to produce charts and look for regressions.
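
One possible layout for submitted results (purely a sketch; the directory and file names below are hypothetical and not yet decided):

    benchmark-automation/
    ├── results/
    │   ├── datafusion/
    │   │   └── <date>_<version>_<benchmark>.csv
    │   └── ballista/
    │       └── <date>_<version>_<benchmark>.csv
    ├── scripts/     # automation, charting, regression detection
    └── README.md    # how to submit results via PR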

Describe alternatives you've considered
Keep doing this in an ad-hoc way.

Additional context
None

andygrove added the enhancement (New feature or request) label Oct 28, 2022
@andygrove
Member Author

@Dandandan @isidentical What do you think? This could apply to DataFusion as well.

@isidentical

I think this might have an interesting use. Each user would have to run at least two revisions when submitting, since there is no common baseline (everyone's machines and runtime conditions are different). For example, a script could automatically run the last released version as the baseline and the latest commit from the master branch as the target revision (and maybe we could add options to also compare against older revisions for a more holistic view: v12 vs v13 vs HEAD). After a certain number of samples we should at least start seeing the performance trend (and, most importantly, detect obvious regressions), which I think would be really cool 💯
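
A rough sketch of what such a comparison script might look like (hypothetical; the benchmark binary name, CLI flags, and revision names below are placeholders, not existing tooling):

    // Hypothetical baseline-vs-target runner: every submission carries its own
    // baseline, so results from different machines remain comparable as ratios.
    use std::process::Command;
    use std::time::Instant;

    fn run_benchmark_at(revision: &str) -> std::io::Result<f64> {
        // Check out the requested revision (assumes a clean working tree).
        let status = Command::new("git").args(["checkout", revision]).status()?;
        assert!(status.success(), "git checkout {revision} failed");

        // Build the benchmark binary in release mode (binary name is a placeholder).
        let status = Command::new("cargo")
            .args(["build", "--release", "--bin", "tpch"])
            .status()?;
        assert!(status.success(), "cargo build failed");

        // Time a single run; a real script would run the whole suite several
        // times and record per-query results rather than one wall-clock number.
        let start = Instant::now();
        let status = Command::new("./target/release/tpch")
            .args(["benchmark", "--query", "1"])
            .status()?;
        assert!(status.success(), "benchmark run failed");
        Ok(start.elapsed().as_secs_f64())
    }

    fn main() -> std::io::Result<()> {
        let baseline = run_benchmark_at("13.0.0")?; // last released version (placeholder tag)
        let target = run_benchmark_at("master")?;   // latest commit
        println!("baseline: {baseline:.2}s, target: {target:.2}s");
        println!("change: {:+.1}%", (target - baseline) / baseline * 100.0);
        Ok(())
    }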

@Dandandan
Contributor

Sounds great!
Maybe one thing we can start with is running all the benchmark queries and outputting the results to e.g. a CSV file (with CLI options like --all --csv). This also helps with just running the benchmarks while developing (instead of having to run them one by one).
We could then store this output in a repo with some metadata like commit hash, machine details, etc.
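
A minimal sketch of what one stored row plus metadata could look like (the field names and example values here are assumptions, not an agreed format):

    // Hypothetical shape of a submitted result row: per-query timing plus enough
    // metadata (commit, machine) to make runs comparable later.
    use std::fs::File;
    use std::io::{self, Write};

    struct BenchmarkRow {
        commit: String,   // git commit hash the benchmark was built from
        machine: String,  // free-form machine description (CPU, RAM, OS)
        query: String,    // e.g. "q1"
        elapsed_ms: f64,  // median elapsed time in milliseconds
    }

    fn write_csv(path: &str, rows: &[BenchmarkRow]) -> io::Result<()> {
        let mut file = File::create(path)?;
        writeln!(file, "commit,machine,query,elapsed_ms")?;
        for r in rows {
            writeln!(file, "{},{},{},{}", r.commit, r.machine, r.query, r.elapsed_ms)?;
        }
        Ok(())
    }

    fn main() -> io::Result<()> {
        // Placeholder values for illustration only.
        let rows = vec![BenchmarkRow {
            commit: "abc1234".into(),
            machine: "8-core x86_64, 32 GB RAM".into(),
            query: "q1".into(),
            elapsed_ms: 1234.5,
        }];
        write_csv("results.csv", &rows)
    }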

@andygrove
Member Author

Ballista already has an option to produce a summary JSON file:

    /// Path to output directory where JSON summary file should be written to
    #[structopt(parse(from_os_str), short = "o", long = "output")]
    output_path: Option<PathBuf>,
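
For illustration, a sketch of how an output path like that might be used to write a JSON summary (RunSummary is a placeholder struct, not Ballista's actual summary format; assumes the serde and serde_json crates):

    use std::fs::File;
    use std::path::PathBuf;

    // Placeholder summary record; the real Ballista summary has its own schema.
    #[derive(serde::Serialize)]
    struct RunSummary {
        query: String,
        elapsed_ms: f64,
    }

    fn write_summary(output_path: &Option<PathBuf>, summary: &[RunSummary]) -> std::io::Result<()> {
        if let Some(dir) = output_path {
            // Write summary.json into the requested output directory.
            let file = File::create(dir.join("summary.json"))?;
            serde_json::to_writer_pretty(file, summary)?;
        }
        Ok(())
    }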

@andygrove
Member Author

I updated the issue description to suggest creating this repo at datafusion-contrib/benchmark-automation instead of as an Apache Arrow subproject. I figure we can dump automation scripts into this repo as well.
