[DISCUSSION] Lower Google Benchmark Suite Runtime #5773
@dillon-cullinan We can also provide our own base class / fixture for benchmarks which provides a built-in command line option to set max_iterations. Then all libcudf benchmarks can inherit that fixture and benefit from the limit. I think a combination of this and Approach 1 is probably best.
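A minimal sketch of what that shared iteration limit might look like, kept standard-library-only so it stands alone (the real fixture would inherit `::benchmark::Fixture` and wire this into GBench; the environment variable name and helper functions here are hypothetical illustrations, not the merged implementation):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: read an iteration cap from the environment.
// Returns 0 when the variable is unset, meaning "no limit".
inline std::int64_t max_iterations_from_env(
    const char* var = "CUDF_BENCHMARK_MAX_ITERATIONS")
{
  const char* val = std::getenv(var);
  return val ? std::strtoll(val, nullptr, 10) : 0;
}

// Clamp whatever iteration count GBench would otherwise choose.
// A base fixture would apply this via b->Iterations(...) for every
// benchmark that inherits it.
inline std::int64_t capped_iterations(std::int64_t chosen, std::int64_t cap)
{
  return (cap > 0) ? std::min(chosen, cap) : chosen;
}
```

With this shape, setting one environment variable caps every inheriting benchmark at once, while leaving benchmarks that pin their own iteration count untouched.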
Ya, I agree that is a reasonable approach as well. I figured it wouldn't be a bad idea to contribute back to GBench if we could (or if they wanted it). I think ultimately both approaches will have to happen anyway, since more benchmarks are coming in the future.
They used to have a maximum and they removed it, so I am not sure they would want it. :)
This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
We continue to add more benchmarks, so I'm sure this is getting worse, not better. @dillon-cullinan do you have any recent data on the total benchmark suite runtime? |
@harrism No recent data, we turned off the dev builds to open up compute on our internal cluster. I'm okay with setting it up to run again a few times to get runtimes. |
Running all benchmarks for just 1 iteration today takes 67 minutes. Most of the time is spent initializing the data rather than in the actual benchmark functions.
I wonder if there's some way to share input data for some of the benchmarks. |
How many benchmarks are initializing input data on host and then transferring to device instead of initializing in device directly? |
Almost all of them. |
(Timings are for a single iteration of every benchmark.)
CONTIGUOUS_SPLIT_BENCH has 6 GB, 4 GB, and 1 GB data created on host.
Agreed, this should be changed to generate data on the device. Building off of Mark's idea, we should look at trying to cache inputs generated with the same parameters. That would be an easy way to reuse inputs without needing to modify any benchmark code. |
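The caching idea above could be sketched roughly as follows: a cache keyed by the generation parameters, so benchmarks requesting identically-parameterized inputs share one table instead of regenerating it. This is a hypothetical illustration (the key and `cached_input` helper are invented for the sketch, and a `std::vector` stands in for a `cudf::table`), not libcudf's actual code:

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <numeric>
#include <tuple>
#include <vector>

// Stand-ins: the real key would cover dtype, null probability, etc.,
// and the cached value would be a device-side cudf::table.
using input_key  = std::tuple<std::size_t, std::uint64_t>;
using input_data = std::vector<int>;

// For demonstration only: counts how often generation actually runs.
inline int generation_count = 0;

inline std::shared_ptr<input_data const> cached_input(std::size_t rows,
                                                      std::uint64_t seed)
{
  static std::map<input_key, std::shared_ptr<input_data const>> cache;
  auto key = input_key{rows, seed};
  if (auto it = cache.find(key); it != cache.end()) return it->second;

  // Expensive generation happens only on a cache miss. In libcudf this
  // would be device-side generation (e.g. create_random_table).
  ++generation_count;
  auto data = std::make_shared<input_data>(rows);
  std::iota(data->begin(), data->end(), static_cast<int>(seed));
  cache.emplace(key, data);
  return data;
}
```

Because the cache sits outside any benchmark, reuse happens transparently: two benchmarks asking for the same (rows, seed) pair get the same shared pointer, and generation runs once.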
To address part of #5773. This allows running benchmarks for only a specific number of iterations using the environment variable `CUDF_BENCHMARK_ITERATIONS`, except when the benchmark definition itself specifies an iteration count. Also makes the pool static, so the pool memory resource is allocated only once per binary.

Authors:
- Karthikeyan (https://github.com/karthikeyann)

Approvers:
- Conor Hoekstra (https://github.com/codereport)
- Vukasin Milovanovic (https://github.com/vuule)
- Nghia Truong (https://github.com/ttnghia)

URL: #10060
Addresses parts of #5773.
- Add `create_sequence_table`, which creates sequences on device (only numeric types supported), with/without nulls.
- Add `create_random_null_mask` to create a random null mask with a given null probability (0.0–1.0).
- ~~Add gnu++17 to generate_input.cu (temporarily for int128 STL support).~~
- Renamed `repeat_dtypes` to `cycle_dtypes` and moved it out of the create_* methods.
- Updated the ast, search, scatter, and binary ops benchmarks.

Splitting PR #10109 for review.

Authors:
- Karthikeyan (https://github.com/karthikeyann)

Approvers:
- Conor Hoekstra (https://github.com/codereport)
- Nghia Truong (https://github.com/ttnghia)
- Robert Maynard (https://github.com/robertmaynard)
- MithunR (https://github.com/mythrocks)
- Bradley Dice (https://github.com/bdice)

URL: #10300
To speed up benchmark input generation, move all data generation to device. To address #5773 (comment). This PR moves random input generation to device; the rest of the original work in this PR was split into multiple PRs and merged: #10277, #10278, #10279, #10280, #10281, #10300. With all of these changes, a single iteration of all benchmarks runs in under 1000 seconds (from 3067 s to 964 s). Running more iterations benefits even more, because the benchmark is restarted several times during a run, which calls the input generation code again. Closes #9857.

Authors:
- Karthikeyan (https://github.com/karthikeyann)

Approvers:
- Vyas Ramasubramani (https://github.com/vyasr)
- Vukasin Milovanovic (https://github.com/vuule)
- David Wendt (https://github.com/davidwendt)

URL: #10109
Final runtime: see #10109 (comment).
Addresses part of #5773. Uses `create_random_table` and moves benchmark input generation to device in the reduction nvbench benchmarks.

Authors:
- Karthikeyan (https://github.com/karthikeyann)

Approvers:
- Christopher Harris (https://github.com/cwharris)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #10486
## Problem

The Google Benchmarks take 3+ hours to run to completion, which is not sustainable if we want to provide regression feedback to PR authors in a reasonable time.
## Goal

I would personally consider a reasonable time for a benchmark run to be about 1.5 hours (excluding source build time). Keep in mind that we may introduce more benchmarks in the future, which will add more time.
## Description

Here is a breakdown of the time it took to run each benchmark for one of our internal builds, before any changes were made:

These numbers are a better representation of each benchmark's runtime relative to the others; the actual times can be much longer depending on GBench setup/teardown or data generation time (I have seen 40+ minutes for Type Dispatcher, for example).
## Solutions

### Approach 1

Significantly reduce the parameter variations for the long-running benchmarks.

Type Dispatcher, for example, has too many parameter variations. One parameter range goes from 1024 to 67 million, doubling at each step. This can easily be cut down by stepping by powers of four instead. The same approach can be applied to many of our biggest runtime offenders.
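The savings from a coarser step are easy to quantify. The sketch below models how many variants GBench's `Range(lo, hi)` produces for a given `RangeMultiplier` (assuming, for simplicity, that `hi` is an exact multiple of `lo` by a power of the multiplier, as it is in the 1024 → 67 million case); the helper is illustrative, not GBench's own code:

```cpp
#include <cstdint>

// Count the benchmark variants produced by Range(lo, hi) with a given
// RangeMultiplier: lo, lo*m, lo*m^2, ... up to and including hi.
inline int range_variant_count(std::int64_t lo, std::int64_t hi,
                               std::int64_t mult)
{
  int count = 0;
  for (std::int64_t v = lo; v <= hi; v *= mult) ++count;
  return count;
}
```

Going from 1024 (2^10) to 67,108,864 (2^26), a multiplier of 2 yields 17 variants, while a multiplier of 4 yields 9 — nearly halving the runs for that parameter alone, and the savings multiply across combined parameter ranges.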
### Approach 2

Introduce an "iteration ceiling" to GBench.

We can also cut a lot of time by reducing the number of iterations for the smaller tests. GBench currently does not support an iteration ceiling, only a hardcoded iteration count set in code or via the CLI. That isn't really usable here, as a fixed count can introduce significant jitter in the very fast tests. However, we could contribute a ceiling feature to GBench to solve the problem.
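The difference between a hardcoded count and a ceiling can be sketched as follows. GBench normally picks enough iterations to fill a minimum wall time; a ceiling would clamp that choice, so slow benchmarks stop early while fast ones still get enough iterations for stable statistics. This is a simplified model of GBench's iteration selection, not its actual implementation:

```cpp
#include <algorithm>
#include <cstdint>

// Simplified model: run enough iterations to fill `min_time_ns`,
// then clamp with a hypothetical ceiling.
inline std::int64_t iterations_with_ceiling(std::int64_t min_time_ns,
                                            std::int64_t per_iteration_ns,
                                            std::int64_t ceiling)
{
  std::int64_t wanted =
      std::max<std::int64_t>(1, min_time_ns / per_iteration_ns);
  return std::min(wanted, ceiling);
}
```

A 100 ns-per-iteration microbenchmark with a 1 s minimum time would want ten million iterations and gets capped; a 100 ms-per-iteration benchmark only wants ten iterations and is untouched, which is exactly the asymmetry a hardcoded count cannot express.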
## Related Info

Concatenate Improvement: #5743: 22 min -> 16 min according to the latest build
Reduction Improvement: #5744
Contributing back context: #5744 (comment)