[DISCUSSION] Lower Google Benchmark Suite Runtime #5773

Closed
dillon-cullinan opened this issue Jul 27, 2020 · 14 comments
Assignees
Labels
libcudf - Affects libcudf (C++/CUDA) code; Performance - Performance related issue; proposal - Change current process or code; tests - Unit testing for project

Comments

@dillon-cullinan
Contributor

dillon-cullinan commented Jul 27, 2020

Problem

The Google Benchmark suite takes 3+ hours to run to completion, which is not sustainable if we want to provide regression feedback to PR authors in a reasonable time.

Goal

I would personally consider a reasonable time for a benchmark run to be about 1.5 hours (excluding source build time). Keep in mind that we may introduce more benchmarks in the future, which will add more time.

Description

Here is a breakdown of the time it took to run each benchmark for one of our internal builds, before any changes were made:

00:35:48 TYPE_DISPATCHER_BENCH
00:28:40 COLUMN_CONCAT_BENCH
00:22:46 REDUCTION_BENCH
00:13:13 APPLY_BOOLEAN_MASK_BENCH
00:12:52 PARQUET_READER_BENCH
00:12:06 PARQUET_WRITER_BENCH
00:11:56 GATHER_BENCH
00:10:30 SCATTER_BENCH
00:10:16 ORC_READER_BENCH
00:06:38 CONTIGUOUS_SPLIT_BENCH
00:05:34 HASHING_BENCH
00:04:43 SHIFT_BENCH
00:04:27 SEARCH_BENCH
00:03:32 CSV_READER_BENCH
00:02:50 ITERATOR_BENCH
00:02:23 CSV_WRITER_BENCH
00:01:52 PARQUET_WRITER_CHUNKS_BENCH
00:01:50 ORC_WRITER_BENCH
00:01:24 TRANSPOSE_BENCH
00:00:59 JOIN_BENCH
00:00:53 GROUPBY_BENCH
00:00:36 NULLMASK_BENCH
00:00:35 MERGE_BENCH
00:00:05 SUBWORD_TOKENIZER_BENCH

These numbers better represent the runtime of each benchmark relative to the others; the absolute times can be much longer depending on GBench setup/teardown time and data generation (I have seen 40+ minutes for Type Dispatcher, for example).

Solutions

Approach 1

Significantly reduce the number of parameter variations for the long-running benchmarks.

Type Dispatcher, for example, has too many parameter variations. One parameter range goes from 1024 to 67 million, doubling at each step. This can easily be cut down by stepping by a power of four instead. The same approach can be applied to many of our biggest runtime offenders.
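
For illustration, a minimal sketch of what this change looks like with Google Benchmark's range registration (the benchmark name and body are placeholders, not the actual Type Dispatcher registration):

```cpp
#include <benchmark/benchmark.h>

// Placeholder benchmark body; stands in for the real type-dispatched kernel launch.
static void BM_type_dispatcher(benchmark::State& state)
{
  auto const num_rows = state.range(0);
  for (auto _ : state) {
    benchmark::DoNotOptimize(num_rows);
  }
}

// Current style: every power of two from 1 << 10 (1024) to 1 << 26 (~67 million)
// gives 17 size variations per configuration.
BENCHMARK(BM_type_dispatcher)->RangeMultiplier(2)->Range(1 << 10, 1 << 26);

// Proposed: step by powers of four instead, which gives 9 size variations and
// roughly halves the number of cases over the same range.
BENCHMARK(BM_type_dispatcher)->RangeMultiplier(4)->Range(1 << 10, 1 << 26);
```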

Approach 2

Introduce an "iteration ceiling" to GBench.

We can also cut a lot of time by reducing the number of iterations for the smaller tests. GBench currently does not support an iteration ceiling, only a hardcoded number of iterations set in code or on the CLI. That isn't really usable on its own, as it can introduce significant jitter in the very fast tests. However, we could contribute a ceiling option to GBench to solve the problem.
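
For context, a small sketch of the two mechanisms GBench does support today; a ceiling ("run at most N iterations, otherwise let the timer decide") would have to be added. The benchmark name and body are illustrative only:

```cpp
#include <benchmark/benchmark.h>

static void BM_small_case(benchmark::State& state)
{
  for (auto _ : state) {
    benchmark::DoNotOptimize(state.iterations());  // stands in for a very fast operation
  }
}

// Fixed count: always run exactly 100 iterations, regardless of how noisy that is.
BENCHMARK(BM_small_case)->Iterations(100);

// Time-based heuristic: GBench picks the iteration count to fill the minimum time
// (also settable globally via --benchmark_min_time). There is no "at most N iterations"
// knob, which is the gap an iteration ceiling would fill.
BENCHMARK(BM_small_case)->MinTime(0.5);
```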

Related Info

Concatenate improvement: #5743 (22 min -> 16 min according to the latest build)
Reduction improvement: #5744

Contributing back context: #5744 (comment)

dillon-cullinan added the proposal, Needs Triage, tests, and libcudf labels on Jul 27, 2020
@harrism
Member

harrism commented Jul 30, 2020

@dillon-cullinan We can also provide our own base class / fixture for benchmarks which provides a built-in command line option to set max_iterations. Then all libcudf benchmarks can inherit that fixture and benefit from the limit.

I think a combination of this and Approach 1 is probably best.
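
A rough sketch of how such a shared cap could be wired up without changes to GBench itself; the environment variable and helper name are hypothetical, not an existing libcudf option:

```cpp
#include <benchmark/benchmark.h>
#include <cstdlib>
#include <string>

// Hypothetical helper: cap iterations for a registered benchmark from an
// environment variable shared by all libcudf benchmarks.
inline benchmark::internal::Benchmark* with_iteration_cap(benchmark::internal::Benchmark* b)
{
  if (char const* env = std::getenv("LIBCUDF_BENCH_MAX_ITERATIONS")) {
    b->Iterations(std::stoll(env));  // a fixed count overrides GBench's timing heuristic
  }
  return b;
}

static void BM_example(benchmark::State& state)
{
  for (auto _ : state) { benchmark::DoNotOptimize(state.iterations()); }
}

// Register through the helper so every benchmark picks up the same limit.
auto* bm_example = with_iteration_cap(benchmark::RegisterBenchmark("BM_example", BM_example));
```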

@dillon-cullinan
Contributor Author

Yeah, I agree that's a reasonable approach as well. I figured it wouldn't be a bad idea to contribute back to GBench if we could (or if they wanted it). I think ultimately both approaches will have to happen anyway, since more benchmarks are coming in the future.

@harrism
Copy link
Member

harrism commented Jul 31, 2020

They used to have a maximum and they removed it, so I am not sure they would want it. :)

@kkraus14 kkraus14 removed the Needs Triage Need team to review and classify label Aug 5, 2020
@github-actions

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@harrism
Member

harrism commented Feb 17, 2021

We continue to add more benchmarks, so I'm sure this is getting worse, not better. @dillon-cullinan do you have any recent data on the total benchmark suite runtime?

github-actions bot removed the rotten label on Feb 17, 2021
@dillon-cullinan
Contributor Author

@harrism No recent data; we turned off the dev builds to open up compute on our internal cluster. I'm okay with setting it up to run again a few times to get runtimes.

@karthikeyann
Contributor

Running all benchmarks for just 1 iteration today takes 67 minutes.
One of the issues with the present benchmark fixture is that it creates the memory pool in each SetUp/TearDown.
After making it static:

- all benchmarks for 1 iteration take 51 minutes
- all benchmarks for 2 iterations take 58 minutes
- all benchmarks for 4 iterations take 73 minutes

Most of the time is spent initializing the data rather than in the benchmarked functions themselves; 1 iteration of the benchmarked code itself takes only ~7.3 minutes.
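
For reference, a minimal sketch of the "static pool" change described above, using RMM's pool resource (illustrative only, not the actual cudf fixture code, and assuming an RMM version where the pool's initial size is optional):

```cpp
#include <benchmark/benchmark.h>
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>

// Illustrative fixture: the pool is created once per binary (function-local statics)
// instead of being constructed and destroyed in every SetUp/TearDown.
class benchmark_fixture : public benchmark::Fixture {
 public:
  void SetUp(benchmark::State&) override
  {
    static rmm::mr::cuda_memory_resource cuda_mr{};
    static rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> pool_mr{&cuda_mr};
    rmm::mr::set_current_device_resource(&pool_mr);
  }
  void TearDown(benchmark::State&) override
  {
    // Intentionally empty: the pool outlives individual benchmarks.
  }
};
```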

@harrism
Member

harrism commented Dec 6, 2021

I wonder if there's some way to share input data for some of the benchmarks.

@jrhemstad
Contributor

> Most of the time is spent initializing the data rather than in the benchmarked functions themselves; 1 iteration of the benchmarked code itself takes only ~7.3 minutes.

How many benchmarks are initializing input data on the host and then transferring it to the device instead of initializing it on the device directly?

jrhemstad added the Performance label on Dec 7, 2021
@karthikeyann
Contributor

karthikeyann commented Dec 7, 2021

> > Most of the time is spent initializing the data rather than in the benchmarked functions themselves; 1 iteration of the benchmarked code itself takes only ~7.3 minutes.
>
> How many benchmarks are initializing input data on the host and then transferring it to the device instead of initializing it on the device directly?

Almost all of them.
The random data generator in https://github.com/rapidsai/cudf/blob/branch-22.02/cpp/benchmarks/common/generate_benchmark_input.hpp executes on the host. This file could be a starting point for optimization.
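
As a small sketch of the direction being suggested here: fill the buffer directly in device memory with Thrust instead of building host data and copying it over (column construction from the buffer is omitted):

```cpp
#include <cstdint>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_uvector.hpp>
#include <rmm/exec_policy.hpp>
#include <thrust/sequence.h>

// Generate n int32 values directly on the device: no host-side staging buffer,
// no host-to-device copy.
rmm::device_uvector<int32_t> make_device_input(std::size_t n, rmm::cuda_stream_view stream)
{
  rmm::device_uvector<int32_t> data(n, stream);
  thrust::sequence(rmm::exec_policy(stream), data.begin(), data.end());
  return data;
}
```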

@karthikeyann
Contributor

karthikeyann commented Dec 7, 2021

Single iteration of every benchmark:

| Bench | Seconds | Cumulative (s) | % of total | Reverse cum. % |
| --- | --- | --- | --- | --- |
| NULLMASK_BENCH | 0.992 | 0.992 | 0.032341853 | 100 |
| SCATTER_LISTS_BENCH | 1.15 | 2.142 | 0.037493076 | 99.96765815 |
| ITERATOR_BENCH | 1.33 | 3.472 | 0.043361557 | 99.93016507 |
| STREAM_COMPACTION_BENCH | 2.614 | 6.086 | 0.085223392 | 99.88680351 |
| MERGE_BENCH | 4.115 | 10.201 | 0.134160007 | 99.80158012 |
| TRANSPOSE_BENCH | 5.007 | 15.208 | 0.163241593 | 99.66742011 |
| COLUMN_CONCAT_BENCH | 5.685 | 20.893 | 0.185346206 | 99.50417852 |
| HASHING_BENCH | 9.969 | 30.862 | 0.325016065 | 99.31883232 |
| FILL_BENCH | 15.57 | 46.432 | 0.507623646 | 98.99381625 |
| BINARYOP_BENCH | 17.153 | 63.585 | 0.559233681 | 98.4861926 |
| SHIFT_BENCH | 23.703 | 87.288 | 0.772781201 | 97.92695892 |
| PARQUET_WRITER_CHUNKS_BENCH | 24.154 | 111.442 | 0.787485007 | 97.15417772 |
| COPY_IF_ELSE_BENCH | 24.227 | 135.669 | 0.789865002 | 96.36669272 |
| GROUPBY_BENCH | 25.528 | 161.197 | 0.832281082 | 95.57682771 |
| REPLACE_BENCH | 30.899 | 192.096 | 1.007390048 | 94.74454663 |
| SCATTER_BENCH | 33.362 | 225.458 | 1.087690436 | 93.73715658 |
| CSV_READER_BENCH | 34.317 | 259.775 | 1.118825991 | 92.64946615 |
| TEXT_BENCH | 47.56 | 307.335 | 1.550583213 | 91.53064016 |
| GATHER_BENCH | 52.777 | 360.112 | 1.720671367 | 89.98005694 |
| TYPE_DISPATCHER_BENCH | 60.088 | 420.2 | 1.959029523 | 88.25938558 |
| AST_BENCH | 62.098 | 482.298 | 2.024560899 | 86.30035605 |
| SORT_BENCH | 68.567 | 550.865 | 2.235467602 | 84.27579515 |
| CSV_WRITER_BENCH | 76.154 | 627.019 | 2.482824096 | 82.04032755 |
| REDUCTION_BENCH | 98.295 | 725.314 | 3.204679918 | 79.55750346 |
| PARQUET_READER_BENCH | 134.25 | 859.564 | 4.37690909 | 76.35282354 |
| SEARCH_BENCH | 142.518 | 1002.082 | 4.646468006 | 71.97591445 |
| QUANTILES_BENCH | 144.707 | 1146.789 | 4.717835261 | 67.32944644 |
| APPLY_BOOLEAN_MASK_BENCH | 149.574 | 1296.363 | 4.876512479 | 62.61161118 |
| PARQUET_WRITER_BENCH | 167.921 | 1464.284 | 5.474673753 | 57.7350987 |
| ORC_READER_BENCH | 171.48 | 1635.764 | 5.590706673 | 52.26042495 |
| JOIN_BENCH | 184.689 | 1820.453 | 6.021355404 | 46.66971828 |
| ORC_WRITER_BENCH | 218.322 | 2038.775 | 7.117881165 | 40.64836287 |
| JSON_BENCH | 285.181 | 2323.956 | 9.297663399 | 33.53048171 |
| STRINGS_BENCH | 366.164 | 2690.12 | 11.93792581 | 24.23281831 |
| CONTIGUOUS_SPLIT_BENCH | 377.113 | 3067.233 | 12.2948925 | 12.2948925 |
| Total | 3067.233 | | 100 | |

CONTIGUOUS_SPLIT_BENCH has 6 GB, 4 GB, and 1 GB inputs created on the host.
One common step that would speed up all benchmarks is to allow column_wrapper to be populated on the device using __device__ iterators.
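
A rough illustration of that idea: build a cudf column from values produced by a __device__ callable rather than a host iterator. API usage is a sketch against the branch-22.02-era interfaces, not the actual column_wrapper change, and the device lambda requires nvcc with extended lambdas enabled:

```cpp
#include <cudf/column/column.hpp>
#include <cudf/types.hpp>
#include <memory>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>
#include <rmm/device_uvector.hpp>
#include <rmm/exec_policy.hpp>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/transform.h>

// Produce the values with a device lambda and hand the buffer straight to a
// cudf::column, skipping the host-side iteration that column_wrapper does today.
std::unique_ptr<cudf::column> make_int32_column(cudf::size_type n, rmm::cuda_stream_view stream)
{
  rmm::device_uvector<int32_t> data(n, stream);
  thrust::transform(rmm::exec_policy(stream),
                    thrust::counting_iterator<cudf::size_type>(0),
                    thrust::counting_iterator<cudf::size_type>(n),
                    data.begin(),
                    [] __device__(cudf::size_type i) { return i % 100; });
  return std::make_unique<cudf::column>(
    cudf::data_type{cudf::type_id::INT32}, n, data.release(), rmm::device_buffer{}, 0);
}
```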

@jrhemstad
Contributor

jrhemstad commented Dec 7, 2021

> Almost all of them.
> The random data generator in https://github.com/rapidsai/cudf/blob/branch-22.02/cpp/benchmarks/common/generate_benchmark_input.hpp executes on the host. This file could be a starting point for optimization.

Agreed, this should be changed to generate data on the device.

Building off of Mark's idea, we should look at trying to cache inputs generated with the same parameters. That would be an easy way to reuse inputs without needing to modify any benchmark code.
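
A minimal sketch of that caching idea, keyed on the generation parameters; the key shape, function names, and use of cudf::table here are assumptions for illustration, not existing benchmark utilities:

```cpp
#include <cudf/table/table.hpp>
#include <functional>
#include <map>
#include <memory>
#include <tuple>

// Cache generated tables by their generation parameters so benchmark cases that
// repeat the same sizes reuse one input instead of regenerating it every time.
using input_key = std::tuple<int /*num_rows*/, int /*num_cols*/, int /*dtype_group*/>;

cudf::table const& cached_input(input_key const& key,
                                std::function<std::unique_ptr<cudf::table>()> const& generate)
{
  static std::map<input_key, std::unique_ptr<cudf::table>> cache;
  auto it = cache.find(key);
  if (it == cache.end()) { it = cache.emplace(key, generate()).first; }
  return *it->second;
}
```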

karthikeyann self-assigned this on Dec 8, 2021
rapids-bot bot pushed a commit that referenced this issue Jan 25, 2022
To address part of #5773.
This allows running benchmarks for a specific number of iterations using the environment variable `CUDF_BENCHMARK_ITERATIONS`, except when the benchmark definition itself specifies an iteration count.

Also makes the pool static so the pool memory resource is allocated only once per binary.

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Conor Hoekstra (https://github.com/codereport)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Nghia Truong (https://github.com/ttnghia)

URL: #10060
rapids-bot bot pushed a commit that referenced this issue Feb 25, 2022
Addresses parts of #5773
- Add `create_sequence_table`, which creates sequences on the device (only numeric types supported), with or without nulls.
- Add `create_random_null_mask` to create a random null mask with a given null probability (0.0-1.0).
- ~~Add gnu++17 to generate_input.cu (temporarily, for int128 STL support).~~
- Renamed `repeat_dtypes` to `cycle_dtypes` and moved it out of the create_* methods
- Updated the AST, search, scatter, and binary-ops benchmarks


Splitting PR #10109 for review

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Conor Hoekstra (https://github.com/codereport)
  - Nghia Truong (https://github.com/ttnghia)
  - Robert Maynard (https://github.com/robertmaynard)
  - MithunR (https://github.com/mythrocks)
  - Bradley Dice (https://github.com/bdice)

URL: #10300
rapids-bot bot pushed a commit that referenced this issue Mar 22, 2022
To speed up benchmark input generation, move all data generation to the device.
To address #5773 (comment), this PR moves the random input generation to the device.

The rest of the original work in this PR was split into multiple PRs and merged:
#10277
#10278
#10279
#10280
#10281
#10300

With all of these changes, a single iteration of all benchmarks runs in under 1000 seconds (from 3067 s to 964 s).
Running more iterations benefits even more, because the benchmark is restarted several times during a run, which invokes the input generation code again each time.

closes #9857

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Vukasin Milovanovic (https://github.com/vuule)
  - David Wendt (https://github.com/davidwendt)

URL: #10109
@karthikeyann
Contributor

karthikeyann commented Mar 22, 2022

#10109 (comment)
The main contributors to runtime are now the benchmarks themselves:

  • REDUCTION NVBENCH
  • JOIN NVBENCH
  • cuIO benchmarks
  • STRINGS_BENCH StringReplace benchmarks.

Final runtime

| Bench | Seconds | % of total |
| --- | --- | --- |
| REDUCTION_NVBENCH | 175 | 14.92116 |
| PARQUET_WRITER_BENCH | 134.922 | 11.50396 |
| JOIN_BENCH | 134.391 | 11.45868 |
| ORC_WRITER_BENCH | 109.489 | 9.335446 |
| STRINGS_BENCH | 83.287 | 7.101364 |
| ORC_READER_BENCH | 81.172 | 6.921031 |
| CSV_WRITER_BENCH | 76.628 | 6.533593 |
| PARQUET_READER_BENCH | 68.16 | 5.811579 |
| JOIN_NVBENCH | 59.513 | 5.074303 |
| PARQUET_WRITER_CHUNKS_BENCH | 48.677 | 4.150385 |
| CSV_READER_BENCH | 36.325 | 3.097207 |
| STREAM_COMPACTION_NVBENCH | 29.748 | 2.536427 |
| QUANTILES_BENCH | 26.583 | 2.266567 |
| TEXT_BENCH | 13.615 | 1.160866 |
| CONTIGUOUS_SPLIT_BENCH | 13.076 | 1.114909 |
| TYPE_DISPATCHER_BENCH | 11.781 | 1.004493 |
| SORT_BENCH | 11.2 | 0.954954 |
| SEARCH_BENCH | 9.918 | 0.845646 |
| JSON_BENCH | 6.085 | 0.51883 |
| COLUMN_CONCAT_BENCH | 5.031 | 0.428962 |
| APPLY_BOOLEAN_MASK_BENCH | 4.512 | 0.38471 |
| MERGE_BENCH | 4.19 | 0.357255 |
| GROUPBY_BENCH | 3.296 | 0.281029 |
| BINARYOP_BENCH | 3.254 | 0.277448 |
| REDUCTION_BENCH | 2.73 | 0.23277 |
| SCATTER_BENCH | 2.149 | 0.183232 |
| AST_BENCH | 2.077 | 0.177093 |
| TRANSPOSE_BENCH | 2.068 | 0.176325 |
| GATHER_BENCH | 2.049 | 0.174705 |
| REPLACE_BENCH | 1.72 | 0.146654 |
| HASHING_BENCH | 1.667 | 0.142135 |
| COPY_IF_ELSE_BENCH | 1.62 | 0.138127 |
| SHIFT_BENCH | 1.596 | 0.136081 |
| SCATTER_LISTS_BENCH | 1.467 | 0.125082 |
| FILL_BENCH | 1.33 | 0.113401 |
| ITERATOR_BENCH | 1.33 | 0.113401 |
| NULLMASK_BENCH | 1.175 | 0.100185 |
| Total | 1172.831 | 100 |

rapids-bot bot pushed a commit that referenced this issue Mar 28, 2022
Addresses part of #5773
uses `create_random_table` and moves benchmark input generation in device in reduction nvbench

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Christopher Harris (https://github.com/cwharris)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #10486
@GregoryKimball
Contributor

I suggest we close this due to the great progress in #10109, #10277, #10278, #10279, #10280, #10281, #10486, #10677
