[DISCUSSION] Lower Google Benchmark Suite Runtime #5773

Closed
dillon-cullinan opened this issue Jul 27, 2020 · 14 comments
Assignees
Labels
libcudf - Affects libcudf (C++/CUDA) code; Performance - Performance related issue; proposal - Change current process or code; tests - Unit testing for project

Comments

@dillon-cullinan
Contributor

dillon-cullinan commented Jul 27, 2020

Problem

The Google Benchmark suite takes 3+ hours to run to completion, which is not sustainable if we want to provide regression feedback to PR authors in a reasonable time.

Goal

I would personally consider a reasonable time for a benchmark run to be about 1.5 hours (excluding source build time). Keep in mind that we may introduce more benchmarks in the future, which will add more time.

Description

Here is a breakdown of the time it took to run each benchmark for one of our internal builds, before any changes were made:

00:35:48 TYPE_DISPATCHER_BENCH
00:28:40 COLUMN_CONCAT_BENCH
00:22:46 REDUCTION_BENCH
00:13:13 APPLY_BOOLEAN_MASK_BENCH
00:12:52 PARQUET_READER_BENCH
00:12:06 PARQUET_WRITER_BENCH
00:11:56 GATHER_BENCH
00:10:30 SCATTER_BENCH
00:10:16 ORC_READER_BENCH
00:06:38 CONTIGUOUS_SPLIT_BENCH
00:05:34 HASHING_BENCH
00:04:43 SHIFT_BENCH
00:04:27 SEARCH_BENCH
00:03:32 CSV_READER_BENCH
00:02:50 ITERATOR_BENCH
00:02:23 CSV_WRITER_BENCH
00:01:52 PARQUET_WRITER_CHUNKS_BENCH
00:01:50 ORC_WRITER_BENCH
00:01:24 TRANSPOSE_BENCH
00:00:59 JOIN_BENCH
00:00:53 GROUPBY_BENCH
00:00:36 NULLMASK_BENCH
00:00:35 MERGE_BENCH
00:00:05 SUBWORD_TOKENIZER_BENCH

These numbers better represent the runtime of each benchmark relative to the others; the absolute times can be much longer depending on GBench setup/teardown time and data generation (I have seen 40+ minutes for Type Dispatcher, for example).

Solutions

Approach 1

Significantly reduce the number of parameter variations for the long-running benchmarks.

Type Dispatcher, for example, has too many parameter variations. One parameter range goes from 1024 to 67 million, doubling at each step. This can easily be cut down by stepping by a power of four instead. The same approach can be applied to many of our biggest runtime offenders.
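
For illustration, a minimal sketch of what this change looks like with Google Benchmark's range registration (the benchmark name and body are placeholders, not the actual Type Dispatcher registration):

```cpp
#include <benchmark/benchmark.h>

// Placeholder benchmark body; stands in for the real type-dispatched kernel launch.
static void BM_type_dispatcher(benchmark::State& state)
{
  auto const num_rows = state.range(0);
  for (auto _ : state) {
    benchmark::DoNotOptimize(num_rows);
  }
}

// Current style: every power of two from 1 << 10 (1024) to 1 << 26 (~67 million)
// gives 17 size variations per configuration.
BENCHMARK(BM_type_dispatcher)->RangeMultiplier(2)->Range(1 << 10, 1 << 26);

// Proposed: step by powers of four instead, which gives 9 size variations and
// roughly halves the number of cases over the same range.
BENCHMARK(BM_type_dispatcher)->RangeMultiplier(4)->Range(1 << 10, 1 << 26);
```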

Approach 2

Introduce an "iteration ceiling" to GBench.

We can also cut a lot of time by reducing the number of iterations for the smaller tests. GBench currently does not support an iteration ceiling, only a hardcoded number of iterations set in code or on the CLI. That isn't really usable on its own, as it can introduce significant jitter in the very fast tests. However, we could contribute a ceiling option to GBench to solve the problem.
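
For context, a small sketch of the two mechanisms GBench does support today; a ceiling ("run at most N iterations, otherwise let the timer decide") would have to be added. The benchmark name and body are illustrative only:

```cpp
#include <benchmark/benchmark.h>

static void BM_small_case(benchmark::State& state)
{
  for (auto _ : state) {
    benchmark::DoNotOptimize(state.iterations());  // stands in for a very fast operation
  }
}

// Fixed count: always run exactly 100 iterations, regardless of how noisy that is.
BENCHMARK(BM_small_case)->Iterations(100);

// Time-based heuristic: GBench picks the iteration count to fill the minimum time
// (also settable globally via --benchmark_min_time). There is no "at most N iterations"
// knob, which is the gap an iteration ceiling would fill.
BENCHMARK(BM_small_case)->MinTime(0.5);
```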

Related Info

Concatenate improvement: #5743 (22 min -> 16 min according to the latest build)
Reduction improvement: #5744

Contributing back context: #5744 (comment)

dillon-cullinan added the proposal, Needs Triage, tests, and libcudf labels on Jul 27, 2020
@harrism
Member

harrism commented Jul 30, 2020

@dillon-cullinan We can also provide our own base class / fixture for benchmarks which provides a built-in command line option to set max_iterations. Then all libcudf benchmarks can inherit that fixture and benefit from the limit.

I think a combination of this and Approach 1 is probably best.
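
A rough sketch of how such a shared cap could be wired up without changes to GBench itself; the environment variable and helper name are hypothetical, not an existing libcudf option:

```cpp
#include <benchmark/benchmark.h>
#include <cstdlib>
#include <string>

// Hypothetical helper: cap iterations for a registered benchmark from an
// environment variable shared by all libcudf benchmarks.
inline benchmark::internal::Benchmark* with_iteration_cap(benchmark::internal::Benchmark* b)
{
  if (char const* env = std::getenv("LIBCUDF_BENCH_MAX_ITERATIONS")) {
    b->Iterations(std::stoll(env));  // a fixed count overrides GBench's timing heuristic
  }
  return b;
}

static void BM_example(benchmark::State& state)
{
  for (auto _ : state) { benchmark::DoNotOptimize(state.iterations()); }
}

// Register through the helper so every benchmark picks up the same limit.
auto* bm_example = with_iteration_cap(benchmark::RegisterBenchmark("BM_example", BM_example));
```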

@dillon-cullinan
Contributor Author

Yeah, I agree that's a reasonable approach as well. I figured it wouldn't be a bad idea to contribute back to GBench if we could (or if they wanted it). I think ultimately both approaches will have to happen anyway, since more benchmarks are coming in the future.

@harrism
Copy link
Member

harrism commented Jul 31, 2020

They used to have a maximum and they removed it, so I am not sure they would want it. :)

@kkraus14 kkraus14 removed the Needs Triage Need team to review and classify label Aug 5, 2020
@github-actions

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@harrism
Member

harrism commented Feb 17, 2021

We continue to add more benchmarks, so I'm sure this is getting worse, not better. @dillon-cullinan do you have any recent data on the total benchmark suite runtime?

github-actions bot removed the rotten label on Feb 17, 2021
@dillon-cullinan
Contributor Author

@harrism No recent data; we turned off the dev builds to open up compute on our internal cluster. I'm okay with setting it up to run again a few times to get runtimes.

@karthikeyann
Contributor

Running all benchmarks for just 1 iteration today takes 67 minutes.
One of the issues with the present benchmark fixture is that it creates the memory pool in each SetUp/TearDown.
After making it static:

- all benchmarks for 1 iteration take 51 minutes
- all benchmarks for 2 iterations take 58 minutes
- all benchmarks for 4 iterations take 73 minutes

Most of the time is spent initializing the data rather than in the benchmarked functions themselves; 1 iteration of the benchmarked code itself takes only ~7.3 minutes.
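
For reference, a minimal sketch of the "static pool" change described above, using RMM's pool resource (illustrative only, not the actual cudf fixture code, and assuming an RMM version where the pool's initial size is optional):

```cpp
#include <benchmark/benchmark.h>
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>

// Illustrative fixture: the pool is created once per binary (function-local statics)
// instead of being constructed and destroyed in every SetUp/TearDown.
class benchmark_fixture : public benchmark::Fixture {
 public:
  void SetUp(benchmark::State&) override
  {
    static rmm::mr::cuda_memory_resource cuda_mr{};
    static rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> pool_mr{&cuda_mr};
    rmm::mr::set_current_device_resource(&pool_mr);
  }
  void TearDown(benchmark::State&) override
  {
    // Intentionally empty: the pool outlives individual benchmarks.
  }
};
```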

@harrism
Member

harrism commented Dec 6, 2021

I wonder if there's some way to share input data for some of the benchmarks.

@jrhemstad
Contributor

> Most of the time is spent initializing the data rather than in the benchmarked functions themselves; 1 iteration of the benchmarked code itself takes only ~7.3 minutes.

How many benchmarks are initializing input data on the host and then transferring it to the device instead of initializing it on the device directly?

jrhemstad added the Performance label on Dec 7, 2021
@karthikeyann
Contributor

karthikeyann commented Dec 7, 2021

> > Most of the time is spent initializing the data rather than in the benchmarked functions themselves; 1 iteration of the benchmarked code itself takes only ~7.3 minutes.
>
> How many benchmarks are initializing input data on the host and then transferring it to the device instead of initializing it on the device directly?

Almost all of them.
The random data generator in https://github.com/rapidsai/cudf/blob/branch-22.02/cpp/benchmarks/common/generate_benchmark_input.hpp executes on the host. This file could be a starting point for optimization.
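
As a small sketch of the direction being suggested here: fill the buffer directly in device memory with Thrust instead of building host data and copying it over (column construction from the buffer is omitted):

```cpp
#include <cstdint>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_uvector.hpp>
#include <rmm/exec_policy.hpp>
#include <thrust/sequence.h>

// Generate n int32 values directly on the device: no host-side staging buffer,
// no host-to-device copy.
rmm::device_uvector<int32_t> make_device_input(std::size_t n, rmm::cuda_stream_view stream)
{
  rmm::device_uvector<int32_t> data(n, stream);
  thrust::sequence(rmm::exec_policy(stream), data.begin(), data.end());
  return data;
}
```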

@karthikeyann
Contributor

karthikeyann commented Dec 7, 2021

Single iteration of every benchmark:

| Bench | Seconds | Cumulative (s) | % of total | Reverse cum. % |
| --- | --- | --- | --- | --- |
| NULLMASK_BENCH | 0.992 | 0.992 | 0.032341853 | 100 |
| SCATTER_LISTS_BENCH | 1.15 | 2.142 | 0.037493076 | 99.96765815 |
| ITERATOR_BENCH | 1.33 | 3.472 | 0.043361557 | 99.93016507 |
| STREAM_COMPACTION_BENCH | 2.614 | 6.086 | 0.085223392 | 99.88680351 |
| MERGE_BENCH | 4.115 | 10.201 | 0.134160007 | 99.80158012 |
| TRANSPOSE_BENCH | 5.007 | 15.208 | 0.163241593 | 99.66742011 |
| COLUMN_CONCAT_BENCH | 5.685 | 20.893 | 0.185346206 | 99.50417852 |
| HASHING_BENCH | 9.969 | 30.862 | 0.325016065 | 99.31883232 |
| FILL_BENCH | 15.57 | 46.432 | 0.507623646 | 98.99381625 |
| BINARYOP_BENCH | 17.153 | 63.585 | 0.559233681 | 98.4861926 |
| SHIFT_BENCH | 23.703 | 87.288 | 0.772781201 | 97.92695892 |
| PARQUET_WRITER_CHUNKS_BENCH | 24.154 | 111.442 | 0.787485007 | 97.15417772 |
| COPY_IF_ELSE_BENCH | 24.227 | 135.669 | 0.789865002 | 96.36669272 |
| GROUPBY_BENCH | 25.528 | 161.197 | 0.832281082 | 95.57682771 |
| REPLACE_BENCH | 30.899 | 192.096 | 1.007390048 | 94.74454663 |
| SCATTER_BENCH | 33.362 | 225.458 | 1.087690436 | 93.73715658 |
| CSV_READER_BENCH | 34.317 | 259.775 | 1.118825991 | 92.64946615 |
| TEXT_BENCH | 47.56 | 307.335 | 1.550583213 | 91.53064016 |
| GATHER_BENCH | 52.777 | 360.112 | 1.720671367 | 89.98005694 |
| TYPE_DISPATCHER_BENCH | 60.088 | 420.2 | 1.959029523 | 88.25938558 |
| AST_BENCH | 62.098 | 482.298 | 2.024560899 | 86.30035605 |
| SORT_BENCH | 68.567 | 550.865 | 2.235467602 | 84.27579515 |
| CSV_WRITER_BENCH | 76.154 | 627.019 | 2.482824096 | 82.04032755 |
| REDUCTION_BENCH | 98.295 | 725.314 | 3.204679918 | 79.55750346 |
| PARQUET_READER_BENCH | 134.25 | 859.564 | 4.37690909 | 76.35282354 |
| SEARCH_BENCH | 142.518 | 1002.082 | 4.646468006 | 71.97591445 |
| QUANTILES_BENCH | 144.707 | 1146.789 | 4.717835261 | 67.32944644 |
| APPLY_BOOLEAN_MASK_BENCH | 149.574 | 1296.363 | 4.876512479 | 62.61161118 |
| PARQUET_WRITER_BENCH | 167.921 | 1464.284 | 5.474673753 | 57.7350987 |
| ORC_READER_BENCH | 171.48 | 1635.764 | 5.590706673 | 52.26042495 |
| JOIN_BENCH | 184.689 | 1820.453 | 6.021355404 | 46.66971828 |
| ORC_WRITER_BENCH | 218.322 | 2038.775 | 7.117881165 | 40.64836287 |
| JSON_BENCH | 285.181 | 2323.956 | 9.297663399 | 33.53048171 |
| STRINGS_BENCH | 366.164 | 2690.12 | 11.93792581 | 24.23281831 |
| CONTIGUOUS_SPLIT_BENCH | 377.113 | 3067.233 | 12.2948925 | 12.2948925 |
| Total | 3067.233 | | 100 | |

CONTIGUOUS_SPLIT_BENCH has 6 GB, 4 GB, and 1 GB inputs created on the host.
One common step that would speed up all benchmarks is to allow column_wrapper to be populated on the device using __device__ iterators.
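
A rough illustration of that idea: build a cudf column from values produced by a __device__ callable rather than a host iterator. API usage is a sketch against the branch-22.02-era interfaces, not the actual column_wrapper change, and the device lambda requires nvcc with extended lambdas enabled:

```cpp
#include <cudf/column/column.hpp>
#include <cudf/types.hpp>
#include <memory>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>
#include <rmm/device_uvector.hpp>
#include <rmm/exec_policy.hpp>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/transform.h>

// Produce the values with a device lambda and hand the buffer straight to a
// cudf::column, skipping the host-side iteration that column_wrapper does today.
std::unique_ptr<cudf::column> make_int32_column(cudf::size_type n, rmm::cuda_stream_view stream)
{
  rmm::device_uvector<int32_t> data(n, stream);
  thrust::transform(rmm::exec_policy(stream),
                    thrust::counting_iterator<cudf::size_type>(0),
                    thrust::counting_iterator<cudf::size_type>(n),
                    data.begin(),
                    [] __device__(cudf::size_type i) { return i % 100; });
  return std::make_unique<cudf::column>(
    cudf::data_type{cudf::type_id::INT32}, n, data.release(), rmm::device_buffer{}, 0);
}
```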

@jrhemstad
Contributor

jrhemstad commented Dec 7, 2021

> Almost all of them.
> The random data generator in https://github.com/rapidsai/cudf/blob/branch-22.02/cpp/benchmarks/common/generate_benchmark_input.hpp executes on the host. This file could be a starting point for optimization.

Agreed, this should be changed to generate data on the device.

Building off of Mark's idea, we should look at trying to cache inputs generated with the same parameters. That would be an easy way to reuse inputs without needing to modify any benchmark code.
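
A minimal sketch of that caching idea, keyed on the generation parameters; the key shape, function names, and use of cudf::table here are assumptions for illustration, not existing benchmark utilities:

```cpp
#include <cudf/table/table.hpp>
#include <functional>
#include <map>
#include <memory>
#include <tuple>

// Cache generated tables by their generation parameters so benchmark cases that
// repeat the same sizes reuse one input instead of regenerating it every time.
using input_key = std::tuple<int /*num_rows*/, int /*num_cols*/, int /*dtype_group*/>;

cudf::table const& cached_input(input_key const& key,
                                std::function<std::unique_ptr<cudf::table>()> const& generate)
{
  static std::map<input_key, std::unique_ptr<cudf::table>> cache;
  auto it = cache.find(key);
  if (it == cache.end()) { it = cache.emplace(key, generate()).first; }
  return *it->second;
}
```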

karthikeyann self-assigned this on Dec 8, 2021
rapids-bot bot pushed a commit that referenced this issue Jan 25, 2022
To address part of #5773.
This allows running benchmarks for a specific number of iterations using the environment variable `CUDF_BENCHMARK_ITERATIONS`, except when the benchmark definition itself specifies an iteration count.

Also makes the pool static so the pool memory resource is allocated only once per binary.

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Conor Hoekstra (https://github.com/codereport)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Nghia Truong (https://github.com/ttnghia)

URL: #10060
rapids-bot bot pushed a commit that referenced this issue Feb 25, 2022
Addresses parts of #5773
- Add `create_sequence_table`, which creates sequences on the device (only numeric types supported), with or without nulls.
- Add `create_random_null_mask` to create a random null mask with a given null probability (0.0-1.0).
- ~~Add gnu++17 to generate_input.cu (temporarily, for int128 STL support).~~
- Renamed `repeat_dtypes` to `cycle_dtypes` and moved it out of the create_* methods
- Updated the AST, search, scatter, and binary-ops benchmarks


Splitting PR #10109 for review

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Conor Hoekstra (https://github.com/codereport)
  - Nghia Truong (https://github.com/ttnghia)
  - Robert Maynard (https://github.com/robertmaynard)
  - MithunR (https://github.com/mythrocks)
  - Bradley Dice (https://github.com/bdice)

URL: #10300
rapids-bot bot pushed a commit that referenced this issue Mar 22, 2022
To speed up benchmark input generation, move all data generation to the device.
To address #5773 (comment), this PR moves the random input generation to the device.

The rest of the original work in this PR was split into multiple PRs and merged:
#10277
#10278
#10279
#10280
#10281
#10300

With all of these changes, a single iteration of all benchmarks runs in under 1000 seconds (from 3067 s to 964 s).
Running more iterations benefits even more, because the benchmark is restarted several times during a run, which invokes the input generation code again each time.

closes #9857

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Vukasin Milovanovic (https://github.com/vuule)
  - David Wendt (https://github.com/davidwendt)

URL: #10109
@karthikeyann
Contributor

karthikeyann commented Mar 22, 2022

#10109 (comment)
The main contributors to runtime are now the benchmarks themselves:

  • REDUCTION NVBENCH
  • JOIN NVBENCH
  • cuIO benchmarks
  • STRINGS_BENCH StringReplace benchmarks.

Final runtime

| Bench | Seconds | % of total |
| --- | --- | --- |
| REDUCTION_NVBENCH | 175 | 14.92116 |
| PARQUET_WRITER_BENCH | 134.922 | 11.50396 |
| JOIN_BENCH | 134.391 | 11.45868 |
| ORC_WRITER_BENCH | 109.489 | 9.335446 |
| STRINGS_BENCH | 83.287 | 7.101364 |
| ORC_READER_BENCH | 81.172 | 6.921031 |
| CSV_WRITER_BENCH | 76.628 | 6.533593 |
| PARQUET_READER_BENCH | 68.16 | 5.811579 |
| JOIN_NVBENCH | 59.513 | 5.074303 |
| PARQUET_WRITER_CHUNKS_BENCH | 48.677 | 4.150385 |
| CSV_READER_BENCH | 36.325 | 3.097207 |
| STREAM_COMPACTION_NVBENCH | 29.748 | 2.536427 |
| QUANTILES_BENCH | 26.583 | 2.266567 |
| TEXT_BENCH | 13.615 | 1.160866 |
| CONTIGUOUS_SPLIT_BENCH | 13.076 | 1.114909 |
| TYPE_DISPATCHER_BENCH | 11.781 | 1.004493 |
| SORT_BENCH | 11.2 | 0.954954 |
| SEARCH_BENCH | 9.918 | 0.845646 |
| JSON_BENCH | 6.085 | 0.51883 |
| COLUMN_CONCAT_BENCH | 5.031 | 0.428962 |
| APPLY_BOOLEAN_MASK_BENCH | 4.512 | 0.38471 |
| MERGE_BENCH | 4.19 | 0.357255 |
| GROUPBY_BENCH | 3.296 | 0.281029 |
| BINARYOP_BENCH | 3.254 | 0.277448 |
| REDUCTION_BENCH | 2.73 | 0.23277 |
| SCATTER_BENCH | 2.149 | 0.183232 |
| AST_BENCH | 2.077 | 0.177093 |
| TRANSPOSE_BENCH | 2.068 | 0.176325 |
| GATHER_BENCH | 2.049 | 0.174705 |
| REPLACE_BENCH | 1.72 | 0.146654 |
| HASHING_BENCH | 1.667 | 0.142135 |
| COPY_IF_ELSE_BENCH | 1.62 | 0.138127 |
| SHIFT_BENCH | 1.596 | 0.136081 |
| SCATTER_LISTS_BENCH | 1.467 | 0.125082 |
| FILL_BENCH | 1.33 | 0.113401 |
| ITERATOR_BENCH | 1.33 | 0.113401 |
| NULLMASK_BENCH | 1.175 | 0.100185 |
| Total | 1172.831 | 100 |

rapids-bot bot pushed a commit that referenced this issue Mar 28, 2022
Addresses part of #5773
uses `create_random_table` and moves benchmark input generation in device in reduction nvbench

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Christopher Harris (https://github.com/cwharris)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #10486
@GregoryKimball
Contributor

I suggest we close this due to the great progress in #10109, #10277, #10278, #10279, #10280, #10281, #10486, #10677
