
Fix fallback to sort aggregation for grouping only hash aggregate #9891

Merged
merged 7 commits into rapidsai:branch-22.02 on Dec 13, 2021

Conversation

abellina
Contributor

@abellina abellina commented Dec 12, 2021

The following fixes what looks like an unintended fallback to sort aggregate introduced in #9545 for a grouping only (no aggregation request) case.

In that PR, the std::all_of function is used to determine whether all of the aggregation requests are for struct types. However, when there are no aggregation requests, std::all_of is vacuously true for the empty range, causing a fallback to the sort aggregation (relevant code: https://github.com/rapidsai/cudf/pull/9545/files#diff-e409f72ddc11ad10fa0099e21b409b92f12bfac8ba1817266696c34a620aa081R645-R650).

I added a benchmark, `group_no_requests_benchmark.cu`, mostly by copying `group_sum_benchmark.cu`, but with one critical change: I am re-creating the `groupby` object on each benchmark iteration:

  for (auto _ : state) {
    cuda_event_timer timer(state, true);
    cudf::groupby::groupby gb_obj(cudf::table_view({keys}));
    auto result = gb_obj.aggregate(requests);
  }

This shows what would happen in the scenario where the groupby instance is created each time an aggregate is issued, which would re-create the helper each time for the sorted case.

If the groupby object is not recreated each time, the difference in performance between the before/after cases is negligible. We never recycle a groupby instance when using the groupby API from Spark.

Posting this as draft for feedback as I am not sure if I handled the benchmark part correctly.

This was executed on a T4 GPU.

Before the patch:

Groupby/BasicNoRequest/10000/manual_time               0.158 ms        0.184 ms         4420
Groupby/BasicNoRequest/1000000/manual_time              1.72 ms         1.74 ms          408
Groupby/BasicNoRequest/10000000/manual_time             18.9 ms         18.9 ms           37
Groupby/BasicNoRequest/100000000/manual_time             198 ms          198 ms            3
Full output

2021-12-12T13:41:08+00:00
Running ./GROUPBY_BENCH
Run on (64 X 2801.89 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x32)
  L1 Instruction 32 KiB (x32)
  L2 Unified 1024 KiB (x32)
  L3 Unified 22528 KiB (x2)
Load Average: 1.01, 0.78, 0.42
--------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations
--------------------------------------------------------------------------------------------
Groupby/Basic/10000/manual_time                        0.117 ms        0.143 ms         5906
Groupby/Basic/1000000/manual_time                      0.524 ms        0.542 ms         1352
Groupby/Basic/10000000/manual_time                      4.41 ms         4.43 ms          159
Groupby/Basic/100000000/manual_time                     50.1 ms         50.1 ms           13
Groupby/PreSorted/1000000/manual_time                  0.332 ms        0.350 ms         2118
Groupby/PreSorted/10000000/manual_time                  2.22 ms         2.23 ms          315
Groupby/PreSorted/100000000/manual_time                 22.2 ms         22.2 ms           30
Groupby/PreSortedNth/1000000/manual_time               0.160 ms        0.188 ms         4381
Groupby/PreSortedNth/10000000/manual_time              0.890 ms        0.917 ms          774
Groupby/PreSortedNth/100000000/manual_time              8.43 ms         8.46 ms           68
Groupby/Shift/1000000/manual_time                      0.764 ms        0.785 ms          902
Groupby/Shift/10000000/manual_time                      9.51 ms         9.53 ms           63
Groupby/Shift/100000000/manual_time                      145 ms          145 ms            4
Groupby/Aggregation/10000/manual_time                   1.56 ms         1.58 ms          442
Groupby/Aggregation/16384/manual_time                   1.59 ms         1.62 ms          435
Groupby/Aggregation/65536/manual_time                   1.73 ms         1.76 ms          404
Groupby/Aggregation/262144/manual_time                  2.95 ms         2.98 ms          237
Groupby/Aggregation/1048576/manual_time                 9.20 ms         9.23 ms           73
Groupby/Aggregation/4194304/manual_time                 36.3 ms         36.3 ms           19
Groupby/Aggregation/10000000/manual_time                92.0 ms         92.1 ms            7
Groupby/Scan/10000/manual_time                          1.56 ms         1.58 ms          447
Groupby/Scan/16384/manual_time                          1.62 ms         1.65 ms          429
Groupby/Scan/65536/manual_time                          1.85 ms         1.88 ms          378
Groupby/Scan/262144/manual_time                         3.54 ms         3.56 ms          197
Groupby/Scan/1048576/manual_time                        12.0 ms         12.0 ms           57
Groupby/Scan/4194304/manual_time                        48.6 ms         48.6 ms           14
Groupby/Scan/10000000/manual_time                        126 ms          126 ms            4
Groupby/BasicNoRequest/10000/manual_time               0.158 ms        0.184 ms         4420
Groupby/BasicNoRequest/1000000/manual_time              1.72 ms         1.74 ms          408
Groupby/BasicNoRequest/10000000/manual_time             18.9 ms         18.9 ms           37
Groupby/BasicNoRequest/100000000/manual_time             198 ms          198 ms            3
Groupby/PreSortedNoRequests/1000000/manual_time        0.194 ms        0.214 ms         3624
Groupby/PreSortedNoRequests/10000000/manual_time        1.25 ms         1.27 ms          571
Groupby/PreSortedNoRequests/100000000/manual_time       12.6 ms         12.7 ms           50

After the patch:

Groupby/BasicNoRequest/10000/manual_time               0.058 ms        0.085 ms        11991
Groupby/BasicNoRequest/1000000/manual_time             0.282 ms        0.301 ms         2478
Groupby/BasicNoRequest/10000000/manual_time             2.42 ms         2.44 ms          291
Groupby/BasicNoRequest/100000000/manual_time            29.2 ms         29.2 ms           21
Full output

2021-12-12T13:37:50+00:00
Running ./GROUPBY_BENCH
Run on (64 X 2654.22 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x32)
  L1 Instruction 32 KiB (x32)
  L2 Unified 1024 KiB (x32)
  L3 Unified 22528 KiB (x2)
Load Average: 0.64, 0.50, 0.26
--------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations
--------------------------------------------------------------------------------------------
Groupby/Basic/10000/manual_time                        0.116 ms        0.142 ms         5918
Groupby/Basic/1000000/manual_time                      0.523 ms        0.542 ms         1374
Groupby/Basic/10000000/manual_time                      4.37 ms         4.39 ms          162
Groupby/Basic/100000000/manual_time                     51.4 ms         51.5 ms           10
Groupby/PreSorted/1000000/manual_time                  0.331 ms        0.350 ms         2121
Groupby/PreSorted/10000000/manual_time                  2.21 ms         2.23 ms          316
Groupby/PreSorted/100000000/manual_time                 22.2 ms         22.2 ms           27
Groupby/PreSortedNth/1000000/manual_time               0.160 ms        0.188 ms         4384
Groupby/PreSortedNth/10000000/manual_time              0.888 ms        0.915 ms          775
Groupby/PreSortedNth/100000000/manual_time              8.36 ms         8.39 ms           70
Groupby/Shift/1000000/manual_time                      0.764 ms        0.785 ms          904
Groupby/Shift/10000000/manual_time                      9.50 ms         9.52 ms           63
Groupby/Shift/100000000/manual_time                      146 ms          146 ms            4
Groupby/Aggregation/10000/manual_time                   1.53 ms         1.55 ms          446
Groupby/Aggregation/16384/manual_time                   1.58 ms         1.61 ms          437
Groupby/Aggregation/65536/manual_time                   1.72 ms         1.75 ms          405
Groupby/Aggregation/262144/manual_time                  2.93 ms         2.96 ms          236
Groupby/Aggregation/1048576/manual_time                 9.18 ms         9.21 ms           74
Groupby/Aggregation/4194304/manual_time                 36.2 ms         36.3 ms           19
Groupby/Aggregation/10000000/manual_time                91.5 ms         91.6 ms            7
Groupby/Scan/10000/manual_time                          1.55 ms         1.57 ms          452
Groupby/Scan/16384/manual_time                          1.60 ms         1.62 ms          434
Groupby/Scan/65536/manual_time                          1.84 ms         1.87 ms          379
Groupby/Scan/262144/manual_time                         3.54 ms         3.56 ms          198
Groupby/Scan/1048576/manual_time                        12.0 ms         12.0 ms           57
Groupby/Scan/4194304/manual_time                        48.4 ms         48.4 ms           14
Groupby/Scan/10000000/manual_time                        125 ms          125 ms            4
Groupby/BasicNoRequest/10000/manual_time               0.058 ms        0.085 ms        11991
Groupby/BasicNoRequest/1000000/manual_time             0.282 ms        0.301 ms         2478
Groupby/BasicNoRequest/10000000/manual_time             2.42 ms         2.44 ms          291
Groupby/BasicNoRequest/100000000/manual_time            29.2 ms         29.2 ms           21
Groupby/PreSortedNoRequests/1000000/manual_time        0.195 ms        0.215 ms         3604
Groupby/PreSortedNoRequests/10000000/manual_time        1.25 ms         1.27 ms          575
Groupby/PreSortedNoRequests/100000000/manual_time       12.7 ms         12.8 ms           50

@abellina abellina requested a review from ttnghia December 12, 2021 13:49
@github-actions github-actions bot added CMake CMake build issue libcudf Affects libcudf (C++/CUDA) code. labels Dec 12, 2021
@abellina abellina removed the request for review from ttnghia December 12, 2021 13:49
@abellina abellina added bug Something isn't working non-breaking Non-breaking change Performance Performance related issue labels Dec 12, 2021
@abellina
Contributor Author

Trying to figure out the cause of this failure:

>>>> PASSED: clang format check
/bin/bash: warning: setlocale: LC_ALL: cannot change locale (C.UTF-8)
>>>> PASSED: libcudf header existence conda/recipes/libcudf/meta.yaml check
>>>> PASSED: libcudf header existence conda/recipes/libcudf/meta.yaml check
Build step 'Execute shell' marked build as failure
[Set GitHub commit status (universal)] ERROR on repos [GHRepository@6e42889e[nodeId=MDEwOlJlcG9zaXRvcnk5MDUwNjkxOA==,description=cuDF - GPU DataFrame Library ,homepage=http://rapids.ai,name=cudf,fork=false,archived=false,size=93286,milestones={},language=C++,commits={},source=<null>,parent=<null>,isTemplate=false,url=https://api.github.com/repos/rapidsai/cudf,id=90506918,nodeId=<null>,createdAt=2017-05-07T03:43:37Z,updatedAt=2021-12-12T11:11:15Z]] (sha:10afade) with context:gpuCI/cudf/check/style

@abellina
Contributor Author

I ran python3 ./cpp/scripts/run-clang-format.py -inplace but saw no diff or errors, so I am not sure where the error above comes from.

@ttnghia
Contributor

ttnghia commented Dec 12, 2021

cmake-format.............................................................Failed

That's not a C++ style issue. It's a CMake issue; maybe you need to check CMakeLists.txt to remove extra spaces, etc. (or run cmake-format).

@codecov

codecov bot commented Dec 12, 2021

Codecov Report

Merging #9891 (5a5045a) into branch-22.02 (967a333) will decrease coverage by 0.06%.
The diff coverage is n/a.

❗ Current head 5a5045a differs from pull request most recent head b7d2107. Consider uploading reports for the commit b7d2107 to get more accurate results
Impacted file tree graph

@@               Coverage Diff                @@
##           branch-22.02    #9891      +/-   ##
================================================
- Coverage         10.49%   10.42%   -0.07%     
================================================
  Files               119      119              
  Lines             20305    20479     +174     
================================================
+ Hits               2130     2134       +4     
- Misses            18175    18345     +170     
Impacted Files Coverage Δ
python/dask_cudf/dask_cudf/sorting.py 92.30% <0.00%> (-0.61%) ⬇️
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/index.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/parquet.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/series.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/utils.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/ioutils.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/dataframe.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/multiindex.py 0.00% <0.00%> (ø)
... and 9 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e581734...b7d2107. Read the comment docs.

@@ -647,12 +647,12 @@ bool can_use_hash_groupby(table_view const& keys, host_span<aggregation_request
// Currently, structs are not supported in any of hash-based aggregations.
// Therefore, if any request contains structs then we must fallback to sort-based aggregations.
// TODO: Support structs in hash-based aggregations.
-auto const has_struct =
+auto const all_no_structs =
Contributor

@jrhemstad jrhemstad Dec 12, 2021


Keep the has_structs name and this should be a std::any_of for r.values.type().id() == type_id::STRUCT.

Contributor Author


Done

@jrhemstad
Contributor

a grouping only (no aggregation request)

I agree the logic needs to be fixed, but can you elaborate on what the use case is for a groupby::aggregate without an aggregation? I wasn't aware this is something that was even supported.

@abellina
Contributor Author

abellina commented Dec 13, 2021

can you elaborate on what the use case is for a groupby::aggregate without an aggregation? I wasn't aware this is something that was even supported.

This type of call is done when we are trying to compute a DISTINCT aggregation.

Spark implements the first aggregation for the distinct (the one finding the unique rows) by performing a group-by where the "keys" are both the requested set of keys (if the user specified GROUP BY) and the columns being aggregated (select count(distinct foo) would add foo to the keys). The product of this first group-by without aggregate functions is the set of unique rows we can then use to perform an aggregate (like count, sum, etc.).

Take an example such as select count(distinct my_col) from my_table. This has no GROUP BY clause, but it boils down to a group-by aggregate in cuDF terms:

  1. Perform a group-by with my_col as the key. Do not request a count aggregation yet as we would be counting non-distinct. This is the call that sees no aggregation requests to cuDF.
  2. Partition + shuffle so we can distribute this and reduce.
  3. Perform an aggregate where we count these unique rows. This is a reduction for this example, but in the case where the user specified a GROUP BY it would be another group-by aggregate with a count aggregate request, since we'd care about the grouping key.
  4. Results are merged together for each individual batch; that would be another reduction with sum in this case (or a group-by with a sum aggregation request in the GROUP BY case).

@abellina abellina marked this pull request as ready for review December 13, 2021 14:13
@abellina abellina requested a review from a team as a code owner December 13, 2021 14:13
@jrhemstad
Contributor

This type of call is done when we are trying to compute a DISTINCT aggregation.

Wouldn't you just use drop_duplicates to get the distinct keys?

std::unique_ptr<table> drop_duplicates(

There's a lot of extra machinery going on in a groupby, so doing this operation through a groupby with no aggregations is possibly inefficient, and this functionality could even be removed in the future.

Comment on lines 650 to 651
auto const has_structs =
std::any_of(requests.begin(), requests.end(), [](aggregation_request const& r) {
Contributor


Looking at this again, it is silly to traverse the list of aggregation_requests twice. Just put the struct check above:


  auto const all_hash_aggregations =
    std::all_of(requests.begin(), requests.end(), [](aggregation_request const& r) {
      // Note: `not x == y` parses as `(not x) == y`; use `!=` for the intended check.
      return r.values.type().id() != type_id::STRUCT
             and cudf::has_atomic_support(r.values.type()) and
             std::all_of(r.aggregations.begin(), r.aggregations.end(), [](auto const& a) {
               return is_hash_aggregation(a->kind);
             });
    });

Contributor Author


sure thing, will push a patch shortly

Contributor Author


wait I think this needs to be:

  auto const all_hash_aggregations = requests.empty() or
    std::all_of(requests.begin(), requests.end(), [](aggregation_request const& r) {
      return r.values.type().id() != type_id::STRUCT
             and cudf::has_atomic_support(r.values.type()) and
             std::all_of(r.aggregations.begin(), r.aggregations.end(), [](auto const& a) {
               return is_hash_aggregation(a->kind);
             });
    });

Member


std::all_of returns true if the range is empty.

Contributor


all_of already returns true for an empty range.

Contributor Author


oh that's right :)

@jlowe
Member

jlowe commented Dec 13, 2021

Wouldn't you just use drop_duplicates to get the distinct keys?

drop_duplicates performs a stable sort which seems much more expensive in practice than the hash-based approach. The large perf regression caused by accidentally falling back to a sort-based aggregation implies this as well.

@jrhemstad
Contributor

drop_duplicates performs a stable sort which seems much more expensive in practice than the hash-based approach. The large perf regression caused by accidentally falling back to a sort-based aggregation implies this as well.

Agreed, but I want to fix that with this: #9413

I'm honestly just surprised a hash groupby without any aggregations works. A hash-based drop_duplicates should be faster still.

Contributor

@codereport codereport left a comment


lgtm 👍

@abellina
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 335862b into rapidsai:branch-22.02 Dec 13, 2021
@abellina abellina deleted the fix_sort_agg_fallback branch December 13, 2021 21:08
abellina added a commit to abellina/cudf that referenced this pull request Dec 14, 2021

Authors:
  - Alessandro Bellina (https://github.com/abellina)

Approvers:
  - Jake Hemstad (https://github.com/jrhemstad)
  - Nghia Truong (https://github.com/ttnghia)
  - Conor Hoekstra (https://github.com/codereport)

URL: rapidsai#9891
raydouglass pushed a commit that referenced this pull request Dec 16, 2021
) (#9898)

URL: #9891
Labels
bug Something isn't working CMake CMake build issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Performance Performance related issue
5 participants