
Fix fallback to sort aggregation for grouping only hash aggregate #9891

Merged
merged 7 commits into rapidsai:branch-22.02 on Dec 13, 2021

Conversation

abellina
Contributor

@abellina abellina commented Dec 12, 2021

The following fixes what looks like an unintended fallback to sort aggregate introduced in #9545 for a grouping only (no aggregation request) case.

In that PR, the std::all_of function is used to determine whether all of the aggregation requests are for struct types. However, when there are no aggregation requests, std::all_of is vacuously true for the empty range, causing a fallback to the sort aggregation (relevant code: https://github.com/rapidsai/cudf/pull/9545/files#diff-e409f72ddc11ad10fa0099e21b409b92f12bfac8ba1817266696c34a620aa081R645-R650).

I added a benchmark, `group_no_requests_benchmark.cu`, mostly by copying `group_sum_benchmark.cu`, but with one critical change: I am re-creating the `groupby` object on each benchmark iteration:

  for (auto _ : state) {
    cuda_event_timer timer(state, true);
    cudf::groupby::groupby gb_obj(cudf::table_view({keys}));
    auto result = gb_obj.aggregate(requests);
  }

This shows what would happen in the scenario where the groupby instance is created each time an aggregate is issued, which would re-create the helper each time for the sorted case.

If the groupby object is not recreated each time, the difference in performance between the before/after cases is negligible. We never recycle a groupby instance when using the groupby API from Spark.

Posting this as draft for feedback as I am not sure if I handled the benchmark part correctly.

This was executed on a T4 GPU.

Before the patch:

Groupby/BasicNoRequest/10000/manual_time               0.158 ms        0.184 ms         4420
Groupby/BasicNoRequest/1000000/manual_time              1.72 ms         1.74 ms          408
Groupby/BasicNoRequest/10000000/manual_time             18.9 ms         18.9 ms           37
Groupby/BasicNoRequest/100000000/manual_time             198 ms          198 ms            3
Full output

2021-12-12T13:41:08+00:00
Running ./GROUPBY_BENCH
Run on (64 X 2801.89 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x32)
  L1 Instruction 32 KiB (x32)
  L2 Unified 1024 KiB (x32)
  L3 Unified 22528 KiB (x2)
Load Average: 1.01, 0.78, 0.42
--------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations
--------------------------------------------------------------------------------------------
Groupby/Basic/10000/manual_time                        0.117 ms        0.143 ms         5906
Groupby/Basic/1000000/manual_time                      0.524 ms        0.542 ms         1352
Groupby/Basic/10000000/manual_time                      4.41 ms         4.43 ms          159
Groupby/Basic/100000000/manual_time                     50.1 ms         50.1 ms           13
Groupby/PreSorted/1000000/manual_time                  0.332 ms        0.350 ms         2118
Groupby/PreSorted/10000000/manual_time                  2.22 ms         2.23 ms          315
Groupby/PreSorted/100000000/manual_time                 22.2 ms         22.2 ms           30
Groupby/PreSortedNth/1000000/manual_time               0.160 ms        0.188 ms         4381
Groupby/PreSortedNth/10000000/manual_time              0.890 ms        0.917 ms          774
Groupby/PreSortedNth/100000000/manual_time              8.43 ms         8.46 ms           68
Groupby/Shift/1000000/manual_time                      0.764 ms        0.785 ms          902
Groupby/Shift/10000000/manual_time                      9.51 ms         9.53 ms           63
Groupby/Shift/100000000/manual_time                      145 ms          145 ms            4
Groupby/Aggregation/10000/manual_time                   1.56 ms         1.58 ms          442
Groupby/Aggregation/16384/manual_time                   1.59 ms         1.62 ms          435
Groupby/Aggregation/65536/manual_time                   1.73 ms         1.76 ms          404
Groupby/Aggregation/262144/manual_time                  2.95 ms         2.98 ms          237
Groupby/Aggregation/1048576/manual_time                 9.20 ms         9.23 ms           73
Groupby/Aggregation/4194304/manual_time                 36.3 ms         36.3 ms           19
Groupby/Aggregation/10000000/manual_time                92.0 ms         92.1 ms            7
Groupby/Scan/10000/manual_time                          1.56 ms         1.58 ms          447
Groupby/Scan/16384/manual_time                          1.62 ms         1.65 ms          429
Groupby/Scan/65536/manual_time                          1.85 ms         1.88 ms          378
Groupby/Scan/262144/manual_time                         3.54 ms         3.56 ms          197
Groupby/Scan/1048576/manual_time                        12.0 ms         12.0 ms           57
Groupby/Scan/4194304/manual_time                        48.6 ms         48.6 ms           14
Groupby/Scan/10000000/manual_time                        126 ms          126 ms            4
Groupby/BasicNoRequest/10000/manual_time               0.158 ms        0.184 ms         4420
Groupby/BasicNoRequest/1000000/manual_time              1.72 ms         1.74 ms          408
Groupby/BasicNoRequest/10000000/manual_time             18.9 ms         18.9 ms           37
Groupby/BasicNoRequest/100000000/manual_time             198 ms          198 ms            3
Groupby/PreSortedNoRequests/1000000/manual_time        0.194 ms        0.214 ms         3624
Groupby/PreSortedNoRequests/10000000/manual_time        1.25 ms         1.27 ms          571
Groupby/PreSortedNoRequests/100000000/manual_time       12.6 ms         12.7 ms           50

After the patch:

Groupby/BasicNoRequest/10000/manual_time               0.058 ms        0.085 ms        11991
Groupby/BasicNoRequest/1000000/manual_time             0.282 ms        0.301 ms         2478
Groupby/BasicNoRequest/10000000/manual_time             2.42 ms         2.44 ms          291
Groupby/BasicNoRequest/100000000/manual_time            29.2 ms         29.2 ms           21
Full output

2021-12-12T13:37:50+00:00
Running ./GROUPBY_BENCH
Run on (64 X 2654.22 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x32)
  L1 Instruction 32 KiB (x32)
  L2 Unified 1024 KiB (x32)
  L3 Unified 22528 KiB (x2)
Load Average: 0.64, 0.50, 0.26
--------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations
--------------------------------------------------------------------------------------------
Groupby/Basic/10000/manual_time                        0.116 ms        0.142 ms         5918
Groupby/Basic/1000000/manual_time                      0.523 ms        0.542 ms         1374
Groupby/Basic/10000000/manual_time                      4.37 ms         4.39 ms          162
Groupby/Basic/100000000/manual_time                     51.4 ms         51.5 ms           10
Groupby/PreSorted/1000000/manual_time                  0.331 ms        0.350 ms         2121
Groupby/PreSorted/10000000/manual_time                  2.21 ms         2.23 ms          316
Groupby/PreSorted/100000000/manual_time                 22.2 ms         22.2 ms           27
Groupby/PreSortedNth/1000000/manual_time               0.160 ms        0.188 ms         4384
Groupby/PreSortedNth/10000000/manual_time              0.888 ms        0.915 ms          775
Groupby/PreSortedNth/100000000/manual_time              8.36 ms         8.39 ms           70
Groupby/Shift/1000000/manual_time                      0.764 ms        0.785 ms          904
Groupby/Shift/10000000/manual_time                      9.50 ms         9.52 ms           63
Groupby/Shift/100000000/manual_time                      146 ms          146 ms            4
Groupby/Aggregation/10000/manual_time                   1.53 ms         1.55 ms          446
Groupby/Aggregation/16384/manual_time                   1.58 ms         1.61 ms          437
Groupby/Aggregation/65536/manual_time                   1.72 ms         1.75 ms          405
Groupby/Aggregation/262144/manual_time                  2.93 ms         2.96 ms          236
Groupby/Aggregation/1048576/manual_time                 9.18 ms         9.21 ms           74
Groupby/Aggregation/4194304/manual_time                 36.2 ms         36.3 ms           19
Groupby/Aggregation/10000000/manual_time                91.5 ms         91.6 ms            7
Groupby/Scan/10000/manual_time                          1.55 ms         1.57 ms          452
Groupby/Scan/16384/manual_time                          1.60 ms         1.62 ms          434
Groupby/Scan/65536/manual_time                          1.84 ms         1.87 ms          379
Groupby/Scan/262144/manual_time                         3.54 ms         3.56 ms          198
Groupby/Scan/1048576/manual_time                        12.0 ms         12.0 ms           57
Groupby/Scan/4194304/manual_time                        48.4 ms         48.4 ms           14
Groupby/Scan/10000000/manual_time                        125 ms          125 ms            4
Groupby/BasicNoRequest/10000/manual_time               0.058 ms        0.085 ms        11991
Groupby/BasicNoRequest/1000000/manual_time             0.282 ms        0.301 ms         2478
Groupby/BasicNoRequest/10000000/manual_time             2.42 ms         2.44 ms          291
Groupby/BasicNoRequest/100000000/manual_time            29.2 ms         29.2 ms           21
Groupby/PreSortedNoRequests/1000000/manual_time        0.195 ms        0.215 ms         3604
Groupby/PreSortedNoRequests/10000000/manual_time        1.25 ms         1.27 ms          575
Groupby/PreSortedNoRequests/100000000/manual_time       12.7 ms         12.8 ms           50

@abellina abellina requested a review from ttnghia December 12, 2021 13:49
@github-actions github-actions bot added CMake CMake build issue libcudf Affects libcudf (C++/CUDA) code. labels Dec 12, 2021
@abellina abellina removed the request for review from ttnghia December 12, 2021 13:49
@abellina abellina added bug Something isn't working non-breaking Non-breaking change Performance Performance related issue labels Dec 12, 2021
@abellina
Contributor Author

Trying to figure out the cause of this failure:

>>>> PASSED: clang format check
/bin/bash: warning: setlocale: LC_ALL: cannot change locale (C.UTF-8)
>>>> PASSED: libcudf header existence conda/recipes/libcudf/meta.yaml check
>>>> PASSED: libcudf header existence conda/recipes/libcudf/meta.yaml check
Build step 'Execute shell' marked build as failure
[Set GitHub commit status (universal)] ERROR on repos [GHRepository@6e42889e[nodeId=MDEwOlJlcG9zaXRvcnk5MDUwNjkxOA==,description=cuDF - GPU DataFrame Library ,homepage=http://rapids.ai,name=cudf,fork=false,archived=false,size=93286,milestones={},language=C++,commits={},source=<null>,parent=<null>,isTemplate=false,url=https://api.github.com/repos/rapidsai/cudf,id=90506918,nodeId=<null>,createdAt=2017-05-07T03:43:37Z,updatedAt=2021-12-12T11:11:15Z]] (sha:10afade) with context:gpuCI/cudf/check/style

@abellina
Contributor Author

I ran python3 ./cpp/scripts/run-clang-format.py -inplace but saw no diff or errors, so I am not sure where the error above comes from.

@ttnghia
Contributor

ttnghia commented Dec 12, 2021

cmake-format.............................................................Failed

That's not a C++ style issue. It's a CMake issue; maybe you need to check CMakeLists.txt to remove extra spaces, etc. (or run cmake-format).

@codecov

codecov bot commented Dec 12, 2021

Codecov Report

Merging #9891 (5a5045a) into branch-22.02 (967a333) will decrease coverage by 0.06%.
The diff coverage is n/a.

❗ Current head 5a5045a differs from pull request most recent head b7d2107. Consider uploading reports for the commit b7d2107 to get more accurate results
Impacted file tree graph

@@               Coverage Diff                @@
##           branch-22.02    #9891      +/-   ##
================================================
- Coverage         10.49%   10.42%   -0.07%     
================================================
  Files               119      119              
  Lines             20305    20479     +174     
================================================
+ Hits               2130     2134       +4     
- Misses            18175    18345     +170     
Impacted Files Coverage Δ
python/dask_cudf/dask_cudf/sorting.py 92.30% <0.00%> (-0.61%) ⬇️
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/index.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/parquet.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/series.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/utils.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/ioutils.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/dataframe.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/multiindex.py 0.00% <0.00%> (ø)
... and 9 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e581734...b7d2107. Read the comment docs.

@@ -647,12 +647,12 @@ bool can_use_hash_groupby(table_view const& keys, host_span<aggregation_request
// Currently, structs are not supported in any of hash-based aggregations.
// Therefore, if any request contains structs then we must fallback to sort-based aggregations.
// TODO: Support structs in hash-based aggregations.
-auto const has_struct =
+auto const all_no_structs =
Contributor

@jrhemstad jrhemstad Dec 12, 2021


Keep the has_structs name and this should be a std::any_of for r.values.type().id() == type_id::STRUCT.

Contributor Author


Done

@jrhemstad
Contributor

a grouping only (no aggregation request)

I agree the logic needs to be fixed, but can you elaborate on what the use case is for a groupby::aggregate without an aggregation? I wasn't aware this is something that was even supported.

@abellina
Contributor Author

abellina commented Dec 13, 2021

can you elaborate on what the use case is for a groupby::aggregate without an aggregation? I wasn't aware this is something that was even supported.

This type of call is done when we are trying to compute a DISTINCT aggregation.

Spark implements the first aggregation for the distinct (the one finding the unique rows) by performing a group-by where the "keys" are both the requested set of keys (if the user specified GROUP BY) and the columns being aggregated (select count(distinct foo) would add foo to the keys). The product of this first group-by without aggregate functions is the set of unique rows we can then use to perform an aggregate (like count, sum, etc.).

Take an example such as select count(distinct my_col) from my_table. This has no GROUP BY clause, but it boils down to a group-by aggregate in cuDF terms:

  1. Perform a group-by with my_col as the key. Do not request a count aggregation yet as we would be counting non-distinct. This is the call that sees no aggregation requests to cuDF.
  2. Partition + shuffle so we can distribute this and reduce.
  3. Perform an aggregate where we count these unique rows. This is a reduction for this example, but in the case where the user specified a GROUP BY it would be another group-by aggregate with a count aggregate request, since we'd care about the grouping key.
  4. Results are merged together for each individual batch; that would be another reduction with sum in this case (or a group-by with a sum aggregation request in the GROUP BY case).

@abellina abellina marked this pull request as ready for review December 13, 2021 14:13
@abellina abellina requested a review from a team as a code owner December 13, 2021 14:13
@jrhemstad
Contributor

This type of call is done when we are trying to compute a DISTINCT aggregation.

Wouldn't you just use drop_duplicates to get the distinct keys?

std::unique_ptr<table> drop_duplicates(

There's a lot of extra machinery going on in a groupby, so doing this operation through a groupby with no aggregations is possibly inefficient, and this functionality could even be removed in the future.

Comment on lines 650 to 651
auto const has_structs =
std::any_of(requests.begin(), requests.end(), [](aggregation_request const& r) {
Contributor


Looking at this again, it is silly to traverse the list of aggregation_requests twice. Just put the struct check above:


  auto const all_hash_aggregations =
    std::all_of(requests.begin(), requests.end(), [](aggregation_request const& r) {
      // Note: `not x == y` parses as `(not x) == y`; use `!=` for the intended check.
      return r.values.type().id() != type_id::STRUCT
             and cudf::has_atomic_support(r.values.type()) and
             std::all_of(r.aggregations.begin(), r.aggregations.end(), [](auto const& a) {
               return is_hash_aggregation(a->kind);
             });
    });

Contributor Author


sure thing, will push a patch shortly

Contributor Author


wait I think this needs to be:

  auto const all_hash_aggregations = requests.empty() or
    std::all_of(requests.begin(), requests.end(), [](aggregation_request const& r) {
      return r.values.type().id() != type_id::STRUCT
             and cudf::has_atomic_support(r.values.type()) and
             std::all_of(r.aggregations.begin(), r.aggregations.end(), [](auto const& a) {
               return is_hash_aggregation(a->kind);
             });
    });

Member


std::all_of returns true if the range is empty.

Contributor


all_of already returns true for an empty range.

Contributor Author


oh that's right :)

@jlowe
Member

jlowe commented Dec 13, 2021

Wouldn't you just use drop_duplicates to get the distinct keys?

drop_duplicates performs a stable sort which seems much more expensive in practice than the hash-based approach. The large perf regression caused by accidentally falling back to a sort-based aggregation implies this as well.

@jrhemstad
Contributor

drop_duplicates performs a stable sort which seems much more expensive in practice than the hash-based approach. The large perf regression caused by accidentally falling back to a sort-based aggregation implies this as well.

Agreed, but I want to fix that with this: #9413

I'm honestly just surprised a hash groupby without any aggregations works. A hash-based drop_duplicates should be faster still.

Contributor

@codereport codereport left a comment


lgtm 👍

@abellina
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 335862b into rapidsai:branch-22.02 Dec 13, 2021
@abellina abellina deleted the fix_sort_agg_fallback branch December 13, 2021 21:08
abellina added a commit to abellina/cudf that referenced this pull request Dec 14, 2021

Authors:
  - Alessandro Bellina (https://github.com/abellina)

Approvers:
  - Jake Hemstad (https://github.com/jrhemstad)
  - Nghia Truong (https://github.com/ttnghia)
  - Conor Hoekstra (https://github.com/codereport)

URL: rapidsai#9891
raydouglass pushed a commit that referenced this pull request Dec 16, 2021
) (#9898)

URL: #9891
Labels
bug Something isn't working CMake CMake build issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Performance Performance related issue
5 participants