Fix fallback to sort aggregation for grouping only hash aggregate #9891
Conversation
Trying to figure out why this failure:

I ran:

That's not a cpp style issue. It's a cmake issue; maybe you need to check CMakeLists.txt to remove spaces etc. (or run cmake-format).
Codecov Report

```diff
@@               Coverage Diff                @@
##           branch-22.02    #9891      +/-   ##
================================================
- Coverage         10.49%   10.42%    -0.07%
================================================
  Files               119      119
  Lines             20305    20479     +174
================================================
+ Hits               2130     2134       +4
- Misses            18175    18345     +170
```

Continue to review full report at Codecov.
cpp/src/groupby/hash/groupby.cu
Outdated

```diff
@@ -647,12 +647,12 @@ bool can_use_hash_groupby(table_view const& keys, host_span<aggregation_request
   // Currently, structs are not supported in any of hash-based aggregations.
   // Therefore, if any request contains structs then we must fallback to sort-based aggregations.
   // TODO: Support structs in hash-based aggregations.
-  auto const has_struct =
+  auto const all_no_structs =
```
Keep the `has_structs` name and this should be a `std::any_of` for `r.values.type().id() == type_id::STRUCT`.
Done
I agree the logic needs to be fixed, but can you elaborate on what the use case is for a groupby with no aggregation requests?
This type of call is done when we are trying to compute a distinct aggregation. Spark implements the first aggregation for the distinct (the one finding the unique rows) by performing a group-by where the "keys" are both the requested set of keys (if the user specified any) and the distinct columns, so the group-by itself carries no aggregation requests.
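For concreteness, a grouping-only call through the libcudf API looks roughly like the sketch below. The `distinct_keys` helper is hypothetical (it is not in the PR); the `groupby` construction and the empty request vector mirror the benchmark loop described later in this PR.

```cpp
#include <cudf/column/column_view.hpp>
#include <cudf/groupby.hpp>
#include <cudf/table/table_view.hpp>

#include <memory>
#include <utility>
#include <vector>

// Hypothetical helper: issue a group-by purely to find the distinct key rows,
// i.e. with an empty set of aggregation requests.
std::pair<std::unique_ptr<cudf::table>, std::vector<cudf::groupby::aggregation_result>>
distinct_keys(cudf::column_view const& keys)
{
  cudf::groupby::groupby gb_obj(cudf::table_view({keys}));
  std::vector<cudf::groupby::aggregation_request> requests;  // intentionally empty
  // The first element of the returned pair holds the unique key rows.
  return gb_obj.aggregate(requests);
}
```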
Wouldn't you just use `drop_duplicates`? (cudf/cpp/include/cudf/stream_compaction.hpp, line 238 at d23bcb4)
There's a lot of extra machinery going on in a groupby, so doing this operation through a groupby with no aggregations could be inefficient, or this functionality could be removed, etc.
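For comparison, the stream-compaction route being suggested would look something like the sketch below; the `unique_keys` helper is hypothetical, and the exact `cudf::drop_duplicates` overload is assumed from the stream_compaction.hpp header referenced above and may differ between versions.

```cpp
#include <cudf/stream_compaction.hpp>
#include <cudf/table/table_view.hpp>
#include <cudf/types.hpp>

#include <memory>
#include <numeric>
#include <vector>

// Hypothetical helper: deduplicate the key columns directly instead of routing
// through a group-by that carries no aggregation requests.
std::unique_ptr<cudf::table> unique_keys(cudf::table_view const& keys)
{
  // Treat every column as part of the duplicate-detection key.
  std::vector<cudf::size_type> key_indices(keys.num_columns());
  std::iota(key_indices.begin(), key_indices.end(), 0);
  return cudf::drop_duplicates(keys, key_indices, cudf::duplicate_keep_option::KEEP_FIRST);
}
```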
cpp/src/groupby/hash/groupby.cu
Outdated

```cpp
  auto const has_structs =
    std::any_of(requests.begin(), requests.end(), [](aggregation_request const& r) {
```
Looking at this again, it is silly to traverse the list of `aggregation_request`s twice. Just put the struct check above:

```cpp
auto const all_hash_aggregations =
  std::all_of(requests.begin(), requests.end(), [](aggregation_request const& r) {
    return not (r.values.type().id() == type_id::STRUCT)
       and cudf::has_atomic_support(r.values.type()) and
       std::all_of(r.aggregations.begin(), r.aggregations.end(), [](auto const& a) {
         return is_hash_aggregation(a->kind);
       });
  });
```
sure thing, will push a patch shortly
Wait, I think this needs to be:

```cpp
auto const all_hash_aggregations = requests.empty() or
  std::all_of(requests.begin(), requests.end(), [](aggregation_request const& r) {
    return not (r.values.type().id() == type_id::STRUCT)
       and cudf::has_atomic_support(r.values.type()) and
       std::all_of(r.aggregations.begin(), r.aggregations.end(), [](auto const& a) {
         return is_hash_aggregation(a->kind);
       });
  });
```
`std::all_of` returns true if the range is empty.
`all_of` already returns true for an empty range.
oh that's right :)
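As a quick standalone illustration of that point (not part of the PR), the predicate is never invoked for an empty range, so `std::all_of` is vacuously true:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

int main()
{
  std::vector<int> requests;  // stand-in for an empty list of aggregation requests
  // With no elements the lambda is never called, so all_of returns true,
  // which is why the extra `requests.empty() or` clause is unnecessary.
  assert(std::all_of(requests.begin(), requests.end(), [](int) { return false; }));
  return 0;
}
```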
`drop_duplicates` performs a stable sort, which seems much more expensive in practice than the hash-based approach. The large perf regression caused by accidentally falling back to a sort-based aggregation implies this as well.
Agreed, but I want to fix that with this: #9413. I'm honestly just surprised a hash groupby without any aggregations works. A hash-based `drop_duplicates` should be faster still.
lgtm 👍
@gpucibot merge
Fix fallback to sort aggregation for grouping only hash aggregate (rapidsai#9891)

The following fixes what looks like an unintended fallback to sort aggregate introduced in rapidsai#9545 for a grouping only (no aggregation request) case.

In the PR, the `std::all_of` function is used to determine whether the aggregation requests would be for struct types. That said, when there are no aggregation requests the `std::all_of` function will return true, causing a fallback to the sort aggregation (relevant code: https://github.com/rapidsai/cudf/pull/9545/files#diff-e409f72ddc11ad10fa0099e21b409b92f12bfac8ba1817266696c34a620aa081R645-R650).

I added a benchmark `group_no_requests_benchmark.cu` by mostly copying `group_sum_benchmark.cu`, but I changed one critical part. I am re-creating the `groupby` object for each `state`:

```
for (auto _ : state) {
  cuda_event_timer timer(state, true);
  cudf::groupby::groupby gb_obj(cudf::table_view({keys}));
  auto result = gb_obj.aggregate(requests);
}
```

This shows what would happen in the scenario where the `groupby` instance is created each time an aggregate is issued, which would re-create the `helper` each time for the sorted case. If the `groupby` object is not recreated each time, the difference in performance between the before/after cases is negligible. We never recycle a `groupby` instance when using the groupby API from Spark.

Posting this as draft for feedback as I am not sure if I handled the benchmark part correctly.

This was executed on a T4 GPU.

Before the patch:

```
Groupby/BasicNoRequest/10000/manual_time            0.158 ms   0.184 ms   4420
Groupby/BasicNoRequest/1000000/manual_time           1.72 ms    1.74 ms    408
Groupby/BasicNoRequest/10000000/manual_time          18.9 ms    18.9 ms     37
Groupby/BasicNoRequest/100000000/manual_time          198 ms     198 ms      3
```

<details>
<summary>Full output</summary>
<p>

```
2021-12-12T13:41:08+00:00
Running ./GROUPBY_BENCH
Run on (64 X 2801.89 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x32)
  L1 Instruction 32 KiB (x32)
  L2 Unified 1024 KiB (x32)
  L3 Unified 22528 KiB (x2)
Load Average: 1.01, 0.78, 0.42
--------------------------------------------------------------------------------------------
Benchmark                                               Time       CPU   Iterations
--------------------------------------------------------------------------------------------
Groupby/Basic/10000/manual_time                     0.117 ms  0.143 ms         5906
Groupby/Basic/1000000/manual_time                   0.524 ms  0.542 ms         1352
Groupby/Basic/10000000/manual_time                   4.41 ms   4.43 ms          159
Groupby/Basic/100000000/manual_time                  50.1 ms   50.1 ms           13
Groupby/PreSorted/1000000/manual_time               0.332 ms  0.350 ms         2118
Groupby/PreSorted/10000000/manual_time               2.22 ms   2.23 ms          315
Groupby/PreSorted/100000000/manual_time              22.2 ms   22.2 ms           30
Groupby/PreSortedNth/1000000/manual_time            0.160 ms  0.188 ms         4381
Groupby/PreSortedNth/10000000/manual_time           0.890 ms  0.917 ms          774
Groupby/PreSortedNth/100000000/manual_time           8.43 ms   8.46 ms           68
Groupby/Shift/1000000/manual_time                   0.764 ms  0.785 ms          902
Groupby/Shift/10000000/manual_time                   9.51 ms   9.53 ms           63
Groupby/Shift/100000000/manual_time                   145 ms    145 ms            4
Groupby/Aggregation/10000/manual_time                1.56 ms   1.58 ms          442
Groupby/Aggregation/16384/manual_time                1.59 ms   1.62 ms          435
Groupby/Aggregation/65536/manual_time                1.73 ms   1.76 ms          404
Groupby/Aggregation/262144/manual_time               2.95 ms   2.98 ms          237
Groupby/Aggregation/1048576/manual_time              9.20 ms   9.23 ms           73
Groupby/Aggregation/4194304/manual_time              36.3 ms   36.3 ms           19
Groupby/Aggregation/10000000/manual_time             92.0 ms   92.1 ms            7
Groupby/Scan/10000/manual_time                       1.56 ms   1.58 ms          447
Groupby/Scan/16384/manual_time                       1.62 ms   1.65 ms          429
Groupby/Scan/65536/manual_time                       1.85 ms   1.88 ms          378
Groupby/Scan/262144/manual_time                      3.54 ms   3.56 ms          197
Groupby/Scan/1048576/manual_time                     12.0 ms   12.0 ms           57
Groupby/Scan/4194304/manual_time                     48.6 ms   48.6 ms           14
Groupby/Scan/10000000/manual_time                     126 ms    126 ms            4
Groupby/BasicNoRequest/10000/manual_time            0.158 ms  0.184 ms         4420
Groupby/BasicNoRequest/1000000/manual_time           1.72 ms   1.74 ms          408
Groupby/BasicNoRequest/10000000/manual_time          18.9 ms   18.9 ms           37
Groupby/BasicNoRequest/100000000/manual_time          198 ms    198 ms            3
Groupby/PreSortedNoRequests/1000000/manual_time     0.194 ms  0.214 ms         3624
Groupby/PreSortedNoRequests/10000000/manual_time     1.25 ms   1.27 ms          571
Groupby/PreSortedNoRequests/100000000/manual_time    12.6 ms   12.7 ms           50
```

</p>
</details>

After the patch:

```
Groupby/BasicNoRequest/10000/manual_time            0.058 ms   0.085 ms  11991
Groupby/BasicNoRequest/1000000/manual_time          0.282 ms   0.301 ms   2478
Groupby/BasicNoRequest/10000000/manual_time          2.42 ms    2.44 ms    291
Groupby/BasicNoRequest/100000000/manual_time         29.2 ms    29.2 ms     21
```

<details>
<summary>Full output</summary>
<p>

```
2021-12-12T13:37:50+00:00
Running ./GROUPBY_BENCH
Run on (64 X 2654.22 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x32)
  L1 Instruction 32 KiB (x32)
  L2 Unified 1024 KiB (x32)
  L3 Unified 22528 KiB (x2)
Load Average: 0.64, 0.50, 0.26
--------------------------------------------------------------------------------------------
Benchmark                                               Time       CPU   Iterations
--------------------------------------------------------------------------------------------
Groupby/Basic/10000/manual_time                     0.116 ms  0.142 ms         5918
Groupby/Basic/1000000/manual_time                   0.523 ms  0.542 ms         1374
Groupby/Basic/10000000/manual_time                   4.37 ms   4.39 ms          162
Groupby/Basic/100000000/manual_time                  51.4 ms   51.5 ms           10
Groupby/PreSorted/1000000/manual_time               0.331 ms  0.350 ms         2121
Groupby/PreSorted/10000000/manual_time               2.21 ms   2.23 ms          316
Groupby/PreSorted/100000000/manual_time              22.2 ms   22.2 ms           27
Groupby/PreSortedNth/1000000/manual_time            0.160 ms  0.188 ms         4384
Groupby/PreSortedNth/10000000/manual_time           0.888 ms  0.915 ms          775
Groupby/PreSortedNth/100000000/manual_time           8.36 ms   8.39 ms           70
Groupby/Shift/1000000/manual_time                   0.764 ms  0.785 ms          904
Groupby/Shift/10000000/manual_time                   9.50 ms   9.52 ms           63
Groupby/Shift/100000000/manual_time                   146 ms    146 ms            4
Groupby/Aggregation/10000/manual_time                1.53 ms   1.55 ms          446
Groupby/Aggregation/16384/manual_time                1.58 ms   1.61 ms          437
Groupby/Aggregation/65536/manual_time                1.72 ms   1.75 ms          405
Groupby/Aggregation/262144/manual_time               2.93 ms   2.96 ms          236
Groupby/Aggregation/1048576/manual_time              9.18 ms   9.21 ms           74
Groupby/Aggregation/4194304/manual_time              36.2 ms   36.3 ms           19
Groupby/Aggregation/10000000/manual_time             91.5 ms   91.6 ms            7
Groupby/Scan/10000/manual_time                       1.55 ms   1.57 ms          452
Groupby/Scan/16384/manual_time                       1.60 ms   1.62 ms          434
Groupby/Scan/65536/manual_time                       1.84 ms   1.87 ms          379
Groupby/Scan/262144/manual_time                      3.54 ms   3.56 ms          198
Groupby/Scan/1048576/manual_time                     12.0 ms   12.0 ms           57
Groupby/Scan/4194304/manual_time                     48.4 ms   48.4 ms           14
Groupby/Scan/10000000/manual_time                     125 ms    125 ms            4
Groupby/BasicNoRequest/10000/manual_time            0.058 ms  0.085 ms        11991
Groupby/BasicNoRequest/1000000/manual_time          0.282 ms  0.301 ms         2478
Groupby/BasicNoRequest/10000000/manual_time          2.42 ms   2.44 ms          291
Groupby/BasicNoRequest/100000000/manual_time         29.2 ms   29.2 ms           21
Groupby/PreSortedNoRequests/1000000/manual_time     0.195 ms  0.215 ms         3604
Groupby/PreSortedNoRequests/10000000/manual_time     1.25 ms   1.27 ms          575
Groupby/PreSortedNoRequests/100000000/manual_time    12.7 ms   12.8 ms           50
```

</p>
</details>

Authors:
  - Alessandro Bellina (https://github.com/abellina)

Approvers:
  - Jake Hemstad (https://github.com/jrhemstad)
  - Nghia Truong (https://github.com/ttnghia)
  - Conor Hoekstra (https://github.com/codereport)

URL: rapidsai#9891
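Putting the review thread together, the single-pass check presumably ends up shaped like the sketch below. This is based on the suggestions quoted above, not a verbatim copy of the merged groupby.cu.

```cpp
// Sketch: one pass over the requests; vacuously true when there are no
// aggregation requests, so grouping-only aggregates stay on the hash path.
auto const all_hash_aggregations =
  std::all_of(requests.begin(), requests.end(), [](aggregation_request const& r) {
    return r.values.type().id() != type_id::STRUCT and
           cudf::has_atomic_support(r.values.type()) and
           std::all_of(r.aggregations.begin(), r.aggregations.end(),
                       [](auto const& a) { return is_hash_aggregation(a->kind); });
  });
```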