-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] groupby aggregation falls back to sort aggregation for count on a string #10583
Comments
Based on my testing so far, we do need the has_atomic_support() check for decimal128 types. I am wondering if we should be checking the cudf::detail::target_type() for the aggregation, instead of the source? |
cc: @bdice who was looking at this recently |
I noticed this with @isVoid last week. We discussed again today and agreed that using the "target type" might be the right thing to do here. Implementation reference:
|
@jbrennan333 amazing debugging analysis. Kudos. |
I put up PR #10588 as a possible fix for this. |
Thanks @kuhushukla! @jlowe deserves credit as well for pointing me in the right direction. |
Closes #10583. Change the has_atomic_support check in can_use_hash_groupby() to check the target type for the aggregation instead of the source type. See discussion in #10583. I have verified that this fixes the performance regression in our customer queries, and all unit tests still pass. Authors: - Jim Brennan (https://github.com/jbrennan333) Approvers: - Jake Hemstad (https://github.com/jrhemstad) - Yunsong Wang (https://github.com/PointKernel) - Nghia Truong (https://github.com/ttnghia) URL: #10588
Describe the bug
Regression testing of a customer query for spark-rapids found that the query is >20% slower with release 22.02 compared to release 21.08. Investigation found that the slowdown is coming from hash aggregations that are falling back to sort aggregations. With some debugging, I found that the reason we were falling back was we were failing the
has_atomic_support()
check incan_use_hash_groupby()
:My debug prints in `can_use_hash_groupby()' while running the query.
JTB: no atomic support for type 23
JTB: has_struct=false, has_atomic_support=false, all_hash_aggregations=true
JTB: groupby: sorted = false, can_use_hash_groupby = false, _helper = false
Type 23 is STRING, so we were falling back for groupby's where the value type is string. The only aggregations that reference strings are counts:
partial_gpucount((IF((__auto_generated_subquery_name.
date_diff<= 7), __auto_generated_subquery_name.
sn_id, CAST(NULL AS STRING)))#1019)
Looking at the code for
can_use_hash_groupby()
, it looks like we added a thecudf::has_atomic_support(r.values.type())
check as part of #9483, which added support for decimal128.But I am not sure I understand why this check is there?
For a hash aggregation, we call
detail::groupby()
, which callscreate_hash_map()
to create the hash map. It creates aconcurrent_unordered_map
with theKey
andElement
types both set tosize_type
. If I’m reading it right, that pair of types is what needs to be atomic-able. It looks like it always will be. I don’t see where the value types come in to play, unless there is another atomic operation somewhere?I think this
has_atomic_support
check is causing more aggregations to fallback to the sort_aggregation method than intended.I've verified that if I remove that check, this customer query goes from 70 seconds down to 43 seconds (testing in CUDF 22.02), where in 21.08 it was ~55 seconds.
Steps/Code to reproduce bug
I don't have a simple repro case yet.
Expected behavior
I expect that the performance for the customers query should improve or stay the same.
Environment overview (please complete the following information)
I tested this on an internal cluster we have with 8 nodes where each node has an a100 gpu.
Environment details
N/A
Additional context
N/A
The text was updated successfully, but these errors were encountered: