
[BUG] groupby aggregation falls back to sort aggregation for count on a string #10583

Closed
jbrennan333 opened this issue Apr 4, 2022 · 6 comments · Fixed by #10588
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue Spark Functionality that helps Spark RAPIDS

Comments

@jbrennan333
Contributor

Describe the bug
Regression testing of a customer query for spark-rapids found that the query is >20% slower with release 22.02 compared to release 21.08. Investigation found that the slowdown is coming from hash aggregations that are falling back to sort aggregations. With some debugging, I found that the reason we were falling back was we were failing the has_atomic_support() check in can_use_hash_groupby():

My debug prints in `can_use_hash_groupby()` while running the query:
JTB: no atomic support for type 23
JTB: has_struct=false, has_atomic_support=false, all_hash_aggregations=true
JTB: groupby: sorted = false, can_use_hash_groupby = false, _helper = false

Type 23 is STRING, so we were falling back for groupby operations where the value type is string. The only aggregations that reference strings are counts: partial_gpucount((IF((__auto_generated_subquery_name.date_diff <= 7), __auto_generated_subquery_name.sn_id, CAST(NULL AS STRING)))#1019)

Looking at the code for `can_use_hash_groupby()`, it looks like we added the `cudf::has_atomic_support(r.values.type())` check as part of #9483, which added support for decimal128.

But I am not sure I understand why this check is there.

For a hash aggregation, we call detail::groupby(), which calls create_hash_map() to create the hash map. It creates a concurrent_unordered_map with the Key and Element types both set to size_type. If I’m reading it right, that pair of types is what needs to support atomics, and it looks like it always will. I don’t see where the value types come into play, unless there is another atomic operation somewhere.

I think this has_atomic_support check is causing more aggregations to fall back to the sort aggregation method than intended.
I've verified that if I remove that check, this customer query goes from 70 seconds down to 43 seconds (testing in CUDF 22.02), where in 21.08 it was ~55 seconds.

Steps/Code to reproduce bug
I don't have a simple repro case yet.

Expected behavior
I expect that the performance of the customer's query should improve or stay the same.

Environment overview (please complete the following information)
I tested this on an internal cluster with 8 nodes, each with an A100 GPU.

Environment details
N/A

Additional context
N/A

@jbrennan333 jbrennan333 added bug Something isn't working Needs Triage Need team to review and classify labels Apr 4, 2022
@jbrennan333 jbrennan333 added libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue labels Apr 4, 2022
@sameerz sameerz added the Spark Functionality that helps Spark RAPIDS label Apr 4, 2022
@jbrennan333
Contributor Author

Based on my testing so far, we do need the has_atomic_support() check for decimal128 types. I am wondering if we should be checking the cudf::detail::target_type() for the aggregation, instead of the source?

@jlowe
Member

jlowe commented Apr 4, 2022

I am wondering if we should be checking the cudf::detail::target_type() for the aggregation, instead of the source?

cc: @bdice who was looking at this recently

@bdice
Contributor

bdice commented Apr 5, 2022

I noticed this with @isVoid last week. We discussed again today and agreed that using the "target type" might be the right thing to do here. Implementation reference:

using target_type_t = typename target_type_impl<Source, k>::type;

@kuhushukla
Contributor

@jbrennan333 amazing debugging analysis. Kudos.

@jbrennan333
Contributor Author

I put up PR #10588 as a possible fix for this.

@jbrennan333
Contributor Author

Thanks @kuhushukla! @jlowe deserves credit as well for pointing me in the right direction.

rapids-bot bot pushed a commit that referenced this issue Apr 5, 2022
Closes #10583.

Change the has_atomic_support check in can_use_hash_groupby() to check the target type for the aggregation instead of the source type.

See discussion in #10583.

I have verified that this fixes the performance regression in our customer queries, and all unit tests still pass.

Authors:
  - Jim Brennan (https://github.com/jbrennan333)

Approvers:
  - Jake Hemstad (https://github.com/jrhemstad)
  - Yunsong Wang (https://github.com/PointKernel)
  - Nghia Truong (https://github.com/ttnghia)

URL: #10588
@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024