Groupby hash aggregations use sort-based implementation if nested-type columns are used as values #14412

divyegala · 2023-11-14T19:44:30Z

We should be able to use nested-type columns as values and still invoke a hash-based groupby. Hash-based groupby is generally faster, so we do not want to silently fall back to the sort-based implementation.

cudf/cpp/src/groupby/hash/groupby.cu

Lines 654 to 656 in abc0d41

    
    // Currently, input values (not keys) of STRUCT and LIST types are not supported in any of
    // hash-based aggregations. For those situations, we fallback to sort-based aggregations.
    if (v_type.id() == type_id::STRUCT or v_type.id() == type_id::LIST) { return false; }

Reference thread: #13795 (comment)

bdice · 2023-11-14T19:50:43Z

This code block is the piece in question:

cudf/cpp/src/groupby/hash/groupby.cu

Lines 654 to 656 in b446a6f

    
    // Currently, input values (not keys) of STRUCT and LIST types are not supported in any of
    // hash-based aggregations. For those situations, we fallback to sort-based aggregations.
    if (v_type.id() == type_id::STRUCT or v_type.id() == type_id::LIST) { return false; }

@ttnghia, you added this in #13676. Do you know if this fallback is still required, or why?

We discussed this a bit here: #13676 (comment)

Side note, I'm punching my past self: I asked this question on that PR and never submitted my review.

ttnghia · 2023-11-14T20:11:15Z

> @ttnghia, you added this in #13676. Do you know if this fallback is still required, or why?

Because hash-based aggregations are implemented for plain types only, using built-in operators such as < instead of a user-provided comparator. See struct update_target_element in cpp/include/cudf/detail/aggregation/aggregation.cuh.

If we want to support hash-based aggregations for nested types, we need to rewrite update_target_element so that it compares rows using a row comparator instead.
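To make the idea concrete, here is a minimal host-side C++ sketch of an ARGMIN-style update that goes through a comparator instead of `<` on the values. It is not the libcudf implementation: `Row`, `ARGMIN_SENTINEL`, `update_argmin`, and `argmin_rows` are hypothetical names, a `std::pair` stands in for a nested value, and `std::atomic<int>` stands in for the device atomic. It relies on the assumption that the target slot stores only a row index, so the slot itself stays a plain integer even when the compared values are nested.

```cpp
#include <atomic>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one nested (STRUCT-like) value.
using Row = std::pair<int, std::string>;

// Hypothetical sentinel meaning "no row recorded for this group yet".
constexpr int ARGMIN_SENTINEL = -1;

// Update the ARGMIN slot of one group with a candidate row index.
// `less(i, j)` plays the role of a row comparator: it returns true
// when row i orders before row j.
template <typename RowLess>
void update_argmin(std::atomic<int>& target, int candidate, RowLess less)
{
  int current = target.load();
  // Keep trying while the slot is empty or the candidate orders first.
  while (current == ARGMIN_SENTINEL || less(candidate, current)) {
    if (target.compare_exchange_weak(current, candidate)) { return; }
    // On failure, `current` now holds the latest stored index and the
    // loop condition is re-evaluated against it.
  }
}

// Return the index of the smallest row, or the sentinel if `rows` is empty.
inline int argmin_rows(std::vector<Row> const& rows)
{
  std::atomic<int> target{ARGMIN_SENTINEL};
  // Lexicographic comparison of the two pair fields.
  auto less = [&rows](int i, int j) { return rows[i] < rows[j]; };
  for (int i = 0; i < static_cast<int>(rows.size()); ++i) {
    update_argmin(target, i, less);
  }
  return target.load();
}
```

Calling `argmin_rows` on a small vector of rows returns the index of the lexicographically smallest one. Whether the real update_target_element can be restructured this way is not settled by this thread.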

bdice · 2023-11-14T20:32:37Z

Great, that was helpful. I think we can do this. I think the rough plan would be to preprocess the table so we have device comparators that can be used, pass the preprocessed table info through all the aggregation machinery, and use the device comparator where needed in update_target_element. Does that sound right to you?

ttnghia · 2023-11-14T20:38:21Z

Yes, that sounds good. Note that we only need to rework ARGMIN and ARGMAX, not everything (the other aggregations, such as SUM and PRODUCT, can't support nested types), so the amount of work should not be very heavy.

PointKernel · 2025-01-16T18:56:50Z

To clarify, the reason we cannot use hash-based groupby for nested types is that there is currently no way to atomically update nested data on the device, because the hardware has no direct support for such operations. A possible solution is to use an atomic lock table, which CCCL is expected to support in the future (NVIDIA/cccl#990).

We should backlog this for now until the atomic lock table becomes available.
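For readers unfamiliar with the term, the following host-side C++ sketch shows the general shape of a lock table: a fixed pool of locks, indexed by slot, that guards updates to values too large for a single atomic instruction. It is an illustration only and does not reflect the design proposed in NVIDIA/cccl#990. `LockTable`, `Slot`, `Row`, and `update_min` are hypothetical names, and `std::mutex` stands in for whatever device-side lock a real implementation would use.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one nested (STRUCT-like) value.
using Row = std::pair<int, std::string>;

// Hypothetical per-group result slot; `valid` is false until the first update.
struct Slot {
  bool valid = false;
  Row value;
};

// Hypothetical fixed-size pool of locks. Several slots may map to the
// same lock, which trades some contention for bounded memory use.
class LockTable {
 public:
  std::mutex& lock_for(std::size_t slot) { return locks_[slot % locks_.size()]; }

 private:
  std::array<std::mutex, 64> locks_;
};

// Keep the smallest value seen so far in `results[group]`.
// The nested value is read and written only while the group's lock is held.
inline void update_min(LockTable& locks,
                       std::vector<Slot>& results,
                       std::size_t group,
                       Row const& candidate)
{
  assert(group < results.size());
  std::lock_guard<std::mutex> guard(locks.lock_for(group));
  Slot& slot = results[group];
  if (!slot.valid || candidate < slot.value) {
    slot.value = candidate;
    slot.valid = true;
  }
}
```

The sketch sizes the pool at 64 locks arbitrarily. A larger pool reduces false sharing between groups at the cost of memory.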

divyegala added the bug (Something isn't working) and Needs Triage (Need team to review and classify) labels Nov 14, 2023

divyegala mentioned this issue Nov 14, 2023

Example code for blog on new row comparators #13795

Merged

3 tasks

GregoryKimball added the 0 - Backlog (In queue waiting for assignment) and libcudf (Affects libcudf (C++/CUDA) code.) labels and removed the Needs Triage (Need team to review and classify) label Dec 14, 2023

GregoryKimball added this to the Aggregations continuous improvement milestone Dec 14, 2023

GregoryKimball added this to libcudf Dec 14, 2023

GregoryKimball moved this to Needs owner in libcudf Dec 14, 2023

GregoryKimball added the feature request (New feature or request) and Performance (Performance related issue) labels and removed the feature request (New feature or request) label Dec 14, 2023

GregoryKimball moved this from Needs owner to To be revisited in libcudf Feb 20, 2024

bdice mentioned this issue Jul 9, 2024

Support min_by/max_by group by aggregate #16163

Closed

3 tasks

thirtiseven mentioned this issue Aug 30, 2024

[FEA][Follow on] Improve performance of min_by and max_by NVIDIA/spark-rapids#11412

Open

PointKernel self-assigned this Dec 18, 2024
