Revert sum/product aggregation to always produce int64_t type #14907
Conversation
I still don't understand the reason for the performance issue here. Why can't we fix that issue instead of reverting the code like this?
Thank you @ttnghia for your message. I also would prefer to solve the root cause of the issue rather than revert the change. It appears that the performance degradation is happening in a
I'm fine with this going into 24.02, but we should investigate further later. I suspect that it could be something messing with the device sum operator working on different operand types.
@ttnghia The
Wait, from the CUDA documentation I see that unsigned int/long are supported. So there may be something wrong with the groupby code that doesn't call the native CUDA
Thank you @karthikeyann and @ttnghia for this investigation. We would love your help making the groupby code work correctly with
@karthikeyann's analysis is correct, from what I can tell. I think he was referring to the implementation in `cudf/cpp/include/cudf/detail/utilities/device_atomics.cuh`, lines 160 to 162 at commit 5cc021a.
That file only implements
Maybe
From what I understand:
The overloads for `float` and `double` just pass through. We should do the same here for
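The pass-through vs. CAS-fallback pattern under discussion can be sketched in portable C++ (using `std::atomic` in place of the CUDA intrinsics; the function name and dispatch condition are illustrative, not cudf's actual implementation):

```cpp
#include <atomic>
#include <cstdint>
#include <type_traits>

// Hypothetical sketch: types with native atomic support take the fast
// pass-through path; everything else is emulated with a compare-and-swap
// (CAS) loop, the slow path suspected of causing the regression.
template <typename T>
T atomic_add(std::atomic<T>& target, T value)
{
  if constexpr (std::is_integral_v<T>) {
    // "Pass-through": delegate directly to the native atomic add.
    return target.fetch_add(value);
  } else {
    // CAS fallback: retry until the exchange succeeds. Under contention,
    // every failed attempt forces another round trip, which is far slower
    // than a single native atomic add.
    T expected = target.load();
    while (!target.compare_exchange_weak(expected, expected + value)) {
      // `expected` is refreshed by compare_exchange_weak on failure.
    }
    return expected;  // value observed before the add, matching fetch_add
  }
}
```

Adding native overloads for the missing unsigned integral types keeps those types on the pass-through path instead of falling into the CAS loop.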
I'm tracing the call chain myself. I think it goes from aggregation::SUM, which calls
I'd be okay with adding an explicit namespace where we intend this to be called. I agree the name conflict is not ideal.
👍 explicit namespace or snake-case naming is probably the best temporary workaround before the cccl fix.
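The explicit-namespace workaround can be sketched host-side as follows (the global `atomicAdd` here only mimics the CUDA intrinsic, and the wrapper body is illustrative rather than cudf's actual definition):

```cpp
#include <cstdint>

// Mimics the CUDA intrinsic, which lives in the global namespace on device.
inline std::uint64_t atomicAdd(std::uint64_t* address, std::uint64_t val)
{
  std::uint64_t old = *address;
  *address += val;
  return old;
}

namespace cudf {
namespace detail {
// Snake-case name inside an explicit namespace: it cannot collide with the
// global intrinsic, and call sites opt in with full qualification.
inline std::uint64_t atomic_add(std::uint64_t* address, std::uint64_t val)
{
  return ::atomicAdd(address, val);  // delegate to the native operation
}
}  // namespace detail
}  // namespace cudf
```

Callers write `cudf::detail::atomic_add(&x, 1)`, so unqualified-name lookup can never silently pick the wrong overload of `atomicAdd`.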
OK, thank you everyone for this discussion. If I understand the consensus solution correctly we should:
…operators to detail namespace. (#14962)

This PR does a thorough refactoring of `device_atomics.cuh`.

- I moved all atomic-related functions to `cudf::detail::` (making this an API-breaking change, but most likely a low-impact break).
- I added all missing operators for natively supported types to `atomicAdd`, `atomicMin`, `atomicMax`, etc. as discussed in #10149 and #14907.
  - This should prevent fallback to the `atomicCAS` path for types that are natively supported for those atomic operators, which we suspect as the root cause of the performance regression in #14886.
- I kept `atomicAdd` rather than `cudf::detail::atomic_add` in locations where a native CUDA overload exists, and the same for min/max/CAS operations. Aggregations are the only place where we use the special overloads. We were previously calling the native CUDA function rather than our special overloads in many cases, so I retained the previous behavior. This avoids including the additional headers that implement an unnecessary level of wrapping for natively supported overloads.
- I enabled native 2-byte CAS operations (on `unsigned short int`) that eliminate the do-while loop and extra alignment-checking logic.
  - The CUDA docs don't state this, but some forum posts claim this is only supported by compute capability 7.0+. We now have 7.0 as a lower bound for RAPIDS, so I'm not concerned by this as long as builds/tests pass.
- I improved/cleaned the documentation and moved around some code so that the operators were in a logical order.
- I assessed the existing tests and it looks like all the types are being covered. I'm not sure if there is a good way to enforce that certain types (like `uint64_t`) are passing through native `atomicAdd` calls.

Authors:
- Bradley Dice (https://github.com/bdice)

Approvers:
- David Wendt (https://github.com/davidwendt)
- Suraj Aralihalli (https://github.com/SurajAralihalli)

URL: #14962
Description
This pull request reverts the modifications made to the sum/product aggregation target type, ensuring it always produces `int64_t`. The changes implemented by PR #14679, which led to degraded performance when the aggregation column had an unsigned type, are reverted. Additional details can be found in issue #14886.
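The reverted target-type rule amounts to accumulating every integral source type into `int64_t`. A minimal sketch of such a trait (hypothetical names; not cudf's actual `target_type` machinery):

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical trait: SUM/PRODUCT over any integral column accumulates into
// int64_t, so an unsigned input like uint32_t no longer selects an unsigned
// accumulator (the behavior this PR reverts to). Floating-point inputs keep
// their own type.
template <typename Source>
struct sum_target_type {
  using type =
    std::conditional_t<std::is_integral_v<Source>, std::int64_t, Source>;
};

template <typename Source>
using sum_target_type_t = typename sum_target_type<Source>::type;

static_assert(std::is_same_v<sum_target_type_t<std::uint32_t>, std::int64_t>,
              "unsigned integrals map to a signed 64-bit accumulator");
static_assert(std::is_same_v<sum_target_type_t<std::int8_t>, std::int64_t>,
              "narrow signed integrals widen to int64_t");
static_assert(std::is_same_v<sum_target_type_t<double>, double>,
              "floating-point types pass through unchanged");
```

Routing all integral inputs through one signed 64-bit accumulator keeps the dispatched atomic-add operand types uniform, avoiding the mixed signed/unsigned paths implicated in the regression.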
Checklist