Add support for tdigest and merge_tdigest aggregations through cudf::reduce #10433
Codecov Report
Coverage Diff (branch-22.04 -> #10433)
Coverage   86.13% -> 86.18%   (+0.04%)
Files      139 -> 139
Lines      22438 -> 22468     (+30)
Hits       19328 -> 19363     (+35)
Misses     3110 -> 3105       (-5)
…tdigest aggregations is STRUCT.
CMake LGTM.
struct make_centroid_no_nulls {
  column_device_view const col;

  centroid operator() __device__(size_type index) const
I've heard we need to replace thrust::tuple with cuda::std::tuple.
  offset_type const* group_offsets;

- thrust::pair<double, int> operator() __device__(double next_limit, size_type group_index)
+ thrust::pair<double, int> operator() __device__(double next_limit, size_type group_index) const
The same applies to thrust::pair here.
  *
  * This functor assumes the weight for all scalars is simply 1. Under this assumption,
  * the nearest weight that will be <= the next limit is simply the nearest integer <= the limit,
  * which we can get by just taking floor(next_limit). For example if our next limit is 3.56, the
  * nearest whole number <= it is floor(3.56) == 3.
  */
- struct nearest_value_scalar_weights {
+ struct nearest_value_scalar_weights_grouped {
    offset_type const* group_offsets;
Prefer device_span.
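To make the unit-weight reasoning in the quoted comment concrete, and to show roughly what a span-style member could look like in place of the raw pointer, here is a host-only C++ sketch. It is not the cudf implementation: offsets_view is a hypothetical stand-in for cudf::device_span, std::pair stands in for thrust::pair, and the functor name, its body, and the returned index convention are assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>

// Hypothetical minimal span: a pointer plus a length, so the functor carries
// the bounds of the offsets array instead of a bare pointer.
struct offsets_view {
  int const* data;
  std::size_t size;
  int operator[](std::size_t i) const { return data[i]; }
};

// Host-only sketch; the body is an assumption, not the cudf code.
// With unit weights, the cumulative weight after k values of a group is k,
// so the largest cumulative weight <= next_limit is floor(next_limit).
struct nearest_value_unit_weights_grouped {
  offsets_view group_offsets;  // group g covers rows [offsets[g], offsets[g + 1])

  // Returns (nearest cumulative weight, row index of the value that reaches it).
  std::pair<double, int> operator()(double next_limit, int group_index) const
  {
    // The length carried by the view lets us check the group index.
    assert(static_cast<std::size_t>(group_index) + 1 < group_offsets.size);
    double const nearest = std::floor(next_limit);
    // The k-th value of the group sits at position k - 1; clamp to the first value.
    int const within_group = nearest < 1.0 ? 0 : static_cast<int>(nearest) - 1;
    return {nearest, group_offsets[group_index] + within_group};
  }
};
```

With offsets {0, 4, 10} and a limit of 3.56, the sketch yields a nearest weight of 3 in either group; only the row index shifts by the group's starting offset. Carrying the length alongside the pointer is what makes the bounds check possible, which is the usual argument for a span over a raw pointer.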
LGTM
rerun tests
@gpucibot merge
Previously, these aggregations only worked with groupby. Now they can be invoked through cudf::reduce, producing scalar tdigest values (which under the hood are simply struct columns with 1 row).
The difference between the groupby and reduce versions is minimal. They are both fundamentally reduce_by_key operations, where the keys represent the bucketing of input values to merged centroids. In the case of groupby, the keys are further partitioned by the specific input group. So the bulk of the changes are simply adding a few extra template parameters to various internal functions to allow the reduce path to behave as if it were just a constant group.
Similarly, many of the groupby tests which involved single groups have been refactored/repurposed for the reduce tests.
Most of the important changes are in tdigest_aggregation.cu.
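The "constant group" idea can be sketched on the host in plain C++. This is only an illustration under stated assumptions, not code from tdigest_aggregation.cu: the names centroid, labeled_groups, constant_group and merge_centroids are hypothetical, the loop is a sequential stand-in for a reduce_by_key, and every input value is assumed to have weight 1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A centroid is a (mean, weight) pair, as in a t-digest.
struct centroid {
  double mean;
  double weight;
};

// Hypothetical group lookups: the groupby path reads a per-row group label,
// while the reduce path treats every row as belonging to group 0.
struct labeled_groups {
  std::vector<int> const& labels;
  int operator()(std::size_t i) const { return labels[i]; }
};
struct constant_group {
  int operator()(std::size_t) const { return 0; }
};

// Sequential stand-in for a reduce_by_key: consecutive rows that share the
// same (group, cluster) key are merged into one weighted-mean centroid.
template <typename GroupLookup>
std::vector<centroid> merge_centroids(std::vector<double> const& values,
                                      std::vector<int> const& cluster_ids,
                                      GroupLookup group_of)
{
  std::vector<centroid> out;
  for (std::size_t i = 0; i < values.size(); ++i) {
    bool const same_key = i > 0 && group_of(i) == group_of(i - 1) &&
                          cluster_ids[i] == cluster_ids[i - 1];
    if (same_key) {
      // Fold the unit-weight value into the running weighted mean.
      centroid& c = out.back();
      c.mean      = (c.mean * c.weight + values[i]) / (c.weight + 1.0);
      c.weight += 1.0;
    } else {
      out.push_back({values[i], 1.0});
    }
  }
  return out;
}
```

Calling merge_centroids(values, cluster_ids, constant_group{}) plays the role of the reduce path, and passing labeled_groups{labels} plays the role of the groupby path. The only difference between the two is the template argument, which mirrors the description's point that the key is the cluster bucket, optionally partitioned further by group.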