[FEA] Refactors and next steps for segmented reductions #10432
Comments
I'd suggest moving segmented (and regular) reductions to a scheme like the one in groupby/rolling, where every aggregation has preprocessing and finalization steps: cudf/cpp/include/cudf/detail/aggregation/aggregation.hpp Lines 161 to 166 in b1ea304
Still in development.
Fixes reduction gtests source files coded in namespace `cudf::test`. No function or test has changed; the source code was only reworked per namespaces. Fixing this ahead of any changes for #10432. Reference #11734 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Nghia Truong (https://github.com/ttnghia) - Bradley Dice (https://github.com/bdice) - MithunR (https://github.com/mythrocks) URL: #12257
Fixes replace gtests source files coded in namespace `cudf::test`. This only required fixing `replace_nans_tests.cpp`. No function or test has changed; the test source code was only reworked per namespaces. Fixing this ahead of any changes for #10432. Reference #11734 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Bradley Dice (https://github.com/bdice) - Mike Wilson (https://github.com/hyperbolic2346) - Nghia Truong (https://github.com/ttnghia) URL: #12270
This removes/updates some `TODO` comments from the code after discussions on issue #10432 with @davidwendt. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - David Wendt (https://github.com/davidwendt) - Nghia Truong (https://github.com/ttnghia) URL: #12528
After merging #12573, we can update this to use segmented reductions detail APIs like cudf/cpp/include/cudf/detail/null_mask.cuh Line 299 in 3fa081a
Adds mean, variance, and standard deviation aggregation support to `cudf::segmented_reduce`. These are compound (multi-step) aggregations and are modeled after the same aggregations supported by `cudf::reduce`. Once this is approved and merged, the visitor pattern for this approach will be reworked for both `cudf::reduce` and `cudf::segmented_reduce` as per [#10432](#10432 (comment)). The source tree for `src/reductions` has been adjusted to put all segmented-reduce source files into `src/reductions/segmented` and to remove the `segmented_` prefix from those file names. Also, the segmented-reduce functions have been moved from `cudf/detail/reduction_functions.hpp` into their own `cudf/detail/segmented_reduction_functions.hpp`. Likewise, the segmented-reduce CUB calls have been moved from `cudf/detail/reduction.cuh` to the new `cudf/detail/segmented_reduction.cuh` to help minimize including CUB headers. Additionally, the sum-of-squares aggregation is included since it was a simple reduction only requiring the appropriate aggregation class registration and source file. Finally, gtests are added for these new types. The compound types only support floating-point outputs. Follow-on PRs will address the visitor pattern already mentioned above as well as additional data types. Discussion on additional aggregations will occur in the reference issue #10432. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Robert Maynard (https://github.com/robertmaynard) - AJ Schmidt (https://github.com/ajschmidt8) - Mike Wilson (https://github.com/hyperbolic2346) - Bradley Dice (https://github.com/bdice) URL: #12573
Reworks some internal source specific to fixed-point types using `cudf::reduce` by removing the duplicated code logic. This was found while working on #12573 and #10432. Since the fix requires no dependencies, this separate PR is used to minimize code review churn. This should help with code consistency with the fixed-point-specific logic when added to segmented-reduction. No function has changed so all existing gtests are adequate. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Yunsong Wang (https://github.com/PointKernel) - Bradley Dice (https://github.com/bdice) URL: #12652
Depends on #12573 Adds additional support for fixed-point types in `cudf::segmented_reduce` for simple aggregations: sum, product, and sum-of-squares. Reference: #10432 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Mike Wilson (https://github.com/hyperbolic2346) - Nghia Truong (https://github.com/ttnghia) - Bradley Dice (https://github.com/bdice) URL: #12680
Adds support for `NUNIQUE` aggregation type for `cudf::segmented_reduce`. This computes the number of unique elements within each segment specified. Due to the overhead of sorting, the segments must be sorted before calling this function; otherwise the results are undefined. Also, only non-nested column types are supported. Reference #10432 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Divye Gala (https://github.com/divyegala) - Karthikeyan (https://github.com/karthikeyann) - Bradley Dice (https://github.com/bdice) - Vyas Ramasubramani (https://github.com/vyasr) URL: #12972
This issue contains a few proposals for improving the segmented reduction code introduced in #9621.
Investigate sort-groupby aggregations
(Idea from @ttnghia)
With the ability to perform segmented reductions, sort-based groupby may be able to use `group_offsets` to define its segments, rather than materializing a full column of sorted/monotonic `group_labels`. In effect, this allows us to replace a call to the `thrust::reduce_by_key` algorithm with a call to `cub::DeviceSegmentedReduce::Reduce`, while eliminating the need to compute the `group_labels` column. I think this should be a more efficient algorithm, and also will require less intermediate memory allocation. Benchmarks should be performed when making this change.
Refactor internal use of indices to 2N style (match CUB)
The indexing scheme used for segmented reduction is currently "N+1", like how list offsets are indexed. We want to refactor this to use "2N" indexing. This would align with `cub::DeviceSegmentedReduce::Reduce` and permit greater flexibility in the API internals. See discussion here for details.
Compound reductions like mean
The segmented reduction code currently supports "simple" reductions. Support for "compound" reductions is needed. This includes multi-step calculations like mean, standard deviation, or sum of squares. Non-segmented compound reductions are defined here: https://github.com/rapidsai/cudf/blob/c1638869116aae2c6dde6024394279a2fb79e685/cpp/src/reductions/compound.cuh
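To make the "compound" idea concrete, here is a minimal host-side sketch in plain C++ (no CUDA, no cudf). It computes a per-segment mean as a sum step and a count step followed by a finalization step, and it accepts segments in both the "N+1" offsets layout and the "2N" begin/end layout discussed above. The names `segmented_mean_2n` and `segmented_mean_offsets` are hypothetical and are not cudf APIs. Returning NaN for an empty segment is an assumption of this sketch, not a statement about how cudf represents that case.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper, not a cudf API.
// "2N" layout: segment i covers the half-open range [begins[i], ends[i]).
// The mean is computed as a compound reduction: sum and count first,
// then a finalization step that divides them.
std::vector<double> segmented_mean_2n(std::vector<double> const& values,
                                      std::vector<std::size_t> const& begins,
                                      std::vector<std::size_t> const& ends)
{
  assert(begins.size() == ends.size());
  std::vector<double> result(begins.size());
  for (std::size_t i = 0; i < begins.size(); ++i) {
    assert(begins[i] <= ends[i] && ends[i] <= values.size());
    // Step 1: simple reductions over the segment.
    double sum = 0.0;
    for (std::size_t j = begins[i]; j < ends[i]; ++j) { sum += values[j]; }
    std::size_t const count = ends[i] - begins[i];
    // Step 2: finalization. Assumption: an empty segment yields NaN here.
    result[i] = (count == 0) ? std::numeric_limits<double>::quiet_NaN()
                             : sum / static_cast<double>(count);
  }
  return result;
}

// Hypothetical adapter, not a cudf API.
// "N+1" layout: segment i covers [offsets[i], offsets[i + 1]).
// It is a special case of the 2N layout with begins = offsets[0..N)
// and ends = offsets[1..N+1).
std::vector<double> segmented_mean_offsets(std::vector<double> const& values,
                                           std::vector<std::size_t> const& offsets)
{
  assert(!offsets.empty());
  std::vector<std::size_t> const begins(offsets.begin(), offsets.end() - 1);
  std::vector<std::size_t> const ends(offsets.begin() + 1, offsets.end());
  return segmented_mean_2n(values, begins, ends);
}
```

With values `{1, 2, 3, 4, 5, 6}` and offsets `{0, 2, 2, 6}`, the sketch produces three means: 1.5, NaN for the empty middle segment, and 4.5. In this sketch the 2N form also accepts segments that overlap or leave gaps, which the N+1 form cannot express.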
Fixes for output_type precision
@isVoid and I filed #9988 while working on #9621 because the documentation doesn't align with the implementation for when data is cast to the output dtype relative to when the reduction is performed. This affects segmented reduction as well.
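The ordering question matters because casting and reducing do not commute in general. The following plain C++ sketch illustrates the difference with two hypothetical helpers, `sum_cast_first` and `sum_cast_last`. Neither is a cudf function, and the sketch makes no claim about which order cudf actually uses.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: cast every input element to the output type,
// then reduce in the output type.
std::int32_t sum_cast_first(std::vector<double> const& values)
{
  std::int32_t acc = 0;
  for (double v : values) { acc += static_cast<std::int32_t>(v); }
  return acc;
}

// Hypothetical helper: reduce in the input type, then cast the final
// result to the output type.
std::int32_t sum_cast_last(std::vector<double> const& values)
{
  double acc = 0.0;
  for (double v : values) { acc += v; }
  return static_cast<std::int32_t>(acc);
}
```

For the input `{0.5, 0.5, 0.5, 0.5}`, `sum_cast_first` returns 0 because each element truncates to 0 before the sum, while `sum_cast_last` returns 2. Documentation that describes one order while the implementation follows the other can therefore describe a different result.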
Explore rewriting `get_null_replacing_element_transformer` with nullate
It may be possible to clean up the implementation of null element handling here by using nullate.
Extend to more data types
We need to review the types supported by non-segmented reductions and ensure that segmented reductions support the same types. Decimal support has been requested here: #10417