Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Allow initial value for cudf::reduce and cudf::segmented_reduce. #11002

Closed
bdice opened this issue May 27, 2022 · 0 comments · Fixed by #11137
Closed

[FEA] Allow initial value for cudf::reduce and cudf::segmented_reduce. #11002

bdice opened this issue May 27, 2022 · 0 comments · Fixed by #11137
Assignees
Labels
feature request New feature or request good first issue Good for newcomers libcudf Affects libcudf (C++/CUDA) code.

Comments

@bdice
Copy link
Contributor

bdice commented May 27, 2022

Problem statement

The algorithms cudf::reduce and cudf::segmented_reduce do not accept an initial value.

Similar algorithms like std::reduce, thrust::reduce, cub::DeviceReduce, and cub::DeviceSegmentedReduce all support providing an initial value.

Providing an initial value for reductions is a common need. This is related to #10455, and is a more general statement of the problem described in that issue.

Describe the solution you'd like
I propose adding method overloads like the following:

std::unique_ptr<scalar> reduce(
  column_view const& col,
  std::unique_ptr<reduce_aggregation> const& agg,
  data_type output_dtype,
  scalar const& init,
  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

std::unique_ptr<column> segmented_reduce(
  column_view const& segmented_values,
  device_span<size_type const> offsets,
  segmented_reduce_aggregation const& agg,
  data_type output_dtype,
  null_policy null_handling,
  scalar const& init,
  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

Design and expected results
The design proposed below was discussed with @gerashegalov @nvdbaranec and @SrikarVanavasam.

For each piece of input data below, I show the current behavior of a reduction and the two proposed behaviors for a null and a non-null initial value. The input data could be a column of values or a single segment of a segmented reduction -- the results are identical.

Input data to sum Current behavior (no initial value) 0 for initial value Null scalar for initial value
[] null 0 null
[1, 2] (include) 3 3 null
[1, 2] (exclude) 3 3 3
[null] (include) null null null
[null] (exclude) null 0 null
[1, null] (include) null null null
[1, null] (exclude) 1 1 1

Note that although 0 is the identity for a sum reduction, the current behavior differs from the results of explicitly providing an initial value that is the identity in the case of empty input data.

Proposed Implementation 1

If the input scalar is valid, provide the value to cub::DeviceSegmentedReduce instead of the binary operator identity. If the input scalar is null, supply the binary operator identity as in the current code.

For normal reductions (non-segmented), the validity logic is straightforward to update: the output is valid if the initial value and reduction result are valid. There may be an early exit case where the output scalar must be null if the initial value is null, allowing the reduction kernel to be skipped (unless the binary operation is special in its null handling, like NULL_MAX?).

For segmented reductions, alter the validity predicate in segmented_null_mask_reduction or perform postprocessing of that bitmask to incorporate the input scalar's nullity and match the expected results in the proposal above.

Proposed Implementation 2

Implement the initial value behavior by performing the current segmented reduction logic, and then performing the corresponding binary operation between the initial value scalar and the result column. Then, for any empty inputs (empty columns or empty segments), the initial value should replace the result value instead. (Otherwise an initial value of 0 and an empty input [] would result in 0 + null = null.)

@bdice bdice added feature request New feature or request good first issue Good for newcomers libcudf Affects libcudf (C++/CUDA) code. labels May 27, 2022
@SrikarVanavasam SrikarVanavasam self-assigned this Jun 17, 2022
rapids-bot bot pushed a commit that referenced this issue Jul 13, 2022
)

Closes #11002

Previously, the algorithms [cudf::reduce](https://docs.rapids.ai/api/libcudf/nightly/group__aggregation__reduction.html#gaabf4fa288981643cca147633bdf58035) and [cudf::segmented_reduce](https://docs.rapids.ai/api/libcudf/nightly/group__aggregation__reduction.html#gae36b126703c20e1836f5eb02adaa965d) did not accept an initial value. This PR overloads of the algorithms that do accept an initial value. Initial values are not yet supported for segmented string reductions and can be addressed in a second PR

Authors:
  - Srikar Vanavasam (https://github.com/SrikarVanavasam)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Mike Wilson (https://github.com/hyperbolic2346)
  - MithunR (https://github.com/mythrocks)

URL: #11137
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request good first issue Good for newcomers libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants