Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] customizable default for segmented_reduce value on empty segments #10455

Closed
gerashegalov opened this issue Mar 18, 2022 · 8 comments
Closed
Labels
0 - Backlog In queue waiting for assignment feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@gerashegalov
Copy link
Contributor

Is your feature request related to a problem? Please describe.
Currently segmented_reduce returns NULL for various inputs, including empty inputs. For empty inputs NULL is not always a natural choice. E.g. when implementing a Boolean aggregation exists NVIDIA/spark-rapids#4973 for Spark SQL, empty input quite intuitively results in false

Describe the solution you'd like
The most generic solution is to simply to allow the user to pass the initial value for reduction similar to Scala fold

Describe alternatives you've considered
As a workaround, we have to compute a Bool column for the post processing step of fixing up "wrong" results.

@gerashegalov gerashegalov added feature request New feature or request Needs Triage Need team to review and classify labels Mar 18, 2022
@gerashegalov gerashegalov added the Spark Functionality that helps Spark RAPIDS label Mar 18, 2022
@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@jrhemstad
Copy link
Contributor

Makes sense to me. No different than cub's segmented reduce taking an initial value.

@jrhemstad jrhemstad added 0 - Backlog In queue waiting for assignment and removed inactive-30d labels May 25, 2022
@gerashegalov gerashegalov changed the title [FEA] customizable default for segmented_reduce value on empty lists [FEA] customizable default for segmented_reduce value on empty segments May 25, 2022
@bdice
Copy link
Contributor

bdice commented May 25, 2022

Providing an initial value for the reduction is different than post-processing to replace null values. Which do you want? There's an implicit 'operator identity' right now as the default value, and its behavior wrt null inclusion/exclusion has the potential to change if a default value besides the identity is allowed.

For example, consider the operator +, initial value of 0 (the operator identity), and segment data [null]. Currently, the behavior is to return null for both null_policy::INCLUDE and null_policy::EXCLUDE. It seems that this proposed change could mean either of:

  1. reduce(+, 0, [null], include) == null, reduce(+, 0, [null], exclude) == 0
    • Equivalent to left-fold with null semantics from the initial value
    • Note the behavior change from null to 0 for exclude!
  2. reduce(+, 0, [null], include) == 0, reduce(+, 0, [null], exclude) == 0
    • Equivalent to trivial post-processing to replace null with 0

If it's (1) you're seeking, this could be in libcudf. If it's (2) you're seeking, I would recommend a trivial postprocessing.

Here's a larger matrix of examples (edit: I'm re-reading/editing this to ensure it's aligned with my expectations):

Click to expand
Initial value Segment Data Nulls Current behavior Result of left fold (+) from initial value Result of post-processed null replacement
(identity) [] Include null 0 (currently gives null!) 0
(identity) [] Exclude null 0 (currently gives null!) 0
(identity) [1, 2] Include 3 3 3
(identity) [1, 2] Exclude 3 3 3
(identity) [null] Include null null 0
(identity) [null] Exclude null 0 (currently gives null!) 0
(identity) [1, null] Include null null 0
(identity) [1, null] Exclude 1 1 1
42 [] Include 42 42
42 [] Exclude 42 42
42 [1, 2] Include 45 ???
42 [1, 2] Exclude 45 ???
42 [null] Include null 42
42 [null] Exclude 42 42
42 [1, null] Include null 42
42 [1, null] Exclude 43 ???

@jrhemstad
Copy link
Contributor

Providing an initial value for the reduction is different than post-processing to replace null values.

Unless I'm misunderstanding, I don't think @gerashegalov is asking for post-processing to replace nulls.

He's saying that when there is an empty segment it currently returns null. It would be nice instead if an empty segment would just return the init value instead.

This is a natural extension of howstd::reduce on an empty input range will just return the initial value.

https://godbolt.org/z/e8TYo4Yd6

I was a little surprised that we don't have an initial_value argument in cudf::reduce.

@gerashegalov
Copy link
Contributor Author

Unless I'm misunderstanding, I don't think @gerashegalov is asking for post-processing to replace nulls.

Correct, we have already implemented postprocessing because there is no way to specify the initial value for aggregation https://github.com/NVIDIA/spark-rapids/blob/branch-22.06/sql-plugin/src/main/scala/com/nvidia/spark/rapids/higherOrderFunctions.scala#L398-L402

@GregoryKimball GregoryKimball added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Jun 28, 2022
@davidwendt
Copy link
Contributor

Both cudf::reduce and cudf::segmented_reduce accept an initial-value parameter now.
If you pass false as an initial value for the ANY aggregation type (cudf::make_any_aggregation<cudf::segmented_reduce_aggregation>()) then empty segments will return false and not null. Null rows are still controlled by the null_policy parameter.

input = [0,0], [1,1], [1,0], [0,1,1], [], [1,null], [0,null]
init_value = false
result = cudf::segmented_reduce(input, ANY, BOOL8, EXCLUDE, init_value)
result -> [false, true, true, true, false, true, false]

result = cudf::segmented_reduce(input, ANY, BOOL8, INCLUDE, init_value)
result -> [false, true, true, true, false, null, null]

Without the initial value, the empty segments will result in a null row.

rapids-bot bot pushed a commit that referenced this issue Mar 15, 2023
Minor fix to the `SegmentedReductionTest/AnyExcludeNulls` gtest to use `0/false` as the initial value to better test and demonstrate the usage. Found this when looking for an example to answer issue #10455 
Also reworked the code in this test to use variables to help minimize copy errors and shorten the code size.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - https://github.com/nvdbaranec
  - Bradley Dice (https://github.com/bdice)

URL: #12940
@vyasr
Copy link
Contributor

vyasr commented Mar 15, 2023

Can we close this now, then? Or is there an outstanding ask remaining?

@gerashegalov
Copy link
Contributor Author

Can we close this now, then? Or is there an outstanding ask remaining?

Thanks, I don't believe so. We'll open more issues if needed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0 - Backlog In queue waiting for assignment feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

No branches or pull requests

6 participants