[FEA] customizable default for segmented_reduce value on empty segments #10455

gerashegalov · 2022-03-18T18:27:55Z

Is your feature request related to a problem? Please describe.
Currently, segmented_reduce returns NULL for various inputs, including empty inputs. For empty inputs, NULL is not always the natural choice. For example, when implementing the Boolean aggregation `exists` (NVIDIA/spark-rapids#4973) for Spark SQL, an empty input quite intuitively results in false.

Describe the solution you'd like
The most generic solution is simply to allow the user to pass the initial value for the reduction, similar to Scala's fold.
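To make the request concrete, here is a minimal plain C++ sketch of a fold-style segmented reduction. It is not libcudf code: `segmented_fold` and `demo_segmented_fold` are hypothetical names, and the offsets layout (segment i spans `offsets[i]` to `offsets[i + 1]`) is an assumption made for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical helper (not a libcudf API): reduces each segment of `values`,
// starting every segment from `init`.
// Assumption: segment i covers values[offsets[i]] .. values[offsets[i + 1] - 1].
template <typename T, typename BinaryOp>
std::vector<T> segmented_fold(std::vector<T> const& values,
                              std::vector<std::size_t> const& offsets,
                              T init,
                              BinaryOp op)
{
  std::vector<T> results;
  for (std::size_t i = 0; i + 1 < offsets.size(); ++i) {
    auto first = values.begin() + static_cast<std::ptrdiff_t>(offsets[i]);
    auto last  = values.begin() + static_cast<std::ptrdiff_t>(offsets[i + 1]);
    // For an empty segment first == last, so accumulate returns init unchanged.
    results.push_back(std::accumulate(first, last, init, op));
  }
  return results;
}

// An "exists"-like aggregation: logical OR per segment, starting from false.
inline void demo_segmented_fold()
{
  // Three segments: [false, true], [] and [false].
  std::vector<bool> const values{false, true, false};
  std::vector<std::size_t> const offsets{0, 2, 2, 3};
  auto const exists = segmented_fold(values, offsets, false, std::logical_or<bool>{});
  // The empty middle segment yields the initial value, false.
  assert((exists == std::vector<bool>{true, false, false}));
}
```

With this shape, the empty-segment result is whatever the caller passes as `init`, so no fix-up pass is needed afterwards.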

Describe alternatives you've considered
As a workaround, we have to compute a Bool column for a post-processing step that fixes up the "wrong" results.


github-actions · 2022-04-17T19:02:42Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

jrhemstad · 2022-05-25T20:08:28Z

Makes sense to me. No different than cub's segmented reduce taking an initial value.

bdice · 2022-05-25T21:04:19Z

Providing an initial value for the reduction is different than post-processing to replace null values. Which do you want? There's an implicit 'operator identity' right now as the default value, and its behavior wrt null inclusion/exclusion has the potential to change if a default value besides the identity is allowed.

For example, consider the operator +, initial value of 0 (the operator identity), and segment data [null]. Currently, the behavior is to return null for both null_policy::INCLUDE and null_policy::EXCLUDE. It seems that this proposed change could mean either of:

1. reduce(+, 0, [null], include) == null, reduce(+, 0, [null], exclude) == 0
   - Equivalent to left-fold with null semantics from the initial value
   - Note the behavior change from null to 0 for exclude!
2. reduce(+, 0, [null], include) == 0, reduce(+, 0, [null], exclude) == 0
   - Equivalent to trivial post-processing to replace null with 0

If it's (1) you're seeking, this could be in libcudf. If it's (2) you're seeking, I would recommend a trivial postprocessing.

Here's a larger matrix of examples (edit: I'm re-reading/editing this to ensure it's aligned with my expectations):

| Initial value | Segment data | Nulls | Current behavior | Result of left fold (+) from initial value | Result of post-processed null replacement |
|---|---|---|---|---|---|
| (identity) | [] | Include | null | 0 (currently gives null!) | 0 |
| (identity) | [] | Exclude | null | 0 (currently gives null!) | 0 |
| (identity) | [1, 2] | Include | 3 | 3 | 3 |
| (identity) | [1, 2] | Exclude | 3 | 3 | 3 |
| (identity) | [null] | Include | null | null | 0 |
| (identity) | [null] | Exclude | null | 0 (currently gives null!) | 0 |
| (identity) | [1, null] | Include | null | null | 0 |
| (identity) | [1, null] | Exclude | 1 | 1 | 1 |
| 42 | [] | Include | | 42 | 42 |
| 42 | [] | Exclude | | 42 | 42 |
| 42 | [1, 2] | Include | | 45 | ??? |
| 42 | [1, 2] | Exclude | | 45 | ??? |
| 42 | [null] | Include | | null | 42 |
| 42 | [null] | Exclude | | 42 | 42 |
| 42 | [1, null] | Include | | null | 42 |
| 42 | [1, null] | Exclude | | 43 | ??? |
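The matrix can be reproduced with a small plain C++ model, sketched here under stated assumptions: `std::nullopt` stands in for a null row, the operator is fixed to +, and `fold_from_init`, `current_behavior`, and `replace_null` are hypothetical names for this illustration rather than libcudf functions. The `null_policy` enum is defined locally and only mirrors the names used above.

```cpp
#include <cassert>
#include <optional>
#include <vector>

enum class null_policy { INCLUDE, EXCLUDE };  // local stand-in, not the cudf type

using nullable_int = std::optional<int>;       // std::nullopt models a null row
using segment      = std::vector<nullable_int>;

// Interpretation (1): left fold with + starting from `init`.
nullable_int fold_from_init(segment const& data, null_policy policy, int init)
{
  int acc = init;
  for (nullable_int const& v : data) {
    if (v.has_value()) {
      acc += *v;
    } else if (policy == null_policy::INCLUDE) {
      return std::nullopt;  // an included null makes the whole result null
    }
    // Under EXCLUDE a null element is skipped.
  }
  return acc;
}

// The "Current behavior" column: null when nothing valid was reduced.
nullable_int current_behavior(segment const& data, null_policy policy)
{
  bool any_valid = false;
  int sum        = 0;
  for (nullable_int const& v : data) {
    if (v.has_value()) {
      any_valid = true;
      sum += *v;
    } else if (policy == null_policy::INCLUDE) {
      return std::nullopt;
    }
  }
  if (!any_valid) { return std::nullopt; }
  return sum;
}

// Interpretation (2): post-process a result by replacing null.
int replace_null(nullable_int const& result, int replacement)
{
  return result.value_or(replacement);
}

inline void demo_null_semantics()
{
  // Row "(identity), [null], Exclude": the fold gives 0 where today's result is null.
  assert((fold_from_init({std::nullopt}, null_policy::EXCLUDE, 0) == 0));
  assert((current_behavior({std::nullopt}, null_policy::EXCLUDE) == std::nullopt));
  // Row "42, [1, null], Include": the fold stays null, replacement yields 42.
  assert((fold_from_init({1, std::nullopt}, null_policy::INCLUDE, 42) == std::nullopt));
  assert((replace_null(current_behavior({1, std::nullopt}, null_policy::INCLUDE), 42) == 42));
}
```

The two interpretations only diverge when nulls are included, which is why the rows marked ??? are left unmodeled: null replacement never touches a non-null result, so it says nothing about how 42 should combine with valid data.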

jrhemstad · 2022-05-25T21:32:48Z

> Providing an initial value for the reduction is different than post-processing to replace null values.

Unless I'm misunderstanding, I don't think @gerashegalov is asking for post-processing to replace nulls.

He's saying that an empty segment currently returns null. It would be nice if an empty segment just returned the init value instead.

This is a natural extension of how std::reduce on an empty input range will just return the initial value.

https://godbolt.org/z/e8TYo4Yd6
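For readers who don't follow the link, the same point fits in a few lines of standard C++17. This is an independent sketch, not a copy of the linked snippet; `sum_from` and `demo_std_reduce` are hypothetical names used only here.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical wrapper, only to give the std::reduce call a name.
inline int sum_from(std::vector<int> const& values, int init)
{
  return std::reduce(values.begin(), values.end(), init);
}

inline void demo_std_reduce()
{
  assert(sum_from({}, 42) == 42);      // empty range: the initial value comes back
  assert(sum_from({1, 2}, 42) == 45);  // otherwise init takes part in the sum
}
```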

I was a little surprised that we don't have an initial_value argument in cudf::reduce.

gerashegalov · 2022-05-25T21:43:58Z

> Unless I'm misunderstanding, I don't think @gerashegalov is asking for post-processing to replace nulls.

Correct, we have already implemented post-processing because there is no way to specify the initial value for the aggregation: https://github.com/NVIDIA/spark-rapids/blob/branch-22.06/sql-plugin/src/main/scala/com/nvidia/spark/rapids/higherOrderFunctions.scala#L398-L402

davidwendt · 2023-03-14T15:56:38Z

Both cudf::reduce and cudf::segmented_reduce accept an initial-value parameter now.
If you pass false as an initial value for the ANY aggregation type (cudf::make_any_aggregation<cudf::segmented_reduce_aggregation>()) then empty segments will return false and not null. Null rows are still controlled by the null_policy parameter.

input = [0,0], [1,1], [1,0], [0,1,1], [], [1,null], [0,null]
init_value = false
result = cudf::segmented_reduce(input, ANY, BOOL8, EXCLUDE, init_value)
result -> [false, true, true, true, false, true, false]

result = cudf::segmented_reduce(input, ANY, BOOL8, INCLUDE, init_value)
result -> [false, true, true, true, false, null, null]

Without the initial value, the empty segments will result in a null row.
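The results quoted above can be checked against a plain C++ model of the described semantics. This sketch does not call libcudf: `segmented_any` and `demo_segmented_any` are hypothetical names, `null_policy` is a local stand-in enum, and `std::nullopt` models both a null row and an absent initial value.

```cpp
#include <cassert>
#include <optional>
#include <vector>

enum class null_policy { INCLUDE, EXCLUDE };  // local stand-in, not the cudf type

using nullable_bool = std::optional<bool>;     // std::nullopt models a null row

// Models ANY per segment. `init` is optional so the "no initial value" case
// can be shown as well.
std::vector<nullable_bool> segmented_any(
  std::vector<std::vector<nullable_bool>> const& segments,
  null_policy policy,
  std::optional<bool> init)
{
  std::vector<nullable_bool> results;
  for (auto const& seg : segments) {
    bool acc       = init.value_or(false);
    bool any_valid = init.has_value();  // an initial value counts as valid input
    bool saw_null  = false;
    for (nullable_bool const& v : seg) {
      if (v.has_value()) {
        acc       = acc || *v;
        any_valid = true;
      } else {
        saw_null = true;
      }
    }
    if ((policy == null_policy::INCLUDE && saw_null) || !any_valid) {
      results.push_back(std::nullopt);
    } else {
      results.push_back(acc);
    }
  }
  return results;
}

inline void demo_segmented_any()
{
  auto const null = std::nullopt;
  std::vector<std::vector<nullable_bool>> const input{
    {false, false}, {true, true}, {true, false}, {false, true, true},
    {}, {true, null}, {false, null}};

  std::vector<nullable_bool> const expect_exclude{false, true, true, true, false, true, false};
  std::vector<nullable_bool> const expect_include{false, true, true, true, false, null, null};
  assert(segmented_any(input, null_policy::EXCLUDE, false) == expect_exclude);
  assert(segmented_any(input, null_policy::INCLUDE, false) == expect_include);

  // Without an initial value the empty segment (index 4) is null.
  assert(!segmented_any(input, null_policy::EXCLUDE, std::nullopt)[4].has_value());
}
```

In this model the initial value only changes the outcome for segments with no valid rows, which matches the statement that null rows remain governed by the null_policy parameter.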

Minor fix to the `SegmentedReductionTest/AnyExcludeNulls` gtest to use `0/false` as the initial value to better test and demonstrate the usage. Found this when looking for an example to answer issue #10455. Also reworked the code in this test to use variables to help minimize copy errors and shorten the code size.

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- https://github.com/nvdbaranec
- Bradley Dice (https://github.com/bdice)

URL: #12940

vyasr · 2023-03-15T20:50:54Z

Can we close this now, then? Or is there an outstanding ask remaining?

gerashegalov · 2023-03-15T22:13:41Z

> Can we close this now, then? Or is there an outstanding ask remaining?

Thanks, I don't believe so. We'll open more issues if needed

gerashegalov added the feature request (New feature or request) and Needs Triage (Need team to review and classify) labels Mar 18, 2022

gerashegalov added the Spark (Functionality that helps Spark RAPIDS) label Mar 18, 2022

github-actions bot added the inactive-30d label Apr 17, 2022

jrhemstad added the 0 - Backlog (In queue waiting for assignment) label and removed the inactive-30d label May 25, 2022

gerashegalov changed the title ~~[FEA] customizable default for segmented_reduce value on empty lists~~ [FEA] customizable default for segmented_reduce value on empty segments May 25, 2022

bdice mentioned this issue May 27, 2022

[FEA] Allow initial value for cudf::reduce and cudf::segmented_reduce. #11002

Closed

GregoryKimball added the libcudf (Affects libcudf (C++/CUDA) code.) label and removed the Needs Triage (Need team to review and classify) label Jun 28, 2022

davidwendt mentioned this issue Mar 14, 2023

Fix cudf::segmented_reduce gtest for ANY aggregation #12940

Merged

3 tasks

gerashegalov mentioned this issue Mar 14, 2023

Utilize cudf::(segmented_)reduce with the default value in spark-rapids NVIDIA/spark-rapids#7888

Open

gerashegalov closed this as completed Mar 15, 2023
