[FEA] customizable default for segmented_reduce value on empty segments #10455
Comments
Makes sense to me. No different than cub's segmented reduce taking an initial value.
Providing an initial value for the reduction is different than post-processing to replace null values. Which do you want? There's an implicit 'operator identity' right now as the default value, and its behavior wrt null inclusion/exclusion has the potential to change if a default value besides the identity is allowed. For example, consider the operator
If it's (1) you're seeking, this could be in libcudf. If it's (2) you're seeking, I would recommend a trivial postprocessing. Here's a larger matrix of examples (edit: I'm re-reading/editing this to ensure it's aligned with my expectations):
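To make the distinction between the two options concrete, here is a minimal host-side sketch in plain C++. It is not the libcudf API: `segmented_sum_sketch`, `replace_nulls_sketch`, and `NullableInt` are hypothetical names, `std::nullopt` stands in for a null row, and the sketch assumes the initial value is folded into every segment rather than only the empty ones.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

using NullableInt = std::optional<int>;  // std::nullopt models a null row

// Initial-value option (hypothetical helper): every segment starts from init.
// Segment i covers values[offsets[i]] up to, but excluding, values[offsets[i + 1]].
std::vector<NullableInt> segmented_sum_sketch(
    const std::vector<int>& values,
    const std::vector<std::size_t>& offsets,
    NullableInt init = std::nullopt)
{
  std::vector<NullableInt> out;
  for (std::size_t i = 0; i + 1 < offsets.size(); ++i) {
    NullableInt acc = init;
    for (std::size_t j = offsets[i]; j < offsets[i + 1]; ++j) {
      acc = acc ? *acc + values[j] : values[j];
    }
    out.push_back(acc);  // stays null for an empty segment when init is null
  }
  return out;
}

// Post-processing option (hypothetical helper): only null rows are rewritten.
std::vector<NullableInt> replace_nulls_sketch(std::vector<NullableInt> rows, int replacement)
{
  for (auto& row : rows) {
    if (!row) { row = replacement; }
  }
  return rows;
}
```

With segments `[1, 2]`, `[]`, `[3]` and the value 10, the initial-value variant yields 13, 10, 13, while post-processing the null-producing reduction yields 3, 10, 3. The two agree only when the chosen value is the operator's identity.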
Unless I'm misunderstanding, I don't think @gerashegalov is asking for post-processing to replace nulls. He's saying that when there is an empty segment it currently returns This is a natural extension of how https://godbolt.org/z/e8TYo4Yd6 I was a little surprised that we don't have an
Correct, we have already implemented postprocessing because there is no way to specify the initial value for aggregation: https://github.com/NVIDIA/spark-rapids/blob/branch-22.06/sql-plugin/src/main/scala/com/nvidia/spark/rapids/higherOrderFunctions.scala#L398-L402
Both
Without the initial value, the empty segments will result in a null row.
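The behavior described here can be modeled for the Boolean `any` case with a small host-side sketch. This is only an illustration under stated assumptions, not libcudf code: `segmented_any_sketch` is a hypothetical function, `std::nullopt` stands in for a null row, and nulls inside the input are not modeled.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical model of an "any" segmented reduction with an optional initial value.
// offsets has one more entry than there are segments; segment i covers
// values[offsets[i]] up to, but excluding, values[offsets[i + 1]].
std::vector<std::optional<bool>> segmented_any_sketch(
    const std::vector<bool>& values,
    const std::vector<std::size_t>& offsets,
    std::optional<bool> init = std::nullopt)
{
  std::vector<std::optional<bool>> out;
  for (std::size_t i = 0; i + 1 < offsets.size(); ++i) {
    std::optional<bool> acc = init;  // null unless an initial value was given
    for (std::size_t j = offsets[i]; j < offsets[i + 1]; ++j) {
      const bool v = values[j];
      acc = acc.has_value() ? (*acc || v) : v;
    }
    out.push_back(acc);  // an empty segment keeps whatever init was
  }
  return out;
}
```

For segments `[true, false]`, `[]`, `[false]`, calling the sketch without an initial value leaves the middle row null, while passing `false` turns it into `false`, which matches the intuition for an `exists` aggregation over empty input.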
Minor fix to the `SegmentedReductionTest/AnyExcludeNulls` gtest to use `0/false` as the initial value to better test and demonstrate the usage. Found this when looking for an example to answer issue #10455.

Also reworked the code in this test to use variables to help minimize copy errors and shorten the code size.

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- https://github.com/nvdbaranec
- Bradley Dice (https://github.com/bdice)

URL: #12940
Can we close this now, then? Or is there an outstanding ask remaining?
Thanks, I don't believe so. We'll open more issues if needed.
Is your feature request related to a problem? Please describe.
Currently segmented_reduce returns NULL for various inputs, including empty inputs. For empty inputs, NULL is not always a natural choice. For example, when implementing a Boolean aggregation exists (NVIDIA/spark-rapids#4973) for Spark SQL, empty input quite intuitively results in false.
Describe the solution you'd like
The most generic solution is simply to allow the user to pass the initial value for the reduction, similar to Scala fold.
Describe alternatives you've considered
As a workaround, we have to compute a Bool column for a post-processing step that fixes up the "wrong" results.