Fix null handling for structs `min` and `arg_min` in groupby, groupby scan, reduction, and inclusive_scan #9864
Says who?
Says Spark:
And struct:
libcudf is not Spark. Here's what the docs say happens with nulls in cudf/cpp/include/cudf/reduction.hpp, line 41 and lines 75 to 76 in 2e95fb1.
Got it. I'll update the PR. Thanks. Update: Done.
row_arg_minmax_fn(size_type const num_rows,
                  table_device_view const& table,
Is `num_rows` here the same as `table.num_rows()`? `table_device_view::num_rows()` is callable on the host, so you could drop the parameter.
Done.
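The suggestion above can be illustrated with a minimal host-side sketch. Everything here is hypothetical: `table_view_sketch`, `arg_min_fn_before`, and `arg_min_fn_after` are stand-ins invented for illustration, not the libcudf types or the functor from this PR. It only shows why deriving the row count from the table removes a redundant argument.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-in for a table view; not the libcudf type.
struct table_view_sketch {
  std::vector<std::vector<int>> columns;
  std::size_t num_rows() const { return columns.empty() ? 0 : columns.front().size(); }
};

// Before: the row count travels next to the table it describes,
// so the caller can pass two values that disagree.
struct arg_min_fn_before {
  arg_min_fn_before(std::size_t num_rows_, table_view_sketch const& table_)
    : num_rows{num_rows_}, table(table_)
  {
    // The redundant parameter needs a consistency check like this one.
    assert(num_rows_ == table_.num_rows());
  }
  std::size_t num_rows;
  table_view_sketch const& table;
};

// After: the row count is derived from the table inside the constructor.
struct arg_min_fn_after {
  explicit arg_min_fn_after(table_view_sketch const& table_)
    : num_rows{table_.num_rows()}, table(table_)
  {
  }
  std::size_t num_rows;
  table_view_sketch const& table;
};
```

With the second form, a call site only passes the table, and the stored row count cannot drift from it.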
auto constexpr is_min_op = K == aggregation::MIN;
auto const binop_generator =
  cudf::reduction::detail::comparison_binop_generator(values, is_min_op, stream);
It would be a little cleaner to pass the aggregation to the `comparison_binop_generator` and have it do the `is_min_op` logic internally. That way the user doesn't have to type it out every time.
Done.
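As a rough illustration of the review suggestion above, the sketch below moves the "is this a min-like operation?" decision into the generator's constructor. The names `agg_kind`, `comparison_generator`, and `reduce` are hypothetical and are not the libcudf API; treating `ARGMIN` as min-like is also an assumption made for this example.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-in for an aggregation kind; not the libcudf enum.
enum class agg_kind { MIN, MAX, ARGMIN, ARGMAX };

// Hypothetical generator that works out is_min_op internally, so call
// sites pass the aggregation kind instead of a precomputed bool.
class comparison_generator {
 public:
  explicit comparison_generator(agg_kind kind)
    : is_min_op_{kind == agg_kind::MIN || kind == agg_kind::ARGMIN}
  {
  }

  // Returns a binary op that keeps the smaller value for min-like
  // aggregations and the larger value otherwise.
  std::function<int(int, int)> binop() const
  {
    bool const is_min = is_min_op_;
    return [is_min](int lhs, int rhs) {
      if (is_min) { return rhs < lhs ? rhs : lhs; }
      return lhs < rhs ? rhs : lhs;
    };
  }

 private:
  bool is_min_op_;
};

// Folds a non-empty range with the generated binary op.
inline int reduce(std::vector<int> const& values, comparison_generator const& gen)
{
  assert(!values.empty());
  auto const op = gen.binop();
  int result    = values.front();
  for (int v : values) { result = op(result, v); }
  return result;
}
```

A call site then reads `reduce(values, comparison_generator{agg_kind::MIN})`, with no `is_min_op` expression repeated at each use.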
"" /*NULL*/,
"" /*NULL*/,
"" /*NULL*/},
null_at(2)};
The `null_at(2)` does not seem to match the `NULL` comments above. Perhaps I'm reading this wrong.
That child column has a null at index 2. The following nulls are in the parent column and are specified below (`nulls_at({3, 4, ...})`). Sorry for the confusion.
Edit: I have added more comments to clarify it.
@gpucibot merge
When finding `min`, `max`, `arg_min` and `arg_max` for structs in groupby, groupby scan, reduction and inclusive_scan operations, null struct rows should be excluded from the operation (but null rows in its child columns should not). The current implementation for structs wrongly includes nulls at all levels, producing wrong results for `min` and `arg_min` operations.
This PR fixes that. In particular, null rows at the child levels are still handled the old way (nulls are smaller than non-null elements), but the handling of nulls at the top parent column level is modified such that:
`min` and `arg_min`, or `max` and `arg_max`.