Change `nullable()` to `has_nulls()` in `cudf::detail::gather` #14363
If I have a column with an allocated null mask but no null values, and call …
If a null mask is allocated but there are zero null values, it would be fine to take a shortcut and not gather the bitmasks: just allocate a null mask with all valid entries before returning. That could be a special-case compromise between schema integrity and performance here.
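The shortcut suggested above can be sketched on the host as follows. This is a toy model under stated assumptions, not the cudf implementation: `mask_column` and `gather_rows` are hypothetical names, the validity mask is a plain `std::vector<bool>`, and all indices are assumed to be in bounds.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical host-side column model; not a cudf type.
struct mask_column {
  std::vector<int> data;
  std::optional<std::vector<bool>> validity;  // empty optional: no mask allocated
  std::size_t null_count = 0;
};

// Gathers the rows of `src` at `indices` (assumed to be in bounds).
inline mask_column gather_rows(mask_column const& src, std::vector<std::size_t> const& indices)
{
  mask_column out;
  out.data.reserve(indices.size());
  for (auto i : indices) { out.data.push_back(src.data[i]); }

  if (!src.validity) { return out; }  // no mask in, no mask out

  if (src.null_count == 0) {
    // Shortcut: keep the output nullable, but skip the per-row bit gather
    // and allocate an all-valid mask instead.
    out.validity = std::vector<bool>(indices.size(), true);
    return out;
  }

  // General path: gather validity bits row by row and recount the nulls.
  std::vector<bool> bits;
  bits.reserve(indices.size());
  for (auto i : indices) {
    bool const valid = (*src.validity)[i];
    bits.push_back(valid);
    if (!valid) { ++out.null_count; }
  }
  out.validity = std::move(bits);
  return out;
}
```

In this sketch the output keeps a mask whenever the input had one, so the schema is preserved, while the row-by-row bit gather only runs when at least one null is present.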
@bdice good question, I'll try it out
Closing in favor of #14366
@divyegala I can check the failures you saw in CI after it runs.
Approving with one name change suggestion.
Not sure if my approval should count given that I've made significant modifications to this PR, but I did review it substantially and think the logic is sound.
Co-authored-by: Bradley Dice <[email protected]>
 * @param input The table to check for nullable columns
 * @return True if the table has nullable columns at any level of the column hierarchy, false otherwise
 */
inline bool has_nested_nullable_columns(table_view const& input)
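The function body is not shown in this excerpt. As a rough illustration of what "at any level of the column hierarchy" means, here is a minimal host-side sketch. `toy_column` is a hypothetical stand-in for a column view with children, and a table is modeled as a plain vector of columns; none of this is the cudf code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a column view; not the cudf type.
struct toy_column {
  bool mask_allocated = false;       // plays the role of nullable()
  std::size_t null_count = 0;        // plays the role of null_count()
  std::vector<toy_column> children;  // nested (list/struct) children
};

// True if this column or any of its descendants has a null mask allocated.
inline bool has_nested_nullable_columns(toy_column const& col)
{
  if (col.mask_allocated) { return true; }
  for (auto const& child : col.children) {
    if (has_nested_nullable_columns(child)) { return true; }
  }
  return false;
}

// Table-level overload: a table is modeled as a vector of columns.
inline bool has_nested_nullable_columns(std::vector<toy_column> const& table)
{
  for (auto const& col : table) {
    if (has_nested_nullable_columns(col)) { return true; }
  }
  return false;
}
```

A parent column with no mask of its own still counts as nullable here if any child has a mask allocated.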
Same with this implementation. This could be moved to https://github.com/rapidsai/cudf/blob/branch-23.12/cpp/src/table/table.cpp
This applies to the other non-templated `inline` functions in this header as well, so I would be OK if this were done in a follow-up PR (in 24.02, too).
Let's make an issue for this.
Approving, pending some docstring additions.
@@ -49,6 +49,7 @@ struct calculate_quantile_fn {
   double const* d_quantiles;
   size_type num_quantiles;
   interpolation interpolation;
+  size_type* null_count;
This seems to be unrelated to this PR, so ideally it should be in a separate PR. I'm fine with keeping it here, but please clarify that in the PR description.
Done
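To illustrate the kind of fix the added `null_count` pointer suggests, here is a small host-side sketch in which a functor marks a row null and bumps a shared counter in the same step. Everything in it is an assumption made for illustration: `toy_quantile_fn`, the use of empty groups as the null case, and the "middle element" median are not taken from the cudf source. On a GPU, a shared counter like this would also need an atomic increment.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical functor; not cudf's calculate_quantile_fn.
struct toy_quantile_fn {
  std::vector<std::vector<double>> const* groups;  // sorted values per group
  std::vector<double>* results;
  std::vector<bool>* validity;
  int* null_count;  // shared counter, analogous to the added size_type* member

  void operator()(std::size_t group) const
  {
    auto const& values = (*groups)[group];
    if (values.empty()) {
      // Assumption: an empty group yields a null result.
      (*validity)[group] = false;
      ++(*null_count);  // keep the count in sync with the mask
      return;
    }
    // Middle element of the sorted values (upper median for even sizes).
    (*results)[group]  = values[values.size() / 2];
    (*validity)[group] = true;
  }
};
```

The point of the sketch is only that every write of an invalid bit is paired with a counter update, so the reported null count cannot drift from the mask.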
cpp/include/cudf/detail/gather.cuh (outdated)
auto const has_nulls =
  bounds_policy == out_of_bounds_policy::NULLIFY || cudf::has_nested_nulls(source_table);
Should this use `&&` instead? Because if we indeed don't have any nulls here then we don't need to call `gather_bitmask`.
Probably I misunderstood the usage of this variable. So this variable should be called `need_new_bitmask` or so. It should not be `has_nulls`.
But we need to call `gather_bitmask` if `out_of_bounds_policy::NULLIFY`. `gather_bitmask` will help nullify any OOB accesses.
Okay, so `||` is indeed needed, but please rename that variable.
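A minimal host-side sketch of why `||` is the right operator: with a nullify-style policy, an out-of-bounds index must produce a null row even when the source has no nulls at all, so a new bitmask is needed in either case. The names `toy_bounds_policy`, `gathered`, and `gather_with_policy` are hypothetical, and an empty `src_validity` is used here to mean "source has no nulls"; this is not the cudf implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical policy enum, modeled loosely on the one discussed above.
enum class toy_bounds_policy { NULLIFY, DONT_CHECK };

struct gathered {
  std::vector<int> data;
  std::vector<bool> validity;  // left empty when no bitmask was needed
};

inline gathered gather_with_policy(std::vector<int> const& src,
                                   std::vector<bool> const& src_validity,  // empty: no nulls
                                   std::vector<std::size_t> const& indices,
                                   toy_bounds_policy policy)
{
  bool const source_has_nulls = !src_validity.empty();
  // Either condition alone is enough to require an output bitmask.
  bool const needs_new_bitmask = policy == toy_bounds_policy::NULLIFY || source_has_nulls;

  gathered out;
  for (auto i : indices) {
    bool const in_bounds = i < src.size();
    // The bounds guard keeps this sketch well-defined for any index;
    // out-of-bounds rows get a placeholder value of 0.
    out.data.push_back(in_bounds ? src[i] : 0);
    if (needs_new_bitmask) {
      bool const valid = in_bounds && (src_validity.empty() || src_validity[i]);
      out.validity.push_back(valid);
    }
  }
  return out;
}
```

With `&&` instead, a null-free source gathered under the nullify policy would produce no bitmask, and the out-of-bounds rows could not be marked null.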
Done
auto needs_new_bitmask = bounds_policy == out_of_bounds_policy::NULLIFY ||
                         cudf::has_nested_nullable_columns(source_table);
if (needs_new_bitmask) {
  needs_new_bitmask = needs_new_bitmask || cudf::has_nested_nulls(source_table);
Non-blocking suggestion:
- needs_new_bitmask = needs_new_bitmask || cudf::has_nested_nulls(source_table);
+ needs_new_bitmask |= cudf::has_nested_nulls(source_table);
We'll have to rerun CI again. Happy to make this change if you feel strongly about it, otherwise let's pass on it
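One general C++ point that may be worth keeping in mind for this suggestion: the two forms are not strictly equivalent, because `||` short-circuits and `|=` does not. The sketch below uses a hypothetical `expensive_check` with a call counter to make the difference observable; it says nothing about the actual cost of `cudf::has_nested_nulls`.

```cpp
#include <cassert>

// Hypothetical call counter so the number of evaluations can be observed.
inline int& scan_calls()
{
  static int calls = 0;
  return calls;
}

// Stand-in for an expensive check such as a scan for nulls.
inline bool expensive_check()
{
  ++scan_calls();
  return true;
}

// `flag = flag || f()` skips f() when `flag` is already true.
inline bool update_short_circuit(bool flag)
{
  flag = flag || expensive_check();
  return flag;
}

// `flag |= f()` always evaluates f(), because `|` does not short-circuit.
inline bool update_bitwise(bool flag)
{
  flag |= expensive_check();
  return flag;
}
```

In the snippet as quoted, the flag is already true when the assignment runs, so the two spellings would differ in whether the right-hand check is evaluated at all.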
/merge
Description
In #13795, we found out that `nullable()` causes severe perf degradation for the nested-type case when the input is read from file via `cudf::io::read_json`. This is because the JSON reader adds a null mask for columns that don't have NULLs. This change is a no-overhead replacement that checks the actual null count instead of checking if a null mask is present.

This PR also solves a bug in quantile/median groupby where NULLs were being set but the null count was not updated.
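The distinction the description relies on can be shown with a small host-side model. `host_column` is a hypothetical type, not cudf's column view, and the sketch recounts nulls on every call for simplicity, whereas the description states that the real check adds no overhead.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical host-side model of a column; not the cudf type.
struct host_column {
  std::vector<int> data;
  std::optional<std::vector<bool>> validity;  // empty optional: no mask allocated

  // Mirrors the idea behind nullable(): is a mask allocated at all?
  bool nullable() const { return validity.has_value(); }

  // Counts the rows whose validity bit is false.
  std::size_t null_count() const
  {
    if (!validity) { return 0; }
    std::size_t count = 0;
    for (bool valid : *validity) {
      if (!valid) { ++count; }
    }
    return count;
  }

  // Mirrors the idea behind has_nulls(): is at least one row actually null?
  bool has_nulls() const { return null_count() > 0; }
};
```

A column that carries an all-valid mask, as described for the JSON reader's output, reports `nullable()` as true and `has_nulls()` as false in this model, which is the case where the mask-handling work can be skipped.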