Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change nullable() to has_nulls() in cudf::detail::gather #14363

Merged
merged 18 commits into from
Nov 16, 2023

Conversation

divyegala
Copy link
Member

@divyegala divyegala commented Nov 6, 2023

Description

In #13795, we found out that nullable() causes severe perf degradation for the nested-type case when the input is read from file via cudf::io::read_json. This is because the JSON reader adds a null mask for columns that don't have NULLs. This change is a no-overhead replacement that checks the actual null count instead of checking if a null mask is present.

This PR also solves a bug in quantile/median groupby where NULLs were being set but the null count was not updated.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@divyegala divyegala added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Nov 6, 2023
@divyegala divyegala requested a review from a team as a code owner November 6, 2023 18:18
@divyegala divyegala self-assigned this Nov 6, 2023
@divyegala divyegala changed the title Change nullable() to has_nulls() in cudf::detail::gather` Change nullable() to has_nulls() in cudf::detail::gather Nov 6, 2023
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Nov 6, 2023
@bdice
Copy link
Contributor

bdice commented Nov 6, 2023

If I have a column with an allocated null mask but no null values, and call gather, does this change whether the output column has an allocated null mask? If a null mask is allocated for the input, libcudf should preserve that property on the output even if there are no null values in the column. This is problematic for round-tripping data, as nullability of a column is a part of the schema/metadata even if there are no nulls declared.

If a null mask is allocated but there are zero null values, it would be fine to take a shortcut and not gather the bitmasks -- just allocate a nullmask with all valid entries before returning. That could be a special-case compromise between schema integrity and performance here.

@divyegala
Copy link
Member Author

@bdice good question, I'll try it out

@divyegala
Copy link
Member Author

Closing in favor of #14366

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@divyegala I can check the failures you saw in CI after it runs.

cpp/include/cudf/detail/gather.cuh Outdated Show resolved Hide resolved
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving with one name change suggestion.

cpp/include/cudf/table/table_view.hpp Outdated Show resolved Hide resolved
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if my approval should count given that I've made significant modifications to this PR, but I did review it substantially and think the logic is sound.

cpp/include/cudf/detail/gather.cuh Outdated Show resolved Hide resolved
cpp/include/cudf/detail/gather.cuh Outdated Show resolved Hide resolved
cpp/include/cudf/detail/gather.cuh Outdated Show resolved Hide resolved
* @param input The table to check for nullable columns
* @return True if the table has nullable columns at any level of the column hierarchy, false otherwise
*/
inline bool has_nested_nullable_columns(table_view const& input)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same with this implementation. This could be moved to https://github.com/rapidsai/cudf/blob/branch-23.12/cpp/src/table/table.cpp
This applies to the other non-templated inline functions in this header as well so I would be ok if this was done in a follow up PR -- in 24.02 too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make an issue for this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@divyegala divyegala requested a review from davidwendt November 15, 2023 15:07
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving up to some docstrings additions.

cpp/include/cudf/detail/null_mask.hpp Outdated Show resolved Hide resolved
@@ -49,6 +49,7 @@ struct calculate_quantile_fn {
double const* d_quantiles;
size_type num_quantiles;
interpolation interpolation;
size_type* null_count;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be an unrelated change to this PR so ideally it should be in a separate PR. But I'm fine to keep this here but please clarify that in the PR description.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 679 to 680
auto const has_nulls =
bounds_policy == out_of_bounds_policy::NULLIFY || cudf::has_nested_nulls(source_table);
Copy link
Contributor

@ttnghia ttnghia Nov 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this use && instead? Because if we indeed don't have any nulls here then we don't need to call gather_bitmask.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably I misunderstood the usage of this variable. So this variable should be called need_new_bitmask or so. It should not be has_nulls.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we need to call gather_bitmask if out_of_bounds_policy::NULLIFY. gather_bitmask will help nullify any OOB accesses.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so || is indeed needed, but please rename that variable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@vyasr vyasr requested a review from ttnghia November 16, 2023 16:53
auto needs_new_bitmask = bounds_policy == out_of_bounds_policy::NULLIFY ||
cudf::has_nested_nullable_columns(source_table);
if (needs_new_bitmask) {
needs_new_bitmask = needs_new_bitmask || cudf::has_nested_nulls(source_table);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking suggestion:

Suggested change
needs_new_bitmask = needs_new_bitmask || cudf::has_nested_nulls(source_table);
needs_new_bitmask |= cudf::has_nested_nulls(source_table);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll have to rerun CI again. Happy to make this change if you feel strongly about it, otherwise let's pass on it

@divyegala
Copy link
Member Author

/merge

@rapids-bot rapids-bot bot merged commit 8e1ef05 into rapidsai:branch-23.12 Nov 16, 2023
65 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants