Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Make calling to purge_nonempty_nulls optional in various places #12567

Closed
ttnghia opened this issue Jan 18, 2023 · 4 comments
Closed

[FEA] Make calling to purge_nonempty_nulls optional in various places #12567

ttnghia opened this issue Jan 18, 2023 · 4 comments
Assignees
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue

Comments

@ttnghia
Copy link
Contributor

ttnghia commented Jan 18, 2023

There were reported performance regressions in make_lists_column and make_structs_column recently after calling to purge_nonempty_nulls has been added to these factory functions. We need to sanitize (i.e., remove non-empty nulls) for the input data but both checking and removing non-empty nulls may incur some (even significant) overhead.

I propose adding a parameter to the callers of purge_nonempty_nulls such as:

superimpose_nulls(..., std::unique_ptr<column>&& input, bool sanitize_input, ....);

By having such parameter (bool sanitize_input), we can make the calls to purge_nonempty_nulls optional. In some places such as data IO or some custom kernel, we know for sure that all the nulls are empty thus we will not have to waste the overhead of checking non-empty nulls.

@ttnghia ttnghia added feature request New feature or request Needs Triage Need team to review and classify labels Jan 18, 2023
@ttnghia ttnghia self-assigned this Jan 18, 2023
@ttnghia ttnghia added the Performance Performance related issue label Jan 18, 2023
@ttnghia
Copy link
Contributor Author

ttnghia commented Jan 18, 2023

Targeting 23.04 release.

@GregoryKimball GregoryKimball added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Apr 2, 2023
@GregoryKimball GregoryKimball moved this to Needs owner in libcudf Apr 2, 2023
@GregoryKimball
Copy link
Contributor

I would like to connect this issue and #12786. I would like to identify any libcudf algorithms that generate nonempty nulls and only add sanitization where it is needed due to implementation details for particular algorithms.

@GregoryKimball GregoryKimball removed the status in libcudf Apr 7, 2023
@GregoryKimball GregoryKimball moved this to Needs owner in libcudf Oct 11, 2023
@GregoryKimball GregoryKimball moved this from Needs owner to To be revisited in libcudf Oct 26, 2023
@GregoryKimball
Copy link
Contributor

GregoryKimball commented Jan 5, 2024

To add some context to this work, we recently added null sanitization checks in #14559, and also started simplifying nulls checking in #13312.

It's also worth mentioning that purge_nonempty_null has caused degenerate performance with copying columns with deeply nested structs (>16 levels). See #14363 for a patch that helped us avoid null sanitization for columns with no parent nulls. The performance observations are captured in issue #TBD.

We should continue to minimize the usage of purge_nonempty_nulls, as part of following our policy "libcudf expects nested types to have sanitized null masks" in the developer guide).

@ttnghia
Copy link
Contributor Author

ttnghia commented Nov 18, 2024

Close this as it is replaced by #17356.

@ttnghia ttnghia closed this as completed Nov 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue
Projects
Status: To be revisited
Development

No branches or pull requests

2 participants