[FEA] Either improve support for or remove type_id::EMPTY #12477

Open
vyasr opened this issue Jan 5, 2023 · 1 comment
Labels
0 - Backlog (In queue waiting for assignment), feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code)

Comments

@vyasr
Contributor

vyasr commented Jan 5, 2023

Is your feature request related to a problem? Please describe.
libcudf supports an empty type, type_id::EMPTY, analogous to Arrow's null type, which represents a column of all null values. However, functionality for this type is implemented only piecemeal, and there are likely many cases where libcudf will fail if given such a column (#10761 is one fairly recent example).

Describe the solution you'd like
We should reevaluate the usage of EMPTY columns in libcudf, either removing them altogether or making them work more consistently across the code base. Removal seems like the simplest path forward, but some parts of cuIO do appear to leverage EMPTY columns, and there's an argument to be made that we should maintain this type regardless, for conformance with the Arrow spec. If we keep it, we should make it easier to test APIs with such columns to ensure that they are handled appropriately. We may also need to improve handling of these columns in the higher-level APIs backed by libcudf, such as cuDF Python or the Spark plugin.

Additional context
It's worth noting that AFAICT a null column is trivial to optimize storage for since all that's needed is a size (both null mask and data are redundant). I don't think such columns are useful enough to spend much engineering effort on optimizations, though.
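To make the storage point concrete, here is a minimal host-only sketch. It is not libcudf code: `all_null_column` and `materialized_null_int8` are hypothetical types invented for illustration, and the one-bit-per-row validity mask packed into 32-bit words is an assumption of this sketch. It contrasts an all-null column that keeps only its size with one that materializes both a data buffer and a validity mask.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration only; this is not libcudf's column class.
// An all-null column needs no data buffer and no null mask: the size alone
// determines the null count and the validity of every row.
struct all_null_column {
  std::size_t size;

  std::size_t null_count() const { return size; }
  bool is_valid(std::size_t i) const
  {
    assert(i < size);  // bounds check only; every in-range row is null
    return false;
  }
  std::size_t storage_bytes() const { return sizeof(size); }
};

// For comparison: the same column materialized as 8-bit integers plus a
// validity bitmask (assumed here: one bit per row, 32 rows per word, all cleared).
struct materialized_null_int8 {
  std::vector<std::int8_t> data;
  std::vector<std::uint32_t> mask;

  explicit materialized_null_int8(std::size_t n) : data(n, 0), mask((n + 31) / 32, 0u) {}

  std::size_t size() const { return data.size(); }
  bool is_valid(std::size_t i) const
  {
    assert(i < data.size());
    return ((mask[i / 32] >> (i % 32)) & 1u) != 0u;
  }
  std::size_t storage_bytes() const
  {
    return data.size() * sizeof(std::int8_t) + mask.size() * sizeof(std::uint32_t);
  }
};
```

For 1000 rows, the materialized form in this sketch holds 1000 data bytes plus 32 mask words (1128 bytes), while the size-only form holds a single `std::size_t`. Both buffers in the materialized form carry no information beyond the row count.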

vyasr added the feature request (New feature or request) and Needs Triage (Need team to review and classify) labels Jan 5, 2023
@revans2
Contributor

revans2 commented Jan 5, 2023

Spark also has a NullType, but it is not extensively used. It only shows up in cases where a null is explicitly put into the SQL with no type information at that point. It is never used for any computation that I can see without casting it to a specific type first. So, like I said, this is very minor, and we are fine with our current solution, where we store it as an INT8 that is null.
