Removing int8 column option from parquet byte_array writing #11539
Conversation
Codecov Report: Base 87.40% // Head 88.37% // Increases project coverage by +0.96%.
Additional details and impacted files:
@@ Coverage Diff @@
## branch-22.12 #11539 +/- ##
================================================
+ Coverage 87.40% 88.37% +0.96%
================================================
Files 133 133
Lines 21833 22508 +675
================================================
+ Hits 19084 19892 +808
+ Misses 2749 2616 -133
☔ View full report at Codecov.
LGTM except for the aforementioned concerns about values outside int8's range.
Based on the description, I would have expected this to use UINT8 strictly instead of INT8, since the sign bit has no meaning in plain binary data like this. It would probably also help with the boundary issues mentioned in the other review comments.
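For illustration, a minimal standalone C++ sketch of the boundary problem: byte values above 127 fit a uint8, but wrap to negative values when treated as an int8.

```cpp
#include <cstdint>
#include <iostream>

int main()
{
  // A raw byte with the high bit set is 255 as an unsigned byte, but
  // reinterpreting it as int8 yields -1 -- values in [128, 255] fall
  // outside int8's [-128, 127] range, which is the boundary issue above.
  unsigned char const raw = 0xFF;
  std::cout << +static_cast<std::uint8_t>(raw) << '\n';  // prints 255
  std::cout << +static_cast<std::int8_t>(raw) << '\n';   // prints -1 (wraps)
  return 0;
}
```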
Co-authored-by: Bradley Dice <[email protected]>
I would typically use unsigned int values for bytes as well, and I would agree with you about the string argument if it weren't for the strong coupling that exists between the two. When reading binary, we always read it as a string, and we can decompose the column and build it again as a list of bytes if desired. The column type that we get from pulling apart a string column is an int8 column, for reasons unknown to me. This seemed to set a precedent for int8 byte representations inside cudf and also made list the easy conversion path. As a result, reading binary data from a parquet file currently returns a list type column. That said, the easy path is by no means the best path. If we have consensus that uint8 should be the type we use and expect, I am OK with that, but it will require more than a few changes. It certainly fits more naturally into the parquet statistics.
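A hedged sketch of the decomposition described above, assuming the 22.12-era strings_column_view accessors (the function name below is illustrative):

```cpp
#include <cudf/column/column_view.hpp>
#include <cudf/strings/strings_column_view.hpp>

// A strings column decomposes into (offsets, chars); at the time of
// this PR the chars child carried type_id::INT8.
void inspect_string_children(cudf::column_view const& col)
{
  cudf::strings_column_view const scv(col);
  auto const offsets = scv.offsets();  // INT32 offsets child
  auto const chars   = scv.chars();    // chars child; typed INT8 at the time
  // Re-wrapping these children as a LIST column yields list<int8>,
  // which is why binary reads surfaced as a list-typed column.
  (void)offsets;
  (void)chars;
}
```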
Beyond the scope of this PR, but I'd be interested in a discussion about this. Since strings in cuDF are UTF-8, not C-style char arrays, why not use UINT8 as the child column type for both binary and string data? Is it for ease of binding to Python strings?
From what I can gather, cudf didn't have a uint8 type when strings were implemented. I agree that unsigned is a better option, but I wonder what impact changing it would have on external users. I think the best option is to create an issue to discuss changing the behavior.
I don't think consistency with a different type (i.e.
…quet_binary_uint8_removal
Updated this to change the column type to uint8 instead of int8. Now binary writes only accept uint8 columns and binary reads come back as uint8 columns.
Also resolves NVIDIA/spark-rapids#6408 with the added changes that standardize casting to list of bytes. There should be no additional changes needed on the plugin side based on my testing + integration tests.
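As a hedged usage sketch of the read side, assuming the 22.12-era reader API from the binary read/write work in #11526 (reader_column_schema and set_convert_binary_to_strings; the function name and path are illustrative):

```cpp
#include <cudf/io/parquet.hpp>

#include <string>
#include <vector>

// Disabling binary-to-string conversion makes a binary column come
// back as list<uint8> after this PR.
cudf::io::table_with_metadata read_binary_column(std::string const& path)
{
  std::vector<cudf::io::reader_column_schema> schema{
    cudf::io::reader_column_schema{}.set_convert_binary_to_strings(false)};
  auto opts =
    cudf::io::parquet_reader_options::builder(cudf::io::source_info{path})
      .build();
  opts.set_column_schema(schema);
  return cudf::io::read_parquet(opts);
}
```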
Approving with one question. Happy to defer on that if you don't want to make those changes in this PR.
data->size(),
std::move(*data),
std::move(*null_mask),
UNKNOWN_NULL_COUNT);
Should we fetch the null count before releasing the data, and pass it here? We should try to retain any information that may have been previously computed.
Generally, I think we're going to move away from lazily computing null counts as we work on adding streams to the public API. There are some hiccups with the lazy approach and streams, but primarily we aren't being as efficient as possible wrt avoiding calls to the null_count kernel, because we aren't tracking information that we may have already computed earlier.
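A minimal sketch of the suggestion, assuming the column/contents API of the time (the function name is illustrative): capture the possibly already-computed null count before release(), instead of rebuilding with UNKNOWN_NULL_COUNT and recomputing lazily later.

```cpp
#include <cudf/column/column.hpp>
#include <cudf/types.hpp>

#include <memory>
#include <utility>

std::unique_ptr<cudf::column> rebuild_as_uint8(std::unique_ptr<cudf::column> col)
{
  auto const size       = col->size();
  auto const null_count = col->null_count();  // may run the kernel once, here
  auto contents         = col->release();     // hands back data + null mask
  return std::make_unique<cudf::column>(cudf::data_type{cudf::type_id::UINT8},
                                        size,
                                        std::move(*contents.data),
                                        std::move(*contents.null_mask),
                                        null_count);
}
```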
I like the idea of moving away from lazily computed null counts and am happy to change this here if we want to start now. In this case, it made sense to defer the null counting because byte cast is often used as an intermediate type for reading, writing, and hashing. Those are usually indifferent to the null count, so it made sense to keep it lazily computed while we still supported that.
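For reference, a hedged sketch of that byte-cast path (cudf::byte_cast from cudf/reshape.hpp; the wrapper name is illustrative):

```cpp
#include <cudf/column/column.hpp>
#include <cudf/column/column_view.hpp>
#include <cudf/reshape.hpp>

#include <memory>

// byte_cast turns each fixed-width value into a list of its bytes
// (UINT8 after this PR). Reading, writing, and hashing consumers of
// this intermediate rarely touch the null count, which is why
// deferring it was harmless here.
std::unique_ptr<cudf::column> to_byte_lists(cudf::column_view const& col)
{
  return cudf::byte_cast(col, cudf::flip_endianness::YES);
}
```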
… into branch-22.12
Adding do not merge because of some upmerge issues. Upmerged to branch-22.12.
rerun tests
@gpucibot merge
…apidsai#11539)" This reverts commit 1effe19.
Description
As suggested in #11526 and captured in issue #11536, the use of both INT8 and UINT8 as supported types for byte_arrays is unnecessary and adds complexity to the code. This change removes INT8 as an option and only allows UINT8 columns to be written out as byte_arrays.
This matches cudf string columns, which contain an INT8 column for data. Closes #11536.
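A minimal usage sketch of the write side, assuming the 22.12-era writer API (set_output_as_binary from the binary-write work in #11526; the function name, path, and column index are illustrative):

```cpp
#include <cudf/io/parquet.hpp>
#include <cudf/table/table_view.hpp>

#include <string>

// After this PR, only a LIST<UINT8> column flagged as binary is
// written out as a parquet byte_array.
void write_binary_column(cudf::table_view const& table, std::string const& path)
{
  cudf::io::table_input_metadata metadata(table);
  metadata.column_metadata[0].set_output_as_binary(true);  // list<uint8> column
  auto opts = cudf::io::parquet_writer_options::builder(
                cudf::io::sink_info{path}, table)
                .metadata(&metadata)
                .build();
  cudf::io::write_parquet(opts);
}
```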
Checklist