-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixing crash when writing binary nested data in parquet #11526
Fixing crash when writing binary nested data in parquet #11526
Conversation
Codecov Report
@@ Coverage Diff @@
## branch-22.10 #11526 +/- ##
===============================================
Coverage ? 86.36%
===============================================
Files ? 145
Lines ? 22949
Branches ? 0
===============================================
Hits ? 19820
Misses ? 3129
Partials ? 0 Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
: col.child(0); | ||
// stop early if writing a byte array, it needs to be a list<int8> not the int8 column | ||
if (col_desc[index].stats_dtype == dtype_byte_array && | ||
(child.type().id() == type_id::INT8 || child.type().id() == type_id::UINT8)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does Spark differentiate binary types stored as int8 and uint8? Are both of these equivalent from Spark’s perspective? Can Spark normalize inputs to use only one or the other? It feels very suspicious to me that we’ve decided to accept either one across the binary code paths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We represent these as list<int8>
due to the way string columns are created, but use std::byte
in byte_array_view
, which is unsigned. I prefer to represent byte-like things as unsigned. I didn't know exactly where we would land and didn't want to have trouble and so I decided to support both signed and unsigned. I'm happy to discuss this further and come to an agreement, but I would suggest a new issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::byte
is neither signed nor unsigned. https://godbolt.org/z/bo1rorxGM
Interpreting a std::byte
as a signed or unsigned requires an explicit cast. We can probably pick one and stick with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cpp reference lists it as: enum class byte : unsigned char {} ;
, which is why I said that. I see your point though and it should be agnostic as it is simply a byte. I would personally pick uint8 for the column type, but it seems that int8 has momentum due to its use in string columns. Happy to go either way. Should I make a new issue to address this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, a new issue sounds good. I'd suggest using int8
to align with strings columns. Moreover, tracking the signed-ness is not relevant to us, since there's no arithmetic operations needed on binary data and certain optimizations may be easier for signed types.
@gpucibot merge |
As suggested in #11526 and captured in issue #11536 the usage of both INT8 and UINT8 as supported types for byte_arrays is unnecessary and adds complexity to the code. This change removes INT8 as an option and only allows UINT8 columns to be written out as byte_arrays. ~~This matches with cudf string columns which contain an INT8 column for data.~~ closes #11536 Authors: - Mike Wilson (https://github.com/hyperbolic2346) Approvers: - Tobias Ribizel (https://github.com/upsj) - Nghia Truong (https://github.com/ttnghia) - David Wendt (https://github.com/davidwendt) - MithunR (https://github.com/mythrocks) - Bradley Dice (https://github.com/bdice) URL: #11539
Description
This fixes the crash described in the bug related to writing nested data in parquet with the binary flag set to write binary data as byte_arrays. We were incorrectly selecting the top-most node instead of the list, which resulted in a crash down in the kernels when the data pointer was null for those upper list columns.
closes #11506
Checklist