Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing crash when writing binary nested data in parquet #11526

Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 10 additions & 6 deletions cpp/src/io/utilities/column_utils.cuh
Original file line number Diff line number Diff line change
Expand Up @@ -67,13 +67,17 @@ rmm::device_uvector<column_device_view> create_leaf_column_device_views(
size_type index) mutable {
col_desc[index].parent_column = parent_col_view.begin() + index;
column_device_view col = parent_col_view.column(index);
if (col_desc[index].stats_dtype != dtype_byte_array) {
// traverse till leaf column
while (col.type().id() == type_id::LIST or col.type().id() == type_id::STRUCT) {
col = (col.type().id() == type_id::LIST)
? col.child(lists_column_view::child_column_index)
: col.child(0);
// traverse till leaf column
while (cudf::is_nested(col.type())) {
auto const child = (col.type().id() == type_id::LIST)
? col.child(lists_column_view::child_column_index)
: col.child(0);
// stop early if writing a byte array
if (col_desc[index].stats_dtype == dtype_byte_array &&
(child.type().id() == type_id::INT8 || child.type().id() == type_id::UINT8)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does Spark differentiate binary types stored as int8 and uint8? Are both of these equivalent from Spark’s perspective? Can Spark normalize inputs to use only one or the other? It feels very suspicious to me that we’ve decided to accept either one across the binary code paths.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We represent these as list<int8> due to the way string columns are created, but use std::byte in byte_array_view, which is unsigned. I prefer to represent byte-like things as unsigned. I didn't know exactly where we would land and didn't want to have trouble and so I decided to support both signed and unsigned. I'm happy to discuss this further and come to an agreement, but I would suggest a new issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::byte is neither signed nor unsigned. https://godbolt.org/z/bo1rorxGM

Interpreting a std::byte as a signed or unsigned requires an explicit cast. We can probably pick one and stick with it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cpp reference lists it as: enum class byte : unsigned char {} ;, which is why I said that. I see your point though and it should be agnostic as it is simply a byte. I would personally pick uint8 for the column type, but it seems that int8 has momentum due to its use in string columns. Happy to go either way. Should I make a new issue to address this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, a new issue sounds good. I'd suggest using int8 to align with strings columns. Moreover, tracking the signed-ness is not relevant to us, since there's no arithmetic operations needed on binary data and certain optimizations may be easier for signed types.

break;
}
col = child;
hyperbolic2346 marked this conversation as resolved.
Show resolved Hide resolved
}
// Store leaf_column to device storage
column_device_view* leaf_col_ptr = leaf_columns.begin() + index;
Expand Down
3 changes: 3 additions & 0 deletions cpp/tests/io/parquet_test.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -4250,6 +4250,9 @@ TEST_F(ParquetWriterTest, ByteArrayStats)

read_footer(source, &fmd);

EXPECT_EQ(fmd.schema[1].type, cudf::io::parquet::Type::BYTE_ARRAY);
EXPECT_EQ(fmd.schema[2].type, cudf::io::parquet::Type::BYTE_ARRAY);

auto const stats0 = parse_statistics(fmd.row_groups[0].columns[0]);
auto const stats1 = parse_statistics(fmd.row_groups[0].columns[1]);

Expand Down