Read FIXED_LEN_BYTE_ARRAY as binary in parquet reader #13437
Conversation
Can this be deferred until #13302 is finished (or rejected >.<)? I think the fixed width byte arrays can be processed more efficiently since they'll behave like any other fixed-width type (sizing can be known up front).
I ran some tests with this and it looks good to me. Note that I just did some very simple tests though.
I like how this is working out. Nice cleanups along the way as well!
@@ -603,7 +611,7 @@ inline __device__ void gpuOutputString(volatile page_state_s* s,
                                        void* dstv)
 {
   auto [ptr, len] = gpuGetStringData(s, sb, src_pos);
-  if (s->dtype_len == 4) {
+  if (s->dtype_len == 4 and (s->col.data_type & 7) == BYTE_ARRAY) {
Should this really be a logical and? It seems in the past we would take this path if the data size was 4 bytes, but now we only do it if the data size is 4 bytes AND it is a byte array.
Before this PR, only `BYTE_ARRAY` invoked `gpuOutputString`, and the input length cannot be 4 in that case. Now that the function can also be invoked for `FIXED_LEN_BYTE_ARRAY`, where the length could be 4, this `and` logic is needed.
maybe a comment along the lines of "// make sure to only hash variable length byte arrays when specified with the output type size"?
cpp/src/io/parquet/page_data.cu
@@ -2027,11 +2035,14 @@ __global__ void __launch_bounds__(decode_block_size) gpuDecodePageData(
     return;
   }

+  auto const data_type = s->col.data_type & 7;
I liked seeing this change above to use this type, but why not move this above and use this variable all around?
`data_type` is not used that often in other functions. Let me have a further look.
IMO ColumnChunkDesc should have `data_type()` and `type_len()` so the bitwise packing madness is hidden from the rest of the code. `data_type & 7` appears 11 times in the code :\
Can we use `parquet::Type` directly instead of `uint16_t`?
cudf/cpp/src/io/parquet/parquet_gpu.hpp
Lines 266 to 267 in 41f0caf
uint16_t data_type;  // basic column data type, ((type_length << 3) |
                     // parquet::Type)
(for some reason) data_type holds the type in the last three bits and length in the rest of the number. Using Type here would make this data member even more misleading.
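For readers following the packing discussion, here is a minimal sketch of the accessor idea suggested above. It is not the cuDF implementation: `PackedColumnType` and `make` are hypothetical names standing in for `ColumnChunkDesc`, and only the `(type_length << 3) | parquet::Type` layout quoted from `parquet_gpu.hpp` is taken from the thread. The enum values follow the physical type numbering in the Parquet format specification.

```cpp
#include <cstdint>

// Physical types, numbered as in the Parquet format specification.
enum class Type : std::uint8_t {
  BOOLEAN = 0, INT32 = 1, INT64 = 2, INT96 = 3,
  FLOAT = 4, DOUBLE = 5, BYTE_ARRAY = 6, FIXED_LEN_BYTE_ARRAY = 7
};

// Hypothetical stand-in for ColumnChunkDesc; only the packed field is modelled.
struct PackedColumnType {
  std::uint16_t packed;  // (type_length << 3) | Type, as in the quoted comment

  // Hypothetical factory that builds the packed value from its two parts.
  static constexpr PackedColumnType make(Type t, std::uint16_t type_length) {
    return PackedColumnType{static_cast<std::uint16_t>(
      (type_length << 3) | static_cast<std::uint16_t>(t))};
  }

  // The low three bits hold the physical type ...
  constexpr Type data_type() const { return static_cast<Type>(packed & 7); }
  // ... and the remaining high bits hold the type length.
  constexpr std::uint16_t type_len() const {
    return static_cast<std::uint16_t>(packed >> 3);
  }
};

// A 4-byte FIXED_LEN_BYTE_ARRAY column round-trips through the accessors.
constexpr PackedColumnType flba4 = PackedColumnType::make(Type::FIXED_LEN_BYTE_ARRAY, 4);
static_assert(flba4.data_type() == Type::FIXED_LEN_BYTE_ARRAY, "type lives in the low 3 bits");
static_assert(flba4.type_len() == 4, "length lives in the high bits");
```

With accessors like these, call sites could compare `desc.data_type()` against a named type instead of repeating `data_type & 7`.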
So will this take
Right, similar to
This work is based on the assumption that users may switch between string and binary as the eventual output of
Valid concern, I didn't think about this.
python/cudf/cudf/tests/data/parquet/fixed_len_byte_array.parquet
…f into parquet-fixed-len-binary
Small change, but tricky! Cool stuff 👍
Partial review, will continue tomorrow.
@@ -41,7 +41,7 @@ inline __device__ void gpuOutputString(volatile page_state_s* s,
                                        void* dstv)
 {
   auto [ptr, len] = gpuGetStringData(s, sb, src_pos);
-  if (s->dtype_len == 4) {
+  if (s->dtype_len == 4 and (s->col.data_type & 7) == BYTE_ARRAY) {
not really related to this PR, but we have this `dtype_len == 4` condition as a stand-in for "output hash" in a few places and it really requires detailed knowledge of the code to understand. Nothing actionable, I just don't like it :D
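One way to make that intent visible would be a named predicate. The sketch below is an illustration only, not code from the PR: `outputs_string_hash` is a hypothetical helper, and the type constants use the Parquet specification's numbering rather than cuDF's own definitions. It encodes the condition from the diff, where a 4-byte output type means "hash" only for variable-length `BYTE_ARRAY` data.

```cpp
#include <cstdint>

// Parquet physical type numbers (per the format specification).
constexpr std::uint16_t BYTE_ARRAY           = 6;
constexpr std::uint16_t FIXED_LEN_BYTE_ARRAY = 7;

// Hypothetical helper: names the "dtype_len == 4 means output a hash" convention.
// packed_data_type uses the (type_length << 3) | type layout discussed in the thread.
constexpr bool outputs_string_hash(std::uint32_t dtype_len, std::uint16_t packed_data_type)
{
  return dtype_len == 4 && (packed_data_type & 7) == BYTE_ARRAY;
}

// A variable-length byte array with a 4-byte output type takes the hash path.
static_assert(outputs_string_hash(4, BYTE_ARRAY), "BYTE_ARRAY with 4-byte output");
// A 4-byte FIXED_LEN_BYTE_ARRAY must not be mistaken for a hash request.
static_assert(!outputs_string_hash(4, (4 << 3) | FIXED_LEN_BYTE_ARRAY), "fixed-length data");
```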
looks good, just a small suggestion
Co-authored-by: Vukasin Milovanovic <[email protected]>
Reran my tests and they still all pass
Co-authored-by: GALI PREM SAGAR <[email protected]>
/merge
#13437 added the ability to consume FIXED_LEN_BYTE_ARRAY encoded data and represent it as lists of `UINT8`. When trying to write this data back to Parquet there are two problems: 1) the notion of fixed length is lost, and 2) the `UINT8` data is written as a list of `INT32`, which can quadruple the storage required. This PR addresses both issues by adding fields to the input and output metadata to allow for preserving the form of the original data.
Authors:
- Ed Seidl (https://github.com/etseidl)
- Vukasin Milovanovic (https://github.com/vuule)
Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- Muhammad Haseeb (https://github.com/mhaseeb123)
URL: #15600
Description
Closes #12590
This PR adds support for reading FIXED_LEN_BYTE_ARRAY as lists of INT8 in the parquet reader.
Checklist