-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Encode DELTA_BYTE_ARRAY in Parquet writer #14938
Conversation
/ok to test |
CC @galipremsagar for the Python API changes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
partial review
at the beginning of a page correctly
Co-authored-by: Vukasin Milovanovic <[email protected]>
Co-authored-by: Vukasin Milovanovic <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is amazing. Builds so well on the previous delta encoding work.
Flushing the few potentially relevant comments I got; will make another pass over the core encode implementation, but it's looking great so far.
@@ -201,7 +201,7 @@ class delta_binary_packer { | |||
if (is_valid) { _buffer[delta::rolling_idx(pos + _current_idx + _values_in_buffer)] = value; } | |||
__syncthreads(); | |||
|
|||
if (threadIdx.x == 0) { | |||
if (num_valid > 0 && threadIdx.x == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it be worthwhile to add a test for the fixed bug?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does nothing avoid the Eye of Sauron???? 🤣 Guess I'll whip something up 🧑🍳
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if it weren't the only change in the file maybe it could have snuck by :D
}; | ||
|
||
/* | ||
some timing results. remove when done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leaving a TODO for later in review process
cpp/src/io/parquet/page_enc.cu
Outdated
|
||
auto const type_id = s->col.leaf_column->type().id(); | ||
|
||
auto const get_string_tuple = [type_id, s](int idx) -> thrust::pair<size_type, uint8_t const*> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this return a cudf::string_view? It's basically a (char const*, size_type) pair
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
string_view
comes with UTF8 baggage...I'd be more inclined to using the byte_array_view
since that's more in line with the physical type. But then that's using std::byte
, which is just not something you'd want to iterate over. I'd be willing to return a struct that has the overlap calculation as a method...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
few more nitpicks, mostly related to naming
return {reinterpret_cast<uint8_t const*>(str.data()), str.size_bytes()}; | ||
} else if (s->col.output_as_byte_array && type_id == type_id::LIST) { | ||
auto const str = get_element<statistics::byte_array_view>(*s->col.leaf_column, idx); | ||
return {reinterpret_cast<uint8_t const*>(str.data()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does byte_array use unit8_t* instead of std::byte* or char*? I think either of the two are less UBish when it comes to reinterpret_cast
ing, compared to uint8_t*.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parquet uses unsigned bytes for BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY. We're doing ==
rather than compare
, but I'd prefer to keep the byte array representation in line with what a byte array in Parquet is.
Co-authored-by: Vukasin Milovanovic <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for addressing all suggestions. Looks great.
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think prefer_dba
is a very good name. Is this really a "preference"? Can the code choose to disregard this preference? I would prefer something like use_delta_byte_array
or enable_delta_byte_array
, since it appears that this always uses DELTA_BYTE_ARRAY
rather than DELTA_LENGTH_BYTE_ARRAY
when the option is enabled.
Also, I want to see dba
expanded to delta_byte_array
throughout the code, to make it easier to search for this option (dba
is not self-explanatory and we rarely use abbreviations). It's currently inconsistent between dba
/delta_byte_array
between internal/external APIs.
@@ -1370,6 +1381,7 @@ def test_delta_byte_array_roundtrip( | |||
nrows, add_nulls, max_string_length, str_encoding, tmpdir | |||
): | |||
null_frequency = 0.25 if add_nulls else 0 | |||
prefer_dba = str_encoding == "DELTA_BYTE_ARRAY" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer the full name everywhere in this PR. prefer_dba
-> prefer_delta_byte_array
throughout.
prefer_dba = str_encoding == "DELTA_BYTE_ARRAY" | |
prefer_delta_byte_array = str_encoding == "DELTA_BYTE_ARRAY" |
rethinking how to select encodings...closing for now |
Part of #14938 was fixing two bugs discovered during testing. One is in the encoding of DELTA_BINARY_PACKED data where the first non-null value in a page to be encoded is not in the first batch of 129 values. The second is an error in decoding of DELTA_BYTE_ARRAY pages where, again, the first non-null value is not in the first block to be decoded. This PR includes a test for the former, but the latter cannot be easily tested because the python API still lacks `skip_rows`, and we cannot generate DELTA_BYTE_ARRAY encoded data without the changes in #14938. A test for the latter will be added later, but the fix has been validated with data on hand locally. Authors: - Ed Seidl (https://github.com/etseidl) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - David Wendt (https://github.com/davidwendt) URL: #15075
Re-submission of #14938. Final (delta) piece of #13501. Adds the ability to encode Parquet pages as DELTA_BYTE_ARRAY. Python testing wlll be added as a follow-on when per-column encoding selection is added to the python API (ref this [comment](#15081 (comment))). Authors: - Ed Seidl (https://github.com/etseidl) - Vukasin Milovanovic (https://github.com/vuule) - Yunsong Wang (https://github.com/PointKernel) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - Yunsong Wang (https://github.com/PointKernel) URL: #15239
Description
The last piece of #13501. This adds the ability to encode Parquet pages as DELTA_BYTE_ARRAY.
Checklist