-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add DELTA_LENGTH_BYTE_ARRAY encoder and decoder for Parquet #14590
Conversation
Unlike past PRs, in this PR I've done both the encoder and decoder. One reason is that the changes here are smaller than in previous delta work. A better reason is that by adding both, I can test the reader and writer at the same time without having to rely on python tests. That said, this PR does expand the existing python tests anyway...perhaps too much. Reviewers, please let me know if the delta tests in A note on the encoding itself. DELTA_LENGTH_BYTE_ARRAY is much like PLAIN encoding for strings. The difference is that the lengths are encoded first, and then the string data. The theory is that separating the two will lead to better compression. The other difference is that the lengths are DELTA_BINARY_PACKED encoded, so there should be a space savings. Because the string data is in one big chunk per page, the processing is much simpler than with DELTA_BYTE_ARRAY or even PLAIN. On the decode side, all we need to do is decode the length data into the offsets child column, and then do an exclusive scan on this to turn lengths into offsets. The character data can just be parallel memcpy'd straight into the chars child column. On the encoding side, it's a similar story. First encode the length data, and then memcpy the chars data into the page buffer. Of course, skip_rows complicates things some, but not too much. |
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
full pass!
Co-authored-by: Vukasin Milovanovic <[email protected]>
/ok to test |
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔥 🔥
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One small nitpick. Non-blocking.
@@ -1352,8 +1352,13 @@ def test_delta_binary(nrows, add_nulls, dtype, tmpdir): | |||
|
|||
@pytest.mark.parametrize("nrows", delta_num_rows()) | |||
@pytest.mark.parametrize("add_nulls", [True, False]) | |||
@pytest.mark.parametrize("str_encoding", ["DELTA_BYTE_ARRAY"]) | |||
def test_delta_byte_array_roundtrip(nrows, add_nulls, str_encoding, tmpdir): | |||
@pytest.mark.parametrize("string_len", [12, 48, 96, 128]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this parameter is used for max_string_length
below, would suggest keep the name consistent here.
/ok to test |
/merge |
Description
Part of #13501. This adds the ability to read and write Parquet pages with DELTA_LENGTH_BYTE_ARRAY encoding.
Checklist