GH-37939: [C++] Use signed arithmetic for frame of reference when encoding DELTA_BINARY_PACKED #37940
Conversation
This seems like a good idea!
cpp/src/parquet/encoding.cc
Outdated
```cpp
 private:
  constexpr T SafeSubtract(T a, T b) const {
    constexpr WT mask = static_cast<WT>(static_cast<UT>(static_cast<T>(-1)));
    return static_cast<T>((static_cast<WT>(a) - static_cast<WT>(b)) & mask);
  }
```
Would this maybe better fit in int_util_overflow.h?
Yes, that sounds like a better location. Thanks!
Edit: since it needs `::arrow::internal::int128_t`, I don't think it can go in int_util_overflow.h :(
Generally I think we don't need an extra `SafeSignedSubtractSigned` for values that were originally unsigned. But this patch is great. I'll check out the parquet-mr and arrow-rs implementations here.
cpp/src/parquet/encoding.cc
Outdated
```diff
  // making subtraction operations well-defined and correct even in case of overflow.
  // Encoded integers will wrap back around on decoding.
  // See http://en.wikipedia.org/wiki/Modular_arithmetic#Integers_modulo_n
- deltas_[values_current_block_] = value - current_value_;
+ deltas_[values_current_block_] = SafeSignedSubtractSigned(value, current_value_);
```
Would a `static_cast<T>(value - current_value_)` be ok here? Originally, `value` and `current_value_` were unsigned, so doing the op in unsigned would not require a wider type?
Yes, you're right, we don't need a new safe subtract. The existing `SafeSignedSubtract` does the subtraction in unsigned to get the correct wrapping behavior, and then casts back to signed. No need for promotion to the larger type. I've changed to `SafeSignedSubtract` everywhere.
cpp/src/parquet/encoding.cc
Outdated
```diff
- const auto bit_width = bit_width_data[i] =
-     bit_util::NumRequiredBits(max_delta - min_delta);
+ const auto bit_width = bit_width_data[i] = bit_util::NumRequiredBits(
+     static_cast<UT>(SafeSignedSubtractSigned(max_delta, min_delta)));
```
Actually I don't fully understand here. If the sequence is `{1, 0, -1, 0, 1, 0, -1, 0, 1}`, previously the value grows large because of `max_delta - min_delta`: the `max_delta` should be 1, however it turns out to be a huge unsigned int, and the `min_delta` is 1. The result is correct, but doesn't it waste a huge amount of space?
Yes, you are correct. Leaving everything unsigned, we have `(unsigned)(0 - 1) = 0xffffffffU`, so `min_delta == 1`, `max_delta == 0xffffffffU`, and `max_delta - min_delta == 0xfffffffeU`, which requires the full 32 bits to encode.
The end result is correct, but uses more space than it needs to.
Seems CI failed because of some weird problem, would you mind retrying to trigger these CI jobs?
Oh, after checking arrow-rs ( https://github.com/apache/arrow-rs/blob/471f6dd2911d8328ca56efe2f685e08c0a3fb8c8/parquet/src/encodings/encoding/mod.rs#L275 ) and parquet-mr ( See
cpp/src/parquet/encoding.cc
Outdated
```diff
- const auto bit_width = bit_width_data[i] =
-     bit_util::NumRequiredBits(max_delta - min_delta);
+ const auto bit_width = bit_width_data[i] = bit_util::NumRequiredBits(
+     static_cast<UT>(SafeSignedSubtract(max_delta, min_delta)));
```
nit: We can add a `DCHECK` for `SafeSignedSubtract(max_delta, min_delta) >= 0`
I don't think we would want that. It's ok for the subtraction to wrap around and become negative, we just cast it back to unsigned. On the decoding side it will wrap around again and yield the correct result. The only reason to use `SafeSignedSubtract` is to get well-defined wrapping behavior, which signed subtraction does not give us.
Yeah you're right, `{0, 0, INT32_MIN, INT32_MAX}` might get `SafeSignedSubtract(max_delta, min_delta) < 0`.
```diff
@@ -1413,6 +1413,68 @@ TEST_F(TestLargeStringParquetIO, Basics) {
   this->RoundTripSingleColumn(large_array, large_array, arrow_properties);
 }
 
+using TestDeltaBinaryPacked32ParquetIO = TestParquetIO<::arrow::Int32Type>;
+
+TEST_F(TestDeltaBinaryPacked32ParquetIO, DeltaBinaryPacked) {
```
The unit tests look good to me, but can we move them to encoding_test.cc?
Sure. Since the tests are identical except for the typing, is it worth the effort to make this a typed test suite, but for just two types?
Oh, I see there is one in encoding_test already. Thanks for pointing me there @mapleFU
(This looks ok to me now, but maybe it's better to rebase, since lots of flaky unit tests have already been fixed)
Thanks a lot for noticing and fixing this! LGTM on the principle, here are a few more comments.
```cpp
  ASSERT_EQ(num_values, values_decoded);
  ASSERT_NO_FATAL_FAILURE(
      VerifyResults<T>(decoded.data(), int_values.data(), num_values));
}
```
Since this PR also fixes the encoded data size, can you add a test for that? For example check that the encoded buffer size is equal to a certain value given data that would have triggered the bug, such as the data in #37939 (comment)
cpp/src/parquet/encoding.cc
Outdated
```cpp
const auto bit_width = bit_width_data[i] = bit_util::NumRequiredBits(
    static_cast<UT>(SafeSignedSubtract(max_delta, min_delta)));
```
Note that `SafeSignedSubtract` simply does the subtraction in the unsigned domain, so you could also write `static_cast<UT>(max_delta) - static_cast<UT>(min_delta)`.
True. Nice catch
```diff
@@ -1634,6 +1634,41 @@ TYPED_TEST(TestDeltaBitPackEncoding, NonZeroPaddedMiniblockBitWidth) {
   }
 }
 
+TYPED_TEST(TestDeltaBitPackEncoding, DeltaBitPackedWrapping) {
+  using T = typename TypeParam::c_type;
```
Can you add a comment referring to the GH issue?
@etseidl Did you try to read data generated by this patch using other Parquet implementations such as parquet-mr?
The data I posted in the issue was readable by arrow-rs and parquet-mr, but I can do a bigger test of that with more varied data.
No need to. Thanks for the answer!
Yes, please rebase and update the description to match the contents!
I'm interested in the performance this can bring. I verified the decoding benchmark on macOS M1 Pro with RelWithDebInfo and O2. Encoding doesn't change a lot; for decoding, I guess previously we tended to use 32 bits and 64 bits, which wastes space but benefits decoding, so the smaller sizes make decoding a bit slower. However, I think we should merge this patch.

Encode
Before:
After:

Decode
Before:
After:
Force-pushed from b981be6 to 19b304b (Compare)
I've changed the PR description and will merge if CI passes.
The compression ratio in the DELTA_BYTE_ARRAY benchmarks is now much better as well.
Yeah, but I guess you mean
There are 2 DELTA_BINARY_PACKED streams in DELTA_BYTE_ARRAY, so if the deltas are small and varying +/-, this could still be a big benefit. In fact, I discovered this problem while implementing a DELTA_LENGTH_BYTE_ARRAY decoder. :)
Thanks a lot for this @etseidl. It was embarrassing not to get any space-saving benefits from the encoding...
Oops, I forgot that a bit. It should have a length. So this can benefit several encodings.
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 5514b22. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 40 possible false positives for unstable benchmarks that are known to sometimes produce them.
lol, I tried this on my own dataset; the page size after encoding becomes smaller, but the size after compression even grows to twice what it was before 😂
Oh no 😮
I generated data using the code below, and used zstd default to compress it, with 10000 values.
The plain size is 40000 bytes; after compression it's about 15000 bytes. The delta bit packed data with the previous code is a bit greater than 40000 bytes, but after compression it's 10000 bytes. And after the change, the size before compression is 28000 bytes and after compression 27000 bytes. However I think the patch is great, at least we fix a bad problem here...🤔
@mapleFU Interesting, thank you. I think this shows that the DELTA encodings should be used with care, only if the data is very well-suited to them (for example integers with a small range of values). Otherwise, generic and fast compressors such as Lz4 and Zstd are probably a better choice.
Sorry for the late reply due to a long holiday vacation. IIRC, DELTA_BINARY_PACKED encoding was inspired by FastPFor. However, FastPFor was designed mainly for positive integers. DELTA_BINARY_PACKED encoding does not even follow what FastPFor does for exception numbers (like the negative number exemplified by @mapleFU).
Closes #37939.
### What changes are included in this PR?

This PR changes values used in the `DELTA_BINARY_PACKED` encoder to signed types. To gracefully handle overflow, arithmetic is still performed in the unsigned domain, but other operations such as computing the min and max deltas are done in the signed domain. Using signed types ensures the optimal number of bits is used when encoding the deltas, which was not the case before if any negative deltas were encountered (which is obviously common).

### Are these changes tested?

I've included two tests that result in overflow.

### Are there any user-facing changes?

No