-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix and disable encoding for nanosecond statistics in ORC writer #14367
Fix and disable encoding for nanosecond statistics in ORC writer #14367
Conversation
…bug-write_orc-nano-stats
…bug-write_orc-nano-stats
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of requests for comments, otherwise the change looks fine insofar as it implements exactly what the description promises. Based on the discussion in #14325 it looks like this fix also stops the crash, which is the primary goal for 23.12
cpp/src/io/orc/orc.hpp
Outdated
static constexpr int32_t DEFAULT_MIN_NANOS = 0; | ||
static constexpr int32_t DEFAULT_MAX_NANOS = 999999; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we document why these are the min and max values? These aren't powers of 2 so I assume it's not integer size based, is it limited by some spec?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we upgrade the type of these constants to uint32_t
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These do come from specs.
Nanosecond timestamp statistics are stored at millisecond precision + the leftover nanoseconds. This is why the max is 999999, one more and you have a full millisecond.
I can derive the max from chrono
. Let me know.
Should we use unsigned here? AFAIK we should avoid any uints for any arithmetic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine using the signed ints, and I don't feel the need to use chrono. Just document that we store as ms + ns so 1e6 ns is the largest you'll ever need.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But will we have issue with sign vs unsign comparison? No?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only place these are used is to compare against min_ns_remainder
/max_ns_remainder
, and these are signed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great.
cpp/src/io/orc/orc.cpp
Outdated
if (s.minimum_nanos.has_value()) { --s.minimum_nanos.value(); } | ||
if (s.maximum_nanos.has_value()) { --s.maximum_nanos.value(); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are using unsigned int, will this underflow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The nanoseconds are stored as value+1
, so zero is not a valid content here. We found out the hard way about this :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment to clarify that please? So we won't forget about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, approved but forgot about this 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no worries, I haven't ;)
…bug-write_orc-nano-stats
/merge |
Description
Issue #14325
Use uint when reading/writing nano stats because nanoseconds have int32 encoding (different from both unit32 and sint32, obviously), which does not use zigzag.
sint32 uses zigzag, and unit32 does not allow negative numbers, so we can use uint since we'll never have negative nanoseconds.
Also disabled the nanoseconds because it should only be written after ORC-135; we don't write the version so readers get confused if nanoseconds are there. Planning to re-enable once we start writing the version.
Checklist