Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix and disable encoding for nanosecond statistics in ORC writer #14367

Merged
merged 10 commits into from
Nov 15, 2023

Conversation

vuule
Copy link
Contributor

@vuule vuule commented Nov 7, 2023

Description

Issue #14325

Use uint when reading/writing nano stats because nanoseconds have int32 encoding (different from both unit32 and sint32, obviously), which does not use zigzag.
sint32 uses zigzag, and unit32 does not allow negative numbers, so we can use uint since we'll never have negative nanoseconds.

Also disabled the nanoseconds because it should only be written after ORC-135; we don't write the version so readers get confused if nanoseconds are there. Planning to re-enable once we start writing the version.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@vuule vuule added bug Something isn't working cuIO cuIO issue non-breaking Non-breaking change labels Nov 7, 2023
@vuule vuule self-assigned this Nov 7, 2023
@vuule vuule changed the title Use correct encoding for nanosecond statistics in ORC writer Fix and disable encoding for nanosecond statistics in ORC writer Nov 8, 2023
@vuule vuule marked this pull request as ready for review November 9, 2023 22:32
@vuule vuule requested a review from a team as a code owner November 9, 2023 22:32
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of requests for comments, otherwise the change looks fine insofar as it implements exactly what the description promises. Based on the discussion in #14325 it looks like this fix also stops the crash, which is the primary goal for 23.12

Comment on lines 44 to 45
static constexpr int32_t DEFAULT_MIN_NANOS = 0;
static constexpr int32_t DEFAULT_MAX_NANOS = 999999;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we document why these are the min and max values? These aren't powers of 2 so I assume it's not integer size based, is it limited by some spec?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we upgrade the type of these constants to uint32_t?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These do come from specs.
Nanosecond timestamp statistics are stored at millisecond precision + the leftover nanoseconds. This is why the max is 999999, one more and you have a full millisecond.

I can derive the max from chrono. Let me know.

Should we use unsigned here? AFAIK we should avoid any uints for any arithmetic.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine using the signed ints, and I don't feel the need to use chrono. Just document that we store as ms + ns so 1e6 ns is the largest you'll ever need.

Copy link
Contributor

@ttnghia ttnghia Nov 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But will we have issue with sign vs unsign comparison? No?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only place these are used is to compare against min_ns_remainder/max_ns_remainder, and these are signed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great.

cpp/src/io/orc/stats_enc.cu Show resolved Hide resolved
Comment on lines 186 to 187
if (s.minimum_nanos.has_value()) { --s.minimum_nanos.value(); }
if (s.maximum_nanos.has_value()) { --s.maximum_nanos.value(); }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are using unsigned int, will this underflow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The nanoseconds are stored as value+1, so zero is not a valid content here. We found out the hard way about this :D

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment to clarify that please? So we won't forget about it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, approved but forgot about this 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no worries, I haven't ;)

@vuule vuule added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Nov 15, 2023
@vuule
Copy link
Contributor Author

vuule commented Nov 15, 2023

/merge

@rapids-bot rapids-bot bot merged commit ab2248e into rapidsai:branch-23.12 Nov 15, 2023
65 checks passed
@vuule vuule deleted the bug-write_orc-nano-stats branch November 15, 2023 07:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants