Fix and disable encoding for nanosecond statistics in ORC writer #14367

vuule · 2023-11-07T01:20:44Z

Description

Use uint when reading/writing nano stats. In the ORC protobuf schema the nanoseconds fields are int32, which is a different encoding from both uint32 and sint32 and does not use zigzag.
sint32 uses zigzag and uint32 cannot represent negative numbers; since we'll never have negative nanoseconds, uint gives us the same bytes as int32 for every value we write.
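
To illustrate the encoding difference described above, here is a minimal sketch of the three protobuf wire encodings. The helper names (`encode_varint`, `zigzag32`, `encode_int32`, and so on) are made up for this example and are not the cuDF protobuf writer API. The sketch only shows that int32 and uint32 yield identical bytes for non-negative values, while sint32 does not.

```cpp
#include <cstdint>
#include <vector>

// Protobuf base-128 varint: 7 payload bits per byte, continuation bit (0x80)
// set on every byte except the last.
std::vector<uint8_t> encode_varint(uint64_t v)
{
  std::vector<uint8_t> out;
  while (v >= 0x80) {
    out.push_back(static_cast<uint8_t>((v & 0x7F) | 0x80));
    v >>= 7;
  }
  out.push_back(static_cast<uint8_t>(v));
  return out;
}

// Zigzag maps 0, -1, 1, -2, ... to 0, 1, 2, 3, ... so small negatives stay short.
// Assumes arithmetic right shift of negative values (guaranteed since C++20).
uint32_t zigzag32(int32_t v)
{
  return (static_cast<uint32_t>(v) << 1) ^ static_cast<uint32_t>(v >> 31);
}

// uint32: plain varint.
std::vector<uint8_t> encode_uint32(uint32_t v) { return encode_varint(v); }

// int32: no zigzag; the value is sign-extended to 64 bits before the varint step,
// so negative values always occupy 10 bytes.
std::vector<uint8_t> encode_int32(int32_t v)
{
  return encode_varint(static_cast<uint64_t>(static_cast<int64_t>(v)));
}

// sint32: zigzag first, then varint.
std::vector<uint8_t> encode_sint32(int32_t v) { return encode_varint(zigzag32(v)); }
```

For a non-negative value such as 1000, `encode_int32` and `encode_uint32` agree byte for byte, whereas `encode_sint32` encodes 2000 instead. A reader expecting int32 would therefore see doubled values if the writer used sint32.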

Also disabled writing the nanosecond statistics, because they should only be written after ORC-135; we don't write the version, so readers get confused if nanoseconds are present. Planning to re-enable once we start writing the version.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

vyasr

A couple of requests for comments; otherwise the change looks fine insofar as it implements exactly what the description promises. Based on the discussion in #14325, it looks like this fix also stops the crash, which is the primary goal for 23.12.

vyasr · 2023-11-14T23:15:22Z

cpp/src/io/orc/orc.hpp

+static constexpr int32_t DEFAULT_MIN_NANOS = 0;
+static constexpr int32_t DEFAULT_MAX_NANOS = 999999;


Could we document why these are the min and max values? These aren't powers of 2, so I assume they're not based on integer size. Are they limited by some spec?

Why don't we upgrade the type of these constants to uint32_t?

These do come from specs.
Nanosecond timestamp statistics are stored at millisecond precision plus the leftover nanoseconds. This is why the max is 999999; one more and you have a full millisecond.

I can derive the max from chrono. Let me know.

Should we use unsigned here? AFAIK we should avoid any uints for any arithmetic.

I'm fine using the signed ints, and I don't feel the need to use chrono. Just document that we store as ms + ns, so the ns part always stays below 1e6.

But won't we have an issue with signed vs. unsigned comparison?

The only place these are used is to compare against min_ns_remainder/max_ns_remainder, and these are signed.
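
As a rough illustration of the "ms + leftover ns" layout discussed in this thread, the sketch below splits a nanosecond timestamp into a millisecond part and a remainder. The names `split_ns`, `ms_and_remainder`, and `NS_PER_MS` are hypothetical, and the floor-style handling of pre-epoch (negative) timestamps is an assumption made for the example, not something confirmed by this PR.

```cpp
#include <cstdint>

constexpr int64_t NS_PER_MS = 1'000'000;

// Hypothetical holder for the two parts of a timestamp statistic.
struct ms_and_remainder {
  int64_t millis;        // whole milliseconds
  int32_t ns_remainder;  // leftover nanoseconds, always in [0, 999999]
};

// Split a nanosecond count into milliseconds plus leftover nanoseconds.
// Assumption: round toward negative infinity so the remainder is never negative.
constexpr ms_and_remainder split_ns(int64_t ns)
{
  int64_t ms  = ns / NS_PER_MS;
  int64_t rem = ns % NS_PER_MS;
  if (rem < 0) {
    rem += NS_PER_MS;
    --ms;
  }
  return {ms, static_cast<int32_t>(rem)};
}
```

With this split the remainder can never reach 1000000, which matches the `DEFAULT_MIN_NANOS`/`DEFAULT_MAX_NANOS` bounds of 0 and 999999 and fits comfortably in a signed 32-bit integer.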

cpp/src/io/orc/stats_enc.cu

ttnghia · 2023-11-14T23:33:55Z

cpp/src/io/orc/orc.cpp

+  if (s.minimum_nanos.has_value()) { --s.minimum_nanos.value(); }
+  if (s.maximum_nanos.has_value()) { --s.maximum_nanos.value(); }


Since we are using unsigned int, will this underflow?

The nanoseconds are stored as value+1, so zero is not a valid stored value here. We found out about this the hard way :D

Can you add a comment to clarify that please? So we won't forget about it.

Oh, approved but forgot about this 😄

no worries, I haven't ;)
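
For readers following the underflow question, here is a small sketch of the value+1 convention described in this thread. The struct `timestamp_stats_sketch` and the function names are hypothetical stand-ins, not the cuDF types. Treating a stored zero as "absent" is an assumption chosen for the example; the thread only establishes that zero is not a valid stored value.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the parsed timestamp statistics.
struct timestamp_stats_sketch {
  std::optional<uint32_t> minimum_nanos;
  std::optional<uint32_t> maximum_nanos;
};

// Writer side: the file stores the nanosecond remainder plus one.
constexpr uint32_t to_stored_nanos(uint32_t ns_remainder) { return ns_remainder + 1; }

// Reader side for a single field: undo the +1 offset.
// Assumption: a stored zero is invalid, so drop the value rather than
// decrementing an unsigned zero and wrapping around to UINT32_MAX.
void decode_stored_nanos(std::optional<uint32_t>& stored)
{
  if (!stored.has_value()) { return; }
  if (*stored == 0) {
    stored.reset();
    return;
  }
  --*stored;
}

void decode_stored_nanos(timestamp_stats_sketch& s)
{
  decode_stored_nanos(s.minimum_nanos);
  decode_stored_nanos(s.maximum_nanos);
}
```

Under this convention a stored range of 1 to 1000000 decodes to remainders of 0 to 999999, so a plain decrement is safe for any well-formed file.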

vuule · 2023-11-15T07:30:48Z

/merge

use uint to avoid zigzag

d14b18d

vuule added the bug, cuIO, and non-breaking labels Nov 7, 2023

vuule self-assigned this Nov 7, 2023

Merge branch 'branch-23.12' into bug-write_orc-nano-stats

ee84c45

github-actions bot added the libcudf label Nov 7, 2023

vuule mentioned this pull request Nov 7, 2023

[BUG] ORC writer produce wrong timestamp metrics which causes spark not to do predicate push down #14325

Closed

vuule added 6 commits November 7, 2023 11:23

off by one

a0741a4

Merge branch 'bug-write_orc-nano-stats' of https://github.com/vuule/cudf into bug-write_orc-nano-stats

d84f73a

Merge branch 'branch-23.12' of https://github.com/rapidsai/cudf into bug-write_orc-nano-stats

761c4ad

disable nanoseconds

14b1e92

adjust size as well

166dee7

Merge branch 'branch-23.12' of https://github.com/rapidsai/cudf into bug-write_orc-nano-stats

b3e3279

vuule changed the title ~~Use correct encoding for nanosecond statistics in ORC writer~~ Fix and disable encoding for nanosecond statistics in ORC writer Nov 8, 2023

vuule marked this pull request as ready for review November 9, 2023 22:32

vuule requested a review from a team as a code owner November 9, 2023 22:32

vuule requested review from robertmaynard and hyperbolic2346 November 9, 2023 22:32

GregoryKimball requested review from ttnghia and shrshi November 14, 2023 19:05

vyasr approved these changes Nov 14, 2023

ttnghia reviewed Nov 14, 2023

Merge branch 'branch-23.12' of https://github.com/rapidsai/cudf into bug-write_orc-nano-stats

051e450

ttnghia approved these changes Nov 15, 2023

comments and a few checks

d1b0345

vuule added the 5 - Ready to Merge label Nov 15, 2023

rapids-bot bot merged commit ab2248e into rapidsai:branch-23.12 Nov 15, 2023
65 checks passed

vuule deleted the bug-write_orc-nano-stats branch November 15, 2023 07:31
