Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Expand statistics support in ORC writer #13848
Expand statistics support in ORC writer #13848
Changes from 33 commits
b614fe8
9a7988a
70c3b28
b392376
eb6cdae
4776140
e277645
9fbe6c5
d833ff0
1ca376a
1db56c5
950cec8
50b67d8
cb61069
96b3112
01af60b
2f35d5a
cc54019
864bdd1
fcfa662
ee1347f
d7facda
bfa0d8b
cb71541
227a9c1
23a5e14
f2f6090
64636f9
75fe574
c3b7410
e79496c
7d0e3c7
b5dbea1
7b27f56
e65310d
8c58c6f
eacb578
3afcd64
3746cb4
7808cb3
f181df2
3c0da37
3a61eee
8742633
8920b71
d027019
2136109
62862b4
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
date statistics don't have the sum, used to be wrongly grouped with ints
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the new decimal statistics are min, max, and sum, and now we're reserving sufficient new space for them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the previous computation of
stats_len
was basically unused since we did not write the stats we left space for.The
case dtype_string:
case below has the similar logic as it also stores strings. The difference is that the string lengths are known from the column, and the sum is a number, not a string.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was unused
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are the milliseconds encoded as 64 bit while nanoseconds are 32 bit? Is it because >1e6 ns adds to the ms, whereas the ms can grow unbounded?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's correct. The nanoseconds part is in [0, 999999] range.
I'm open to suggestions to improve naming/comment here, I also wasn't 100% happy with clarity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So before decimal64 was being treated like int/float, and now it's instead being treated like decimal128?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. It used to use int64, now it uses int128.
the specs specify (as they do) that sum of int columns should be left out if it overflows int64_t. So we use this type and check for possible overflow.
Decimal sum does not have this limitation (saved as string, right) so we want to use the largest possible type.
Now that I'm going over this again, we could have kept the dec64 min/max at int64 and only used int128 for the sum. I don't think this detail is impactful and I like the consistency between min/max and sum for decimal types with the currently used types.