-
Notifications
You must be signed in to change notification settings - Fork 920
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Expand statistics support in ORC writer #13848
Conversation
// ORC stats | ||
uint64_t numberOfValues; | ||
uint8_t hasNull; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was unused
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One query about the python test, but approving python changes nonetheless
@@ -633,16 +633,19 @@ def test_orc_write_statistics(tmpdir, datadir, nrows, stats_freq): | |||
for col in gdf: | |||
if "minimum" in file_stats[0][col]: | |||
stats_min = file_stats[0][col]["minimum"] | |||
actual_min = gdf[col].min() | |||
assert normalized_equals(actual_min, stats_min) | |||
if stats_min is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Under what circumstances does read_orc_statistics
now return None
in these slots when it didn't before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great question!
The change in behavior is when a column only contains nulls. Previously we did not return any statistics for such column so we would not perform this comparison in the test. This PR changes the behavior so that statistics containing only the sum are included if there are not valid elements. So now the test need to correctly check "partial" statistics, i.e. min and max are not present, but sum is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I can see it looks correct, but I am not an ORC stats expert yet.
stats_len = pb_fldlen_common + pb_fld_hdrlen + 3 * (pb_fld_hdrlen + pb_fldlen_int64); | ||
break; | ||
case dtype_date32: | ||
stats_len = pb_fldlen_common + pb_fld_hdrlen + 2 * (pb_fld_hdrlen + pb_fldlen_int64); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
date statistics don't have the sum, used to be wrongly grouped with ints
/ok to test |
…fea-expand-stats
… fea-expand-stats
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything seems solid, but I don't have much background on this code. I left a few questions but they're mostly for my benefit.
@@ -1858,8 +1858,8 @@ __device__ std::pair<void const*, uint32_t> get_extremum(statistics_val const* s | |||
} | |||
case dtype_int64: | |||
case dtype_timestamp64: | |||
case dtype_float64: | |||
case dtype_decimal64: return {stats_val, sizeof(int64_t)}; | |||
case dtype_float64: return {stats_val, sizeof(int64_t)}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So before decimal64 was being treated like int/float, and now it's instead being treated like decimal128?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. It used to use int64, now it uses int128.
the specs specify (as they do) that sum of int columns should be left out if it overflows int64_t. So we use this type and check for possible overflow.
Decimal sum does not have this limitation (saved as string, right) so we want to use the largest possible type.
Now that I'm going over this again, we could have kept the dec64 min/max at int64 and only used int128 for the sum. I don't think this detail is impactful and I like the consistency between min/max and sum for decimal types with the currently used types.
@@ -125,7 +125,7 @@ class extrema_type { | |||
|
|||
using non_arithmetic_extrema_type = typename std::conditional_t< | |||
cudf::is_fixed_point<T>() or cudf::is_duration<T>() or cudf::is_timestamp<T>(), | |||
typename std::conditional_t<std::is_same_v<T, numeric::decimal128>, __int128_t, int64_t>, | |||
typename std::conditional_t<cudf::is_fixed_point<T>(), __int128_t, int64_t>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this is related to the change in page_enc.cu where decimal64 is now treated the same as decimal128? Do we need any logic to cast down results for 64 bit decimals vs 128?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, using int128 for all decimal types.
We cast elements up to int128 and never cast down, AFAICT. The resulting number is converted to a string. Let me know if I misunderstood the question.
auto const sum_size = fixed_point_string_size(chunks[idx].sum.d128_val, scale); | ||
// common + total field length + encoded string lengths + strings | ||
stats_len = pb_fldlen_common + pb_fld_hdrlen32 + 3 * (pb_fld_hdrlen + pb_fld_hdrlen32) + | ||
min_size + max_size + sum_size; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the new decimal statistics are min, max, and sum, and now we're reserving sufficient new space for them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the previous computation of stats_len
was basically unused since we did not write the stats we left space for.
The case dtype_string:
case below has the similar logic as it also stores strings. The difference is that the string lengths are known from the column, and the sum is a number, not a string.
@@ -186,6 +205,15 @@ __device__ inline uint8_t* pb_put_fixed64(uint8_t* p, uint32_t id, void const* r | |||
return p + 9; | |||
} | |||
|
|||
// Splits a nanosecond timestamp into milliseconds and nanoseconds | |||
__device__ std::pair<int64_t, int32_t> split_nanosecond_timestamp(int64_t nano_count) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are the milliseconds encoded as 64 bit while nanoseconds are 32 bit? Is it because >1e6 ns adds to the ms, whereas the ms can grow unbounded?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's correct. The nanoseconds part is in [0, 999999] range.
I'm open to suggestions to improve naming/comment here, I also wasn't 100% happy with clarity.
/ok to test |
/ok to test |
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Nice work! closing 3 issues with 1 PR!
very minor nitpicks.
/ok to test |
/merge |
Description
Closes #7087, closes #13793, closes #13899
This PR adds support for several cases and statistics types:
Added tests for all supported stats.
Checklist