Expand statistics support in ORC writer #13848

Merged
merged 48 commits on Sep 18, 2023

Conversation

@vuule (Contributor) commented Aug 10, 2023

Description

Closes #7087, closes #13793, closes #13899

This PR adds support for several cases and statistics types:

  • sum statistics are included even when all elements are null (no minmax);
  • sum statistics are included in double stats;
  • minimum/maximum and minimumNanos/maximumNanos are included in timestamp stats;
  • the hasNull field is written for all columns;
  • decimal statistics (minimum, maximum, sum) are now written.

Added tests for all supported stats.
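
To see these in practice, a minimal sketch using the cuDF Python APIs exercised by this PR's tests (DataFrame.to_orc with statistics="ROWGROUP" and cudf.io.orc.read_orc_statistics); the file name is illustrative:

    import cudf

    # An all-null column: with this PR it gets sum-only statistics plus the
    # hasNull flag; the timestamp column gets min/max and min/max nanos stats.
    df = cudf.DataFrame(
        {
            "nulls": cudf.Series([None, None, None], dtype="int64"),
            "ts": cudf.Series([1, 2, 3], dtype="datetime64[ns]"),
        }
    )
    df.to_orc("example.orc", statistics="ROWGROUP")

    file_stats, stripe_stats = cudf.io.orc.read_orc_statistics(["example.orc"])
    print(file_stats[0]["nulls"])  # expect a sum but no minimum/maximum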

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions bot added the libcudf label Aug 10, 2023
@vuule self-assigned this Aug 10, 2023
@vuule added the feature request, cuIO, and non-breaking labels Aug 10, 2023
@github-actions bot added the Python label Aug 14, 2023
Comment on lines -129 to -131
// ORC stats
uint64_t numberOfValues;
uint8_t hasNull;
@vuule (Contributor, Author)

was unused

@vuule marked this pull request as ready for review August 15, 2023 00:14
@vuule requested review from a team as code owners August 15, 2023 00:14
@wence- (Contributor) left a comment

One query about the Python test, but approving the Python changes nonetheless.

@@ -633,16 +633,19 @@ def test_orc_write_statistics(tmpdir, datadir, nrows, stats_freq):
     for col in gdf:
         if "minimum" in file_stats[0][col]:
             stats_min = file_stats[0][col]["minimum"]
-            actual_min = gdf[col].min()
-            assert normalized_equals(actual_min, stats_min)
+            if stats_min is not None:
@wence- (Contributor)
Under what circumstances does read_orc_statistics now return None in these slots when it didn't before?

@vuule (Contributor, Author)

Great question!
The change in behavior is for columns that contain only nulls. Previously we did not return any statistics for such a column, so we would not perform this comparison in the test. This PR changes the behavior so that statistics containing only the sum are included when there are no valid elements. So now the test needs to correctly check "partial" statistics, i.e. min and max are not present, but sum is.

@revans2 (Contributor) left a comment

From what I can see it looks correct, but I am not an ORC stats expert yet.

      stats_len = pb_fldlen_common + pb_fld_hdrlen + 3 * (pb_fld_hdrlen + pb_fldlen_int64);
      break;
    case dtype_date32:
      stats_len = pb_fldlen_common + pb_fld_hdrlen + 2 * (pb_fld_hdrlen + pb_fldlen_int64);
@vuule (Contributor, Author)

Date statistics don't have a sum; they used to be wrongly grouped with the integer types.

@vuule (Contributor, Author) commented Sep 1, 2023

/ok to test

@vuule (Contributor, Author) commented Sep 5, 2023

/ok to test

@vuule marked this pull request as ready for review September 5, 2023 18:46
@vyasr (Contributor) left a comment

Everything seems solid, but I don't have much background on this code. I left a few questions but they're mostly for my benefit.

@@ -1858,8 +1858,8 @@ __device__ std::pair<void const*, uint32_t> get_extremum(statistics_val const* s
     }
     case dtype_int64:
     case dtype_timestamp64:
-    case dtype_float64:
-    case dtype_decimal64: return {stats_val, sizeof(int64_t)};
+    case dtype_float64: return {stats_val, sizeof(int64_t)};
@vyasr (Contributor)

So before decimal64 was being treated like int/float, and now it's instead being treated like decimal128?

@vuule (Contributor, Author)

Yes. It used to use int64, now it uses int128.
The specs specify (as they do) that the sum of int columns should be left out if it overflows int64_t, so we use this type and check for possible overflow.
The decimal sum does not have this limitation (it's saved as a string), so we want to use the largest possible type.
Now that I'm going over this again, we could have kept the dec64 min/max at int64 and only used int128 for the sum. I don't think this detail is impactful, and I like the consistency between min/max and sum for decimal types with the currently used types.

@@ -125,7 +125,7 @@ class extrema_type {
 
   using non_arithmetic_extrema_type = typename std::conditional_t<
     cudf::is_fixed_point<T>() or cudf::is_duration<T>() or cudf::is_timestamp<T>(),
-    typename std::conditional_t<std::is_same_v<T, numeric::decimal128>, __int128_t, int64_t>,
+    typename std::conditional_t<cudf::is_fixed_point<T>(), __int128_t, int64_t>,
@vyasr (Contributor)

I assume this is related to the change in page_enc.cu where decimal64 is now treated the same as decimal128? Do we need any logic to cast down results for 64 bit decimals vs 128?

@vuule (Contributor, Author)

Yup, using int128 for all decimal types.
We cast elements up to int128 and never cast down, AFAICT. The resulting number is converted to a string. Let me know if I misunderstood the question.

      auto const sum_size = fixed_point_string_size(chunks[idx].sum.d128_val, scale);
      // common + total field length + encoded string lengths + strings
      stats_len = pb_fldlen_common + pb_fld_hdrlen32 + 3 * (pb_fld_hdrlen + pb_fld_hdrlen32) +
                  min_size + max_size + sum_size;
@vyasr (Contributor)

So the new decimal statistics are min, max, and sum, and now we're reserving sufficient new space for them?

@vuule (Contributor, Author)

Yes, the previous computation of stats_len was basically unused, since we did not write the stats we left space for.
The dtype_string case below has similar logic, as it also stores strings. The difference is that there the string lengths are known from the column, and the sum is a number, not a string.

@@ -186,6 +205,15 @@ __device__ inline uint8_t* pb_put_fixed64(uint8_t* p, uint32_t id, void const* r
   return p + 9;
 }
 
+// Splits a nanosecond timestamp into milliseconds and nanoseconds
+__device__ std::pair<int64_t, int32_t> split_nanosecond_timestamp(int64_t nano_count)
@vyasr (Contributor)

Why are the milliseconds encoded as 64 bit while nanoseconds are 32 bit? Is it because >1e6 ns adds to the ms, whereas the ms can grow unbounded?

@vuule (Contributor, Author)

That's correct. The nanoseconds part is in the [0, 999999] range.
I'm open to suggestions to improve the naming/comment here; I also wasn't 100% happy with the clarity.
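
The stated invariant is easy to express with floor division (a Python sketch of the same split; the real implementation is device-side C++):

    def split_nanosecond_timestamp(nano_count: int) -> tuple[int, int]:
        # Floor division keeps the nanosecond remainder in [0, 999999],
        # including for negative (pre-epoch) timestamps.
        millis, nanos = divmod(nano_count, 1_000_000)
        return millis, nanos

    assert split_nanosecond_timestamp(1_234_567_891) == (1234, 567_891)
    assert split_nanosecond_timestamp(-1) == (-1, 999_999)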

@karthikeyann (Contributor)

/ok to test

@karthikeyann (Contributor)

/ok to test

@karthikeyann (Contributor)

/ok to test

@karthikeyann (Contributor) left a comment

Looks good to me. Nice work, closing 3 issues with 1 PR!
Very minor nitpicks.

cpp/tests/io/orc_test.cpp (outdated; resolved)
cpp/src/io/orc/stats_enc.cu (outdated; resolved)
@vuule (Contributor, Author) commented Sep 18, 2023

/ok to test

@vuule added the 5 - Ready to Merge label Sep 18, 2023
@vuule (Contributor, Author) commented Sep 18, 2023

/merge
