Expand statistics support in ORC writer #13848

Merged
merged 48 commits on Sep 18, 2023

Conversation

@vuule (Contributor) commented Aug 10, 2023

Description

Closes #7087, closes #13793, closes #13899

This PR adds support for several cases and statistics types:

  • sum statistics are included even when all elements are null (no minmax);
  • sum statistics are included in double stats;
  • minimum/maximum and minimumNanos/maximumNanos are included in timestamp stats;
  • the hasNull field is written for all columns;
  • decimal statistics (minimum, maximum, sum) are now written.

Added tests for all supported stats.
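
To see these in practice, a minimal sketch using the cuDF Python APIs exercised by this PR's tests (DataFrame.to_orc with statistics="ROWGROUP" and cudf.io.orc.read_orc_statistics); the file name is illustrative:

    import cudf

    # An all-null column: with this PR it gets sum-only statistics plus the
    # hasNull flag; the timestamp column gets min/max and min/max nanos stats.
    df = cudf.DataFrame(
        {
            "nulls": cudf.Series([None, None, None], dtype="int64"),
            "ts": cudf.Series([1, 2, 3], dtype="datetime64[ns]"),
        }
    )
    df.to_orc("example.orc", statistics="ROWGROUP")

    file_stats, stripe_stats = cudf.io.orc.read_orc_statistics(["example.orc"])
    print(file_stats[0]["nulls"])  # expect a sum but no minimum/maximum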

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions bot added the libcudf label Aug 10, 2023
@vuule self-assigned this Aug 10, 2023
@vuule added the feature request, cuIO, and non-breaking labels Aug 10, 2023
@github-actions bot added the Python label Aug 14, 2023
Comment on lines -129 to -131
// ORC stats
uint64_t numberOfValues;
uint8_t hasNull;
@vuule (Contributor, Author)

was unused

@vuule marked this pull request as ready for review August 15, 2023 00:14
@vuule requested review from a team as code owners August 15, 2023 00:14
@wence- (Contributor) left a comment

One query about the Python test, but approving the Python changes nonetheless.

@@ -633,16 +633,19 @@ def test_orc_write_statistics(tmpdir, datadir, nrows, stats_freq):
     for col in gdf:
         if "minimum" in file_stats[0][col]:
             stats_min = file_stats[0][col]["minimum"]
-            actual_min = gdf[col].min()
-            assert normalized_equals(actual_min, stats_min)
+            if stats_min is not None:
@wence- (Contributor)
Under what circumstances does read_orc_statistics now return None in these slots when it didn't before?

@vuule (Contributor, Author)

Great question!
The change in behavior is for columns that contain only nulls. Previously we did not return any statistics for such a column, so we would not perform this comparison in the test. This PR changes the behavior so that statistics containing only the sum are included when there are no valid elements. So now the test needs to correctly check "partial" statistics, i.e. min and max are not present, but sum is.

@revans2 (Contributor) left a comment

From what I can see it looks correct, but I am not an ORC stats expert yet.

      stats_len = pb_fldlen_common + pb_fld_hdrlen + 3 * (pb_fld_hdrlen + pb_fldlen_int64);
      break;
    case dtype_date32:
      stats_len = pb_fldlen_common + pb_fld_hdrlen + 2 * (pb_fld_hdrlen + pb_fldlen_int64);
@vuule (Contributor, Author)

Date statistics don't have a sum; they used to be wrongly grouped with the integer types.

@vuule (Contributor, Author) commented Sep 1, 2023

/ok to test

@vuule (Contributor, Author) commented Sep 5, 2023

/ok to test

@vuule marked this pull request as ready for review September 5, 2023 18:46
@vyasr (Contributor) left a comment

Everything seems solid, but I don't have much background on this code. I left a few questions but they're mostly for my benefit.

@@ -1858,8 +1858,8 @@ __device__ std::pair<void const*, uint32_t> get_extremum(statistics_val const* s
     }
     case dtype_int64:
     case dtype_timestamp64:
-    case dtype_float64:
-    case dtype_decimal64: return {stats_val, sizeof(int64_t)};
+    case dtype_float64: return {stats_val, sizeof(int64_t)};
@vyasr (Contributor)

So before decimal64 was being treated like int/float, and now it's instead being treated like decimal128?

@vuule (Contributor, Author)

Yes. It used to use int64, now it uses int128.
The specs specify (as they do) that the sum of int columns should be left out if it overflows int64_t, so we use this type and check for possible overflow.
The decimal sum does not have this limitation (it's saved as a string), so we want to use the largest possible type.
Now that I'm going over this again, we could have kept the dec64 min/max at int64 and only used int128 for the sum. I don't think this detail is impactful, and I like the consistency between min/max and sum for decimal types with the currently used types.

@@ -125,7 +125,7 @@ class extrema_type {
 
   using non_arithmetic_extrema_type = typename std::conditional_t<
     cudf::is_fixed_point<T>() or cudf::is_duration<T>() or cudf::is_timestamp<T>(),
-    typename std::conditional_t<std::is_same_v<T, numeric::decimal128>, __int128_t, int64_t>,
+    typename std::conditional_t<cudf::is_fixed_point<T>(), __int128_t, int64_t>,
@vyasr (Contributor)

I assume this is related to the change in page_enc.cu where decimal64 is now treated the same as decimal128? Do we need any logic to cast down results for 64 bit decimals vs 128?

@vuule (Contributor, Author)

Yup, using int128 for all decimal types.
We cast elements up to int128 and never cast down, AFAICT. The resulting number is converted to a string. Let me know if I misunderstood the question.

      auto const sum_size = fixed_point_string_size(chunks[idx].sum.d128_val, scale);
      // common + total field length + encoded string lengths + strings
      stats_len = pb_fldlen_common + pb_fld_hdrlen32 + 3 * (pb_fld_hdrlen + pb_fld_hdrlen32) +
                  min_size + max_size + sum_size;
@vyasr (Contributor)

So the new decimal statistics are min, max, and sum, and now we're reserving sufficient new space for them?

@vuule (Contributor, Author)

Yes, the previous computation of stats_len was basically unused, since we did not write the stats we left space for.
The dtype_string case below has similar logic, as it also stores strings. The difference is that there the string lengths are known from the column, and the sum is a number, not a string.

@@ -186,6 +205,15 @@ __device__ inline uint8_t* pb_put_fixed64(uint8_t* p, uint32_t id, void const* r
   return p + 9;
 }
 
+// Splits a nanosecond timestamp into milliseconds and nanoseconds
+__device__ std::pair<int64_t, int32_t> split_nanosecond_timestamp(int64_t nano_count)
@vyasr (Contributor)

Why are the milliseconds encoded as 64 bit while nanoseconds are 32 bit? Is it because >1e6 ns adds to the ms, whereas the ms can grow unbounded?

@vuule (Contributor, Author)

That's correct. The nanoseconds part is in the [0, 999999] range.
I'm open to suggestions to improve the naming/comment here; I also wasn't 100% happy with the clarity.
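
The stated invariant is easy to express with floor division (a Python sketch of the same split; the real implementation is device-side C++):

    def split_nanosecond_timestamp(nano_count: int) -> tuple[int, int]:
        # Floor division keeps the nanosecond remainder in [0, 999999],
        # including for negative (pre-epoch) timestamps.
        millis, nanos = divmod(nano_count, 1_000_000)
        return millis, nanos

    assert split_nanosecond_timestamp(1_234_567_891) == (1234, 567_891)
    assert split_nanosecond_timestamp(-1) == (-1, 999_999)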

@karthikeyann (Contributor)

/ok to test

@karthikeyann (Contributor)

/ok to test

@karthikeyann (Contributor)

/ok to test

@karthikeyann (Contributor) left a comment

Looks good to me. Nice work, closing 3 issues with 1 PR!
Very minor nitpicks.

cpp/tests/io/orc_test.cpp (outdated; resolved)
cpp/src/io/orc/stats_enc.cu (outdated; resolved)
@vuule (Contributor, Author) commented Sep 18, 2023

/ok to test

@vuule added the 5 - Ready to Merge label Sep 18, 2023
@vuule (Contributor, Author) commented Sep 18, 2023

/merge
