
Fix timestamp truncation/overflow bugs in orc/parquet #9382

Merged

10 commits merged into branch-21.12 on Oct 7, 2021

Conversation

@PointKernel (Member) commented Oct 6, 2021

Closes #9365

This PR removes the integer overflow issues, along with the clock rate logic, by operating directly on the timestamp type id. It also fixes a truncation bug in Parquet. Corresponding unit tests are added.

@PointKernel PointKernel added bug Something isn't working 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue Spark Functionality that helps Spark RAPIDS non-breaking Non-breaking change labels Oct 6, 2021
@PointKernel PointKernel requested review from vuule and a team October 6, 2021 00:06
@PointKernel PointKernel self-assigned this Oct 6, 2021
@PointKernel PointKernel requested review from ttnghia and removed request for a team October 6, 2021 00:06
@PointKernel PointKernel requested a review from a team as a code owner October 6, 2021 00:15
@PointKernel (Member, Author) commented:

While working on this PR, I realized the clock rate logic should not be needed at all if we use chrono properly. I will create a separate PR to remove the clock rate logic from Parquet as well.

@codecov bot commented Oct 6, 2021

Codecov Report

Merging #9382 (3abc032) into branch-21.12 (ab4bfaa) will decrease coverage by 0.04%.
The diff coverage is 0.00%.

Impacted file tree graph

@@               Coverage Diff                @@
##           branch-21.12    #9382      +/-   ##
================================================
- Coverage         10.79%   10.75%   -0.04%     
================================================
  Files               116      116              
  Lines             18869    19482     +613     
================================================
+ Hits               2036     2096      +60     
- Misses            16833    17386     +553     
Impacted Files Coverage Δ
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/_lib/__init__.py 0.00% <ø> (ø)
python/cudf/cudf/core/_base_index.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/categorical.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/column.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/datetime.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/lists.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/numerical.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/string.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/timedelta.py 0.00% <0.00%> (ø)
... and 77 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 56edd42...3abc032. Read the comment docs.

@jlowe (Member) commented Oct 6, 2021

I tried out the diff in this PR locally, but the RAPIDS Accelerator integration tests for ORC reading are still failing, so something else must be amiss as well.

Attached is a sample ORC file I saved off that was generated from one of our tests. Here's an excerpt from the Spark shell session showing what the CPU expects and what we're getting from the GPU reader instead for the date and timestamp columns (columns _c8 and _c9) in the file.

part-00000-7b5d6ab5-263d-4e01-aa2f-2ad1b2ebf691-c000.snappy.orc.gz

CPU:

scala> spark.read.orc("/tmp/ORC_DATA/part-00000-7b5d6ab5-263d-4e01-aa2f-2ad1b2ebf691-c000.snappy.orc").select("_c8", "_c9").show(truncate=false)
+----------+-----------------------+
|_c8       |_c9                    |
+----------+-----------------------+
|3272-03-01|null                   |
|8200-07-22|3279-03-10 07:07:00.82 |
|8331-09-28|3288-02-15 07:59:25.442|
|4125-12-02|4714-05-27 21:58:16.447|
|7123-02-13|7596-06-30 22:18:53.293|
|2162-07-16|5292-11-25 23:32:30.557|
|1840-11-12|8724-02-16 17:18:00.92 |
|6590-06-07|3180-04-16 12:44:33    |
|7115-08-07|8706-10-05 02:05:56.617|
|9651-07-18|5618-02-24 06:40:53.714|
|4952-03-01|4460-08-21 23:10:31.63 |
|5063-02-16|4557-05-01 22:44:42.203|
|2799-02-22|4682-03-11 21:03:18.361|
|1896-02-09|null                   |
|5584-11-22|8895-03-05 11:16:47.691|
|8260-04-01|9596-12-08 21:11:00.822|
|3357-01-31|9869-09-11 17:19:06.272|
|4536-06-01|6777-10-15 21:46:28.186|
|9264-01-08|5805-12-26 17:37:39.004|
|9484-08-29|9914-01-25 07:48:47.401|
+----------+-----------------------+
only showing top 20 rows

GPU:

scala> spark.read.orc("/tmp/ORC_DATA/part-00000-7b5d6ab5-263d-4e01-aa2f-2ad1b2ebf691-c000.snappy.orc").select("_c8", "_c9").show(truncate=false)
+----------+--------------------------+
|_c8       |_c9                       |
+----------+--------------------------+
|3272-03-01|null                      |
|8200-07-22|2110-01-30 07:57:53.400896|
|8331-09-28|2119-01-07 08:50:18.022896|
|4125-12-02|1791-08-19 00:05:27.899242|
|7123-02-13|1750-12-16 02:33:16.197484|
|2162-07-16|1785-07-31 02:05:08.299691|
|1840-11-12|1709-06-23 22:23:16.405381|
|6590-06-07|2011-03-08 13:35:25.580896|
|7115-08-07|1692-02-09 07:11:12.102381|
|9651-07-18|2110-10-29 09:13:31.45669 |
|4952-03-01|2122-06-05 00:52:16.791793|
|5063-02-16|2219-02-13 00:26:27.364793|
|2799-02-22|1759-06-03 23:10:29.813242|
|1896-02-09|null                      |
|5584-11-22|1880-07-11 16:22:03.176381|
|8260-04-01|1997-09-26 02:41:42.597828|
|3357-01-31|1685-12-08 23:15:14.338278|
|4536-06-01|2101-05-11 01:09:58.509587|
|9264-01-08|1714-02-08 20:35:43.037139|
|9484-08-29|1730-04-23 13:44:55.467278|
+----------+--------------------------+
only showing top 20 rows

We've also been seeing issues with timestamps in Parquet in the RAPIDS Accelerator tests, and I verified that reverting #9278 in my local cudf repo fixes the test failures for both ORC and Parquet.

@PointKernel (Member, Author) commented:

Right, it's clear that all of the ORC failures are due to the integer overflow issue: our timestamps are backed by int64_t and thus cannot represent large timestamps, such as 4000 years in nanoseconds:

4000 * 365 * 24 * 60 * 60 * 1000000000 = 1.26144e+20

which is larger than the int64_t max value of roughly 9.223e+18.

Just to make sure, are you still using nanoseconds as timestamp types when testing part-00000-7b5d6ab5-263d-4e01-aa2f-2ad1b2ebf691-c000.snappy.orc?

cpp/tests/io/orc_test.cpp (review thread, outdated, resolved)
cpp/src/io/orc/stripe_data.cu (review thread, outdated, resolved)
@jlowe (Member) commented Oct 6, 2021

Just to make sure, are you still using nanoseconds as timestamp types when testing part-00000-7b5d6ab5-263d-4e01-aa2f-2ad1b2ebf691-c000.snappy.orc?

The RAPIDS Accelerator always requests timestamps be read in as TIMESTAMP_MICROSECONDS, as that matches how Spark tracks timestamps internally.

@PointKernel (Member, Author) commented Oct 6, 2021

@jlowe Removing the #9278 commit from branch-21.12 does not change the GPU loading result on my end. It's still the same as:

+----------+--------------------------+
|_c8       |_c9                       |
+----------+--------------------------+
|3272-03-01|null                      |
|8200-07-22|2110-01-30 07:57:53.400896|
|8331-09-28|2119-01-07 08:50:18.022896|
|4125-12-02|1791-08-19 00:05:27.899242|
|7123-02-13|1750-12-16 02:33:16.197484|
|2162-07-16|1785-07-31 02:05:08.299691|
|1840-11-12|1709-06-23 22:23:16.405381|
|6590-06-07|2011-03-08 13:35:25.580896|
|7115-08-07|1692-02-09 07:11:12.102381|
|9651-07-18|2110-10-29 09:13:31.45669 |
|4952-03-01|2122-06-05 00:52:16.791793|
|5063-02-16|2219-02-13 00:26:27.364793|
|2799-02-22|1759-06-03 23:10:29.813242|
...

Did I miss something here?

  cudf_io::orc_reader_options read_opts =
    cudf_io::orc_reader_options::builder(cudf_io::source_info{"./part.orc"});
  auto res = cudf_io::read_orc(read_opts);
  cudf::test::print(res.tbl->get_column(9).view(), std::cout, ",\n");

Output:

NULL,
2110-01-30T07:57:53Z,
2119-01-07T08:50:18Z,
1791-08-19T00:05:27Z,
1750-12-16T02:33:16Z,
1785-07-31T02:05:08Z,
1709-06-23T22:23:16Z,
2011-03-08T13:35:25Z,
1692-02-09T07:11:12Z,
2110-10-29T09:13:31Z,
2122-06-05T00:52:16Z,
2219-02-13T00:26:27Z,
1759-06-03T23:10:29Z,

@PointKernel PointKernel changed the title Fix orc timestamp bug Fix timestamp truncation/overflow bugs in orc/parquet Oct 6, 2021
@jlowe (Member) commented Oct 6, 2021

Did I miss something here?

I double-checked, and reverting #9278 from 21.12 fixes the GPU load of the file I attached. Here's the output from a GPU load on Spark using the RAPIDS Accelerator plugin from a libcudf 21.12 build and that PR reverted:

scala> spark.read.orc("/home/jlowe/delmee/669820/ORC_DATA/part-00000-7b5d6ab5-263d-4e01-aa2f-2ad1b2ebf691-c000.snappy.orc").select("_c8", "_c9").show(truncate=false)
+----------+-----------------------+                                            
|_c8       |_c9                    |
+----------+-----------------------+
|3272-03-01|null                   |
|8200-07-22|3279-03-10 07:07:00.82 |
|8331-09-28|3288-02-15 07:59:25.442|
|4125-12-02|4714-05-27 21:58:16.447|
|7123-02-13|7596-06-30 22:02:53.293|
|2162-07-16|5292-11-25 23:32:30.557|
|1840-11-12|8724-02-16 17:02:00.92 |
|6590-06-07|3180-04-16 12:44:33    |
|7115-08-07|8706-10-05 02:49:56.617|
|9651-07-18|5618-02-24 06:40:53.714|
|4952-03-01|4460-08-21 23:10:31.63 |
|5063-02-16|4557-05-01 22:44:42.203|
|2799-02-22|4682-03-11 21:03:18.361|
|1896-02-09|null                   |
|5584-11-22|8895-03-05 11:00:47.691|
|8260-04-01|9596-12-08 21:55:00.822|
|3357-01-31|9869-09-11 17:03:06.272|
|4536-06-01|6777-10-15 21:30:28.186|
|9264-01-08|5805-12-26 17:37:39.004|
|9484-08-29|9914-01-25 07:32:47.401|
+----------+-----------------------+
only showing top 20 rows

@jlowe (Member) commented Oct 6, 2021

I tried the latest changes on this PR, and they fix the Spark ORC and Parquet integration tests that we saw failing before. Thanks, @PointKernel!

@vuule (Contributor) left a comment:


Fix looks good, just a few comments on the tests.

cpp/tests/io/orc_test.cpp (two review threads, outdated, resolved)
@vuule (Contributor) left a comment:


Looks good. It would be good to add the issue number to the comment for easier tracking.

cpp/tests/io/parquet_test.cpp (review thread, outdated, resolved)
@PointKernel (Member, Author) commented:

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 8203d3d into rapidsai:branch-21.12 Oct 7, 2021
@PointKernel PointKernel deleted the fix-orc-timestamp-bug branch November 4, 2021 18:34
Merging this pull request may close: [BUG] ORC timestamps loaded with specified timestamp type are corrupted