
Fix timestamp truncation/overflow bugs in orc/parquet #9382

Merged

10 commits merged into branch-21.12 on Oct 7, 2021

Conversation

@PointKernel (Member) commented Oct 6, 2021

Closes #9365

This PR removes the integer overflow issues, along with the clock rate logic, by operating directly on the timestamp type id. It also fixes a truncation bug in Parquet. Corresponding unit tests are added.

@PointKernel PointKernel added bug Something isn't working 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue Spark Functionality that helps Spark RAPIDS non-breaking Non-breaking change labels Oct 6, 2021
@PointKernel PointKernel requested review from vuule and a team October 6, 2021 00:06
@PointKernel PointKernel self-assigned this Oct 6, 2021
@PointKernel PointKernel requested review from ttnghia and removed request for a team October 6, 2021 00:06
@PointKernel PointKernel requested a review from a team as a code owner October 6, 2021 00:15
@PointKernel (Member, Author) commented:

While working on this PR, I realized the clock rate logic should not be needed at all if we use chrono properly. I will create a separate PR to remove the clock rate logic from Parquet as well.

@codecov bot commented Oct 6, 2021

Codecov Report

Merging #9382 (3abc032) into branch-21.12 (ab4bfaa) will decrease coverage by 0.04%.
The diff coverage is 0.00%.

Impacted file tree graph

@@               Coverage Diff                @@
##           branch-21.12    #9382      +/-   ##
================================================
- Coverage         10.79%   10.75%   -0.04%     
================================================
  Files               116      116              
  Lines             18869    19482     +613     
================================================
+ Hits               2036     2096      +60     
- Misses            16833    17386     +553     
Impacted Files Coverage Δ
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/_lib/__init__.py 0.00% <ø> (ø)
python/cudf/cudf/core/_base_index.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/categorical.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/column.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/datetime.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/lists.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/numerical.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/string.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/timedelta.py 0.00% <0.00%> (ø)
... and 77 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 56edd42...3abc032. Read the comment docs.

@jlowe (Member) commented Oct 6, 2021

I tried out the diff in this PR locally, but the RAPIDS Accelerator integration tests for ORC reading are still failing, so something else must be amiss as well.

Attached is a sample ORC file I saved off that was generated from one of our tests. Here's an excerpt from the Spark shell session showing what the CPU expects and what we're getting from the GPU reader instead for the date and timestamp columns (columns _c8 and _c9) in the file.

part-00000-7b5d6ab5-263d-4e01-aa2f-2ad1b2ebf691-c000.snappy.orc.gz

CPU:

scala> spark.read.orc("/tmp/ORC_DATA/part-00000-7b5d6ab5-263d-4e01-aa2f-2ad1b2ebf691-c000.snappy.orc").select("_c8", "_c9").show(truncate=false)
+----------+-----------------------+
|_c8       |_c9                    |
+----------+-----------------------+
|3272-03-01|null                   |
|8200-07-22|3279-03-10 07:07:00.82 |
|8331-09-28|3288-02-15 07:59:25.442|
|4125-12-02|4714-05-27 21:58:16.447|
|7123-02-13|7596-06-30 22:18:53.293|
|2162-07-16|5292-11-25 23:32:30.557|
|1840-11-12|8724-02-16 17:18:00.92 |
|6590-06-07|3180-04-16 12:44:33    |
|7115-08-07|8706-10-05 02:05:56.617|
|9651-07-18|5618-02-24 06:40:53.714|
|4952-03-01|4460-08-21 23:10:31.63 |
|5063-02-16|4557-05-01 22:44:42.203|
|2799-02-22|4682-03-11 21:03:18.361|
|1896-02-09|null                   |
|5584-11-22|8895-03-05 11:16:47.691|
|8260-04-01|9596-12-08 21:11:00.822|
|3357-01-31|9869-09-11 17:19:06.272|
|4536-06-01|6777-10-15 21:46:28.186|
|9264-01-08|5805-12-26 17:37:39.004|
|9484-08-29|9914-01-25 07:48:47.401|
+----------+-----------------------+
only showing top 20 rows

GPU:

scala> spark.read.orc("/tmp/ORC_DATA/part-00000-7b5d6ab5-263d-4e01-aa2f-2ad1b2ebf691-c000.snappy.orc").select("_c8", "_c9").show(truncate=false)
+----------+--------------------------+
|_c8       |_c9                       |
+----------+--------------------------+
|3272-03-01|null                      |
|8200-07-22|2110-01-30 07:57:53.400896|
|8331-09-28|2119-01-07 08:50:18.022896|
|4125-12-02|1791-08-19 00:05:27.899242|
|7123-02-13|1750-12-16 02:33:16.197484|
|2162-07-16|1785-07-31 02:05:08.299691|
|1840-11-12|1709-06-23 22:23:16.405381|
|6590-06-07|2011-03-08 13:35:25.580896|
|7115-08-07|1692-02-09 07:11:12.102381|
|9651-07-18|2110-10-29 09:13:31.45669 |
|4952-03-01|2122-06-05 00:52:16.791793|
|5063-02-16|2219-02-13 00:26:27.364793|
|2799-02-22|1759-06-03 23:10:29.813242|
|1896-02-09|null                      |
|5584-11-22|1880-07-11 16:22:03.176381|
|8260-04-01|1997-09-26 02:41:42.597828|
|3357-01-31|1685-12-08 23:15:14.338278|
|4536-06-01|2101-05-11 01:09:58.509587|
|9264-01-08|1714-02-08 20:35:43.037139|
|9484-08-29|1730-04-23 13:44:55.467278|
+----------+--------------------------+
only showing top 20 rows

We've also been seeing issues with timestamps in Parquet in the RAPIDS Accelerator tests, and I verified that reverting #9278 in my local cudf repo fixes the test failures for both ORC and Parquet.

@PointKernel (Member, Author) commented:

Right, it's clear that all of the ORC failures are due to the integer overflow issue: our timestamps are backed by int64_t and thus cannot represent large timestamps, such as 4000 years in nanoseconds:

4000 * 365 * 24 * 60 * 60 * 1000000000 = 1.26144e+20

which is larger than the int64_t max value of roughly 9.223e+18.

Just to make sure, are you still using nanoseconds as timestamp types when testing part-00000-7b5d6ab5-263d-4e01-aa2f-2ad1b2ebf691-c000.snappy.orc?

cpp/tests/io/orc_test.cpp (review thread, outdated, resolved)
cpp/src/io/orc/stripe_data.cu (review thread, outdated, resolved)
@jlowe (Member) commented Oct 6, 2021

Just to make sure, are you still using nanoseconds as timestamp types when testing part-00000-7b5d6ab5-263d-4e01-aa2f-2ad1b2ebf691-c000.snappy.orc?

The RAPIDS Accelerator always requests timestamps be read in as TIMESTAMP_MICROSECONDS, as that matches how Spark tracks timestamps internally.

@PointKernel (Member, Author) commented Oct 6, 2021

@jlowe Removing the #9278 commit from branch-21.12 does not change the GPU loading result on my end. It's still the same as:

+----------+--------------------------+
|_c8       |_c9                       |
+----------+--------------------------+
|3272-03-01|null                      |
|8200-07-22|2110-01-30 07:57:53.400896|
|8331-09-28|2119-01-07 08:50:18.022896|
|4125-12-02|1791-08-19 00:05:27.899242|
|7123-02-13|1750-12-16 02:33:16.197484|
|2162-07-16|1785-07-31 02:05:08.299691|
|1840-11-12|1709-06-23 22:23:16.405381|
|6590-06-07|2011-03-08 13:35:25.580896|
|7115-08-07|1692-02-09 07:11:12.102381|
|9651-07-18|2110-10-29 09:13:31.45669 |
|4952-03-01|2122-06-05 00:52:16.791793|
|5063-02-16|2219-02-13 00:26:27.364793|
|2799-02-22|1759-06-03 23:10:29.813242|
...

Did I miss something here?

  cudf_io::orc_reader_options read_opts =
    cudf_io::orc_reader_options::builder(cudf_io::source_info{"./part.orc"});
  auto res = cudf_io::read_orc(read_opts);
  cudf::test::print(res.tbl->get_column(9).view(), std::cout, ",\n");

Output:

NULL,
2110-01-30T07:57:53Z,
2119-01-07T08:50:18Z,
1791-08-19T00:05:27Z,
1750-12-16T02:33:16Z,
1785-07-31T02:05:08Z,
1709-06-23T22:23:16Z,
2011-03-08T13:35:25Z,
1692-02-09T07:11:12Z,
2110-10-29T09:13:31Z,
2122-06-05T00:52:16Z,
2219-02-13T00:26:27Z,
1759-06-03T23:10:29Z,

@PointKernel PointKernel changed the title Fix orc timestamp bug Fix timestamp truncation/overflow bugs in orc/parquet Oct 6, 2021
@jlowe (Member) commented Oct 6, 2021

Did I miss something here?

I double-checked, and reverting #9278 from 21.12 fixes the GPU load of the file I attached. Here's the output from a GPU load on Spark using the RAPIDS Accelerator plugin from a libcudf 21.12 build and that PR reverted:

scala> spark.read.orc("/home/jlowe/delmee/669820/ORC_DATA/part-00000-7b5d6ab5-263d-4e01-aa2f-2ad1b2ebf691-c000.snappy.orc").select("_c8", "_c9").show(truncate=false)
+----------+-----------------------+                                            
|_c8       |_c9                    |
+----------+-----------------------+
|3272-03-01|null                   |
|8200-07-22|3279-03-10 07:07:00.82 |
|8331-09-28|3288-02-15 07:59:25.442|
|4125-12-02|4714-05-27 21:58:16.447|
|7123-02-13|7596-06-30 22:02:53.293|
|2162-07-16|5292-11-25 23:32:30.557|
|1840-11-12|8724-02-16 17:02:00.92 |
|6590-06-07|3180-04-16 12:44:33    |
|7115-08-07|8706-10-05 02:49:56.617|
|9651-07-18|5618-02-24 06:40:53.714|
|4952-03-01|4460-08-21 23:10:31.63 |
|5063-02-16|4557-05-01 22:44:42.203|
|2799-02-22|4682-03-11 21:03:18.361|
|1896-02-09|null                   |
|5584-11-22|8895-03-05 11:00:47.691|
|8260-04-01|9596-12-08 21:55:00.822|
|3357-01-31|9869-09-11 17:03:06.272|
|4536-06-01|6777-10-15 21:30:28.186|
|9264-01-08|5805-12-26 17:37:39.004|
|9484-08-29|9914-01-25 07:32:47.401|
+----------+-----------------------+
only showing top 20 rows

@jlowe (Member) commented Oct 6, 2021

I tried the latest changes on this PR, and they fix the Spark ORC and Parquet integration tests that we saw failing before. Thanks, @PointKernel!

@vuule (Contributor) left a comment:


Fix looks good, just a few comments on the tests.

cpp/tests/io/orc_test.cpp (two review threads, outdated, resolved)
@vuule (Contributor) left a comment:


Looks good. It would be good to add the issue number to the comment for easier tracking.

cpp/tests/io/parquet_test.cpp (review thread, outdated, resolved)
@PointKernel (Member, Author) commented:

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 8203d3d into rapidsai:branch-21.12 Oct 7, 2021
@PointKernel PointKernel deleted the fix-orc-timestamp-bug branch November 4, 2021 18:34
Merging this pull request may close: [BUG] ORC timestamps loaded with specified timestamp type are corrupted