Fix timestamp truncation/overflow bugs in orc/parquet #9382
Conversation
By working on this PR, I realized the clock rate logic should not be there if we use
Codecov Report

```diff
@@            Coverage Diff             @@
##       branch-21.12    #9382    +/-   ##
================================================
- Coverage       10.79%   10.75%   -0.04%
================================================
  Files             116      116
  Lines           18869    19482    +613
================================================
+ Hits             2036     2096     +60
- Misses          16833    17386    +553
================================================
```

Continue to review full report at Codecov.
I tried out the diff in this PR locally, but the RAPIDS Accelerator integration tests for ORC reading are still failing, so something else must be amiss as well. Attached is a sample ORC file I saved off that was generated from one of our tests: part-00000-7b5d6ab5-263d-4e01-aa2f-2ad1b2ebf691-c000.snappy.orc.gz

Here's an excerpt from the Spark shell session showing what the CPU expects and what we're getting from the GPU reader instead for the date and timestamp columns (columns _c8 and _c9) in the file.

CPU:

GPU:

We've also been seeing issues with timestamps in Parquet in the RAPIDS Accelerator tests, and I verified that reverting #9278 in my local cudf repo fixes the test failures for both ORC and Parquet.
Right, it's clear that all the ORC failures are due to the integer overflow issue, because our timestamps are essentially

which is larger than

Just to make sure, are you still using nanoseconds as timestamp types when testing?
The RAPIDS Accelerator always requests timestamps be read in as
@jlowe Removing #9278 commit from
Did I miss something here?

```cpp
cudf_io::orc_reader_options read_opts =
    cudf_io::orc_reader_options::builder(cudf_io::source_info{"./part.orc"});
auto res = cudf_io::read_orc(read_opts);
cudf::test::print(res.tbl->get_column(9).view(), std::cout, ",\n");
```

Output:
I double-checked, and reverting #9278 from 21.12 fixes the GPU load of the file I attached. Here's the output from a GPU load on Spark using the RAPIDS Accelerator plugin from a libcudf 21.12 build and that PR reverted:
I tried the latest changes on this PR, and they fix the Spark ORC and Parquet integration tests that we saw failing before. Thanks, @PointKernel!
Fix looks good, just a few comments on the tests.
Looks good; it would be good to add the issue # to the comment for easier tracking.
@gpucibot merge
Closes #9365
This PR gets rid of the integer overflow issues, along with the clock rate logic, by operating directly on the timestamp type id. It also fixes a truncation bug in Parquet. Corresponding unit tests are added.