[REVIEW] duration - parquet read, write support #5903
Please update the changelog in order to start CI tests. View the gpuCI docs here.
Codecov Report
@@            Coverage Diff            @@
##        branch-0.15    #5903   +/-   ##
Coverage       84.53%   84.54%
Files              81       81
Lines           13865    13869    +4
+ Hits          11721    11725    +4
Misses           2144     2144
👍
cpp/src/io/parquet/writer_impl.cu
Outdated
_physical_type = Type::INT64;
_converted_type = ConvertedType::TIME_MICROS;
_stats_dtype = statistics_dtype::dtype_int64;
_ts_scale = -1000;
Potentially a comment down below on why -1000 vs 1000 for the int32_t _ts_scale; member.
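To make the sign question concrete, here is a minimal sketch of one possible convention. It is an assumption, not something this thread confirms: a negative `_ts_scale` is taken to mean "divide by the absolute value" (for example nanoseconds stored as TIME_MICROS), a positive value to mean "multiply", and zero to mean "no scaling". The helper name `apply_ts_scale` is hypothetical and is not part of the cuIO writer.

```cpp
#include <cstdint>

// Hypothetical helper illustrating an ASSUMED sign convention for a scale
// member like `int32_t _ts_scale;`:
//   scale < 0  -> divide the value by |scale|   (e.g. ns -> us with -1000)
//   scale > 0  -> multiply the value by scale   (e.g. s  -> ms with  1000)
//   scale == 0 -> leave the value unchanged
constexpr std::int64_t apply_ts_scale(std::int64_t value, std::int32_t ts_scale)
{
  return ts_scale < 0   ? value / -static_cast<std::int64_t>(ts_scale)
         : ts_scale > 0 ? value * static_cast<std::int64_t>(ts_scale)
                        : value;
}

// 5000 ns scaled by -1000 becomes 5 us under the assumed convention.
static_assert(apply_ts_scale(5000, -1000) == 5, "negative scale divides");
// 5 s scaled by 1000 becomes 5000 ms.
static_assert(apply_ts_scale(5, 1000) == 5000, "positive scale multiplies");
// Integer division truncates toward zero, so sub-unit precision is dropped.
static_assert(apply_ts_scale(-1500, -1000) == -1, "truncates toward zero");
```

Whatever the real convention is, a one-line comment next to the assignment stating it would answer the question for future readers.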
This change looks fine to me, but from the documentation it looks like TIME_* was intended for a time of day without a date, not a time duration. Spark uses INTERVAL for all duration values, and I don't know what it would do if it encountered a TIME_* value; I guess we should test this. According to the Drill documentation, TIME_MILLIS specifically is "Logical time, not including the date. Annotates int32. Number of milliseconds after midnight." https://drill.apache.org/docs/parquet-format/ The Arrow code base indicates this too. However, MATLAB apparently treats them like a duration: https://www.mathworks.com/help/matlab/import_export/datatype-mappings-matlab-parquet.html
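As a small illustration of the distinction raised here: if a reader interprets TIME_MILLIS strictly as "milliseconds after midnight", then a general duration only fits that interpretation when it is non-negative and shorter than one day. The sketch below uses only `std::chrono`; the helper name `fits_time_of_day_ms` is hypothetical and says nothing about how cuIO or any particular reader actually validates values.

```cpp
#include <chrono>
#include <cstdint>

// Number of milliseconds in one day: 24 h * 60 min * 60 s * 1000 ms.
constexpr std::int64_t ms_per_day = 24LL * 60 * 60 * 1000;

// Hypothetical check: does a duration also make sense as a time of day,
// i.e. as a count of milliseconds after midnight within a single day?
constexpr bool fits_time_of_day_ms(std::chrono::milliseconds d)
{
  return d.count() >= 0 && d.count() < ms_per_day;
}

// 23 hours is a valid time of day; 25 hours and negative durations are not.
static_assert(fits_time_of_day_ms(std::chrono::hours(23)), "within one day");
static_assert(!fits_time_of_day_ms(std::chrono::hours(25)), "longer than a day");
static_assert(!fits_time_of_day_ms(std::chrono::milliseconds(-1)), "negative");
```

Durations outside that range are perfectly valid as durations, which is why a consumer that enforces the time-of-day reading could reject or misinterpret them.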
I should add that I am calling this out partly because the Spark code cannot use this and we will require support for INTERVAL. Spark treats INTERVAL types like a struct, so perhaps once struct support is in place we can support reading INTERVAL as a struct of 3 duration values, and when writing we can detect the same pattern, possibly with a hint.
Sorry, I am wrong. Spark does not support storing its CalendarInterval type to Parquet. There was discussion on a JIRA to support it, and the mapping is very clear, but apparently it was never actually done. So you can ignore all of my comments above.
cuIO Parquet reader and writer support for duration types.
Addresses part of #5272.
Supports seconds, milliseconds, and microseconds.
Supports days and nanoseconds only inside cudf (Parquet 1.0 doesn't support these types).