Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] duration - parquet read, write support #5903

Merged
merged 7 commits into from
Aug 17, 2020

Conversation

karthikeyann
Copy link
Contributor

@karthikeyann karthikeyann commented Aug 10, 2020

cuIO parquet reader, writer support for duration types.
addresses part of #5272

supports seconds, milliseconds, microseconds
supports days, nanoseconds only inside cudf. (parquet 1.0 doesn't support these types)

@karthikeyann karthikeyann requested a review from a team as a code owner August 10, 2020 16:44
@GPUtester
Copy link
Collaborator

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@karthikeyann karthikeyann added the 2 - In Progress Currently a work in progress label Aug 10, 2020
@codecov
Copy link

codecov bot commented Aug 10, 2020

Codecov Report

Merging #5903 into branch-0.15 will increase coverage by 0.00%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff              @@
##           branch-0.15    #5903   +/-   ##
============================================
  Coverage        84.53%   84.54%           
============================================
  Files               81       81           
  Lines            13865    13869    +4     
============================================
+ Hits             11721    11725    +4     
  Misses            2144     2144           
Impacted Files Coverage Δ
python/cudf/cudf/core/column/timedelta.py 88.40% <0.00%> (+0.09%) ⬆️
python/cudf/cudf/core/column/datetime.py 88.38% <0.00%> (+0.11%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b69b25b...dc75423. Read the comment docs.

@karthikeyann karthikeyann changed the title [WIP] duration - parquet read, write [REVIEW] duration - parquet read, write Aug 11, 2020
@karthikeyann karthikeyann changed the title [REVIEW] duration - parquet read, write [REVIEW] duration - parquet read, write support Aug 11, 2020
@karthikeyann karthikeyann added 3 - Ready for Review Ready for review by team 4 - Needs cuIO Reviewer 4 - Needs Review Waiting for reviewer to review or respond and removed 2 - In Progress Currently a work in progress labels Aug 11, 2020
Copy link
Contributor

@codereport codereport left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

_physical_type = Type::INT64;
_converted_type = ConvertedType::TIME_MICROS;
_stats_dtype = statistics_dtype::dtype_int64;
_ts_scale = -1000;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially a comment down below on why -1000 vs 1000 for the int32_t _ts_scale; member.

@revans2
Copy link
Contributor

revans2 commented Aug 12, 2020

This change looks fine to me, but from the documentation it looks like TIME_* was intended to be used for a time without a date, not a time duration.

https://github.com/apache/parquet-format/blob/345282ce307a9c9dcc15b6f3c9106be379ec26ba/LogicalTypes.md#time

Spark uses INTERVAL for all duration values and I don't know what it would do if it encountered a TIME_*, I guess we should test this.
https://github.com/apache/parquet-format/blob/345282ce307a9c9dcc15b6f3c9106be379ec26ba/LogicalTypes.md#interval

TIME_* types according to drill are used for (TIME_MILLIS specifically) "Logical time, not including the date. Annotates int32. Number of milliseconds after midnight." https://drill.apache.org/docs/parquet-format/

The arrow code base indicate this too.
https://github.com/apache/arrow/blob/e2f877cab97d6e0278a91ecd5cd661f3011c7abc/rust/parquet/src/basic.rs#L91-L93

However MATLAB apparently treats them like a duration https://www.mathworks.com/help/matlab/import_export/datatype-mappings-matlab-parquet.html

@revans2
Copy link
Contributor

revans2 commented Aug 12, 2020

I should add that I am calling this out partly because the spark code cannot use this and we will require support for INTERVAL. Spark treats INTERVAL types like a struct so perhaps once struct support is in place we can do something to support reading INTERVAL as a struct of 3 duration values and when writing we can detect the same pattern, possibly with a hint.

@revans2
Copy link
Contributor

revans2 commented Aug 13, 2020

Sorry I am wrong. Spark does not support storing its CalendarInterval type to parquet. There was discussion on a JIRA to support it, and the mapping is very clear, but apparently it never was actually done. So you can ignore all of my comments above.

@harrism harrism merged commit 28216fd into rapidsai:branch-0.15 Aug 17, 2020
@karthikeyann karthikeyann mentioned this pull request Aug 21, 2020
15 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants