Support `arrow:schema` in Parquet writer to faithfully roundtrip `duration` types with Arrow #15875
Conversation
Generally looks great. Just a couple small things so far.
Python changes look good to me (just 1 comment).
Might want to have @vyasr or @galipremsagar take another look though.
great stuff!
just a few comments
switch (column.type().id()) {
  case type_id::DECIMAL32:
    // Convert data to decimal128 type
    d128_vectors.emplace_back(convert_data_to_decimal128<int32_t>(column, stream));
Merging these into a single kernel (per decimal type) would be great for performance with many decimal columns, but looks like it would not fit well into the recursive implementation.
How about a stream pool? 4-8 streams used in round-robin order might help when we have to convert many decimal columns.
Tracking this in a separate issue #16194.
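For illustration, here is a minimal sketch of the round-robin stream-pool idea discussed above. The driver function, its signature, and the forward declaration of `convert_data_to_decimal128` are hypothetical; only the helper's name and usage come from the diff above, and its actual return type may differ.

```cpp
#include <cudf/column/column_view.hpp>
#include <cudf/types.hpp>
#include <rmm/cuda_stream_pool.hpp>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_uvector.hpp>

#include <cstdint>
#include <vector>

// Assumed declaration of the helper shown in the diff above (return type may differ).
template <typename T>
rmm::device_uvector<__int128_t> convert_data_to_decimal128(cudf::column_view const& column,
                                                           rmm::cuda_stream_view stream);

// Hypothetical driver: fan decimal32/decimal64 conversions out over a small stream pool.
void convert_decimals_round_robin(std::vector<cudf::column_view> const& columns,
                                  std::vector<rmm::device_uvector<__int128_t>>& d128_vectors)
{
  rmm::cuda_stream_pool pool{8};  // 4-8 streams, as suggested above

  for (auto const& column : columns) {
    auto const stream = pool.get_stream();  // round-robin acquisition
    switch (column.type().id()) {
      case cudf::type_id::DECIMAL32:
        d128_vectors.emplace_back(convert_data_to_decimal128<int32_t>(column, stream));
        break;
      case cudf::type_id::DECIMAL64:
        d128_vectors.emplace_back(convert_data_to_decimal128<int64_t>(column, stream));
        break;
      default: break;  // decimal128 and non-decimal columns need no conversion
    }
  }

  // Join all pool streams before the writer consumes d128_vectors on its own stream.
  for (std::size_t i = 0; i < pool.get_pool_size(); ++i) {
    pool.get_stream(i).synchronize();
  }
}
```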
@@ -322,6 +322,9 @@
    output_as_binary : set, optional, default None
        If a column name is present in the set, that column will be output as
        unannotated binary, rather than the default 'UTF-8'.
    store_schema : bool, default False
Is there a reason we think this should be False by default? It seems like faithful roundtrips with Arrow would be a benefit by default. However, it seems like enabling this feature will cast / convert some data types (e.g. "days" aren't supported, and only decimal128 is supported -- if I read the rest of this PR correctly). Are those conversions potentially lossy / do they change metadata? If so, are those conversions worth documenting?
There are a couple of reasons why we chose to set it to False by default. It's actually great that you asked, since this way it gets documented here.
First, `arrow:schema` should only be used when we want to round-trip certain column types (primarily durations for now) with Arrow; otherwise, cuDF round-trips with itself and with Arrow perfectly fine.
Second, cuDF still supports INT96 timestamps, as Spark has been actively using them. Enabling `store_schema` by default would break Spark's existing and future workflows, requiring them to set it to False whenever using INT96 timestamps.
Third, as you mentioned, with `arrow:schema` enabled we can't round-trip decimal32 and decimal64 with cuDF itself without (losslessly) converting them to decimal128.
To summarize, things work perfectly fine without `arrow:schema` for the most part; it only needs to be enabled when we want to faithfully round-trip `duration` types with Arrow.
@galipremsagar please feel free to add any reasons that we discussed during the cuIO standup meeting a couple weeks ago.
Thanks! This kind of information would be good to have in the docstrings!
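For reference, a minimal sketch of opting in from libcudf, assuming the `parquet_writer_options` builder exposes the `write_arrow_schema` setter described in this PR (the Python `store_schema` argument corresponds to it); the surrounding function is illustrative only.

```cpp
#include <cudf/io/parquet.hpp>
#include <cudf/io/types.hpp>
#include <cudf/table/table_view.hpp>

#include <string>

// Opt in explicitly; the default stays off, as discussed above.
void write_with_arrow_schema(cudf::table_view const& table, std::string const& path)
{
  auto options = cudf::io::parquet_writer_options::builder(cudf::io::sink_info{path}, table)
                   .write_arrow_schema(true)  // assumed setter name, per this PR's description
                   .build();
  cudf::io::write_parquet(options);
}
```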
/merge
Revert "Support `arrow:schema` in Parquet writer to faithfully roundtrip `duration` types with Arrow (rapidsai#15875)". This reverts commit 67bd366.
Description
Closes #15847
This PR adds support to construct and write a base64-encoded, serialized `arrow:schema`-type IPC message to the Parquet file footer, allowing a faithful roundtrip of `duration` types with Arrow via Parquet.

Answered:
- The `arrow:schema` is written only if requested by the user via the `store_schema` argument (cudf) or `write_arrow_schema` (libcudf), i.e. these variables default to `false` otherwise.
- The cudf option can stay `store_schema` while the libcudf option is `write_arrow_schema`, and it should be fine. This has been done to disambiguate which schema (arrow or parquet) we are talking about.
- `int96_timestamps` cannot be deprecated/removed in cuDF as Spark is actively using it. See "Remove INT96 timestamps in cuDF Parquet writer" #15901.
- `decimal32` and `decimal64` fixed types are not directly supported by Arrow, so we convert `decimal32`/`decimal64` columns to `decimal128` (a minimal widening sketch follows below).
- The `is_col_nullable()` function moved to `writer_impl_helpers.cpp` along with some other helper functions.
- `convert_data_to_decimal128` can be separated out and used in both `writer_impl.cu` and `to_arrow.cu`. Tracking in a separate issue: "[FEA] Deduplicate `convert_data_to_decimal128()` function" #16194.

CC @vuule @etseidl @nvdbaranec @GregoryKimball @galipremsagar for vis.
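As a side note on the decimal handling above, here is a minimal, host-side illustration of why the `decimal32`/`decimal64` to `decimal128` widening is lossless; the `decimal_value` struct and `widen_decimal32` helper are illustrative only and are not libcudf's internal representation.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative fixed-point representation: an unscaled integer plus a base-10 scale.
struct decimal_value {
  __int128_t unscaled;  // e.g. 12345
  int32_t scale;        // e.g. -2  ->  12345 * 10^-2 = 123.45
};

// Widen a decimal32 (int32 unscaled value) to a 128-bit decimal without changing the number.
decimal_value widen_decimal32(int32_t unscaled, int32_t scale)
{
  return {static_cast<__int128_t>(unscaled), scale};  // sign extension is exact
}

int main()
{
  auto const d = widen_decimal32(12345, -2);  // 123.45
  assert(d.unscaled == 12345 && d.scale == -2);  // same unscaled value, same scale
  return 0;
}
```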
Checklist