Support `arrow:schema` in Parquet writer to faithfully roundtrip `duration` types with Arrow #15875
Conversation
Generally looks great. Just a couple small things so far.
Python changes look good to me (just 1 comment).
Might want to have @vyasr or @galipremsagar take another look though.
great stuff!
just a few comments
switch (column.type().id()) {
  case type_id::DECIMAL32:
    // Convert data to decimal128 type
    d128_vectors.emplace_back(convert_data_to_decimal128<int32_t>(column, stream));
Merging these into a single kernel (per decimal type) would be great for performance with many decimal columns, but looks like it would not fit well into the recursive implementation.
How about a stream pool? 4-8 streams used in round-robin order might help when we have to convert many decimal columns.
Tracking this in a separate issue #16194.
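For illustration, here is a minimal sketch of the round-robin stream-pool idea discussed above. The driver function, its signature, and the forward declaration of `convert_data_to_decimal128` are hypothetical; only the helper's name and usage come from the diff above, and its actual return type may differ.

```cpp
#include <cudf/column/column_view.hpp>
#include <cudf/types.hpp>
#include <rmm/cuda_stream_pool.hpp>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_uvector.hpp>

#include <cstdint>
#include <vector>

// Assumed declaration of the helper shown in the diff above (return type may differ).
template <typename T>
rmm::device_uvector<__int128_t> convert_data_to_decimal128(cudf::column_view const& column,
                                                           rmm::cuda_stream_view stream);

// Hypothetical driver: fan decimal32/decimal64 conversions out over a small stream pool.
void convert_decimals_round_robin(std::vector<cudf::column_view> const& columns,
                                  std::vector<rmm::device_uvector<__int128_t>>& d128_vectors)
{
  rmm::cuda_stream_pool pool{8};  // 4-8 streams, as suggested above

  for (auto const& column : columns) {
    auto const stream = pool.get_stream();  // round-robin acquisition
    switch (column.type().id()) {
      case cudf::type_id::DECIMAL32:
        d128_vectors.emplace_back(convert_data_to_decimal128<int32_t>(column, stream));
        break;
      case cudf::type_id::DECIMAL64:
        d128_vectors.emplace_back(convert_data_to_decimal128<int64_t>(column, stream));
        break;
      default: break;  // decimal128 and non-decimal columns need no conversion
    }
  }

  // Join all pool streams before the writer consumes d128_vectors on its own stream.
  for (std::size_t i = 0; i < pool.get_pool_size(); ++i) {
    pool.get_stream(i).synchronize();
  }
}
```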
@@ -322,6 +322,9 @@
    output_as_binary : set, optional, default None
        If a column name is present in the set, that column will be output as
        unannotated binary, rather than the default 'UTF-8'.
    store_schema : bool, default False
Is there a reason we think this should be False by default? It seems like faithful roundtrips with Arrow would be a benefit by default. However, it seems like enabling this feature will cast / convert some data types (e.g. "days" aren't supported, and only decimal128 is supported -- if I read the rest of this PR correctly). Are those conversions potentially lossy / do they change metadata? If so, are those conversions worth documenting?
There are a couple of reasons why we chose to set it to False by default. It's actually great that you asked, since this way it gets documented here.
First, `arrow:schema` should only be used when we want to round-trip certain column types (primarily durations for now) with Arrow; otherwise, cuDF round-trips with itself and with Arrow perfectly fine.
Second, cuDF still supports INT96 timestamps, as Spark has been actively using them. Enabling `store_schema` by default would break Spark's existing and future workflows, requiring them to set it to False whenever using INT96 timestamps.
Third, as you mentioned, with `arrow:schema` enabled we can't round-trip decimal32 and decimal64 with cuDF itself without (losslessly) converting them to decimal128.
To summarize, things work perfectly fine without `arrow:schema` for the most part; it only needs to be enabled when we want to faithfully round-trip `duration` types with Arrow.
@galipremsagar please feel free to add any reasons that we discussed during the cuIO standup meeting a couple weeks ago.
Thanks! This kind of information would be good to have in the docstrings!
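For reference, a minimal sketch of opting in from libcudf, assuming the `parquet_writer_options` builder exposes the `write_arrow_schema` setter described in this PR (the Python `store_schema` argument corresponds to it); the surrounding function is illustrative only.

```cpp
#include <cudf/io/parquet.hpp>
#include <cudf/io/types.hpp>
#include <cudf/table/table_view.hpp>

#include <string>

// Opt in explicitly; the default stays off, as discussed above.
void write_with_arrow_schema(cudf::table_view const& table, std::string const& path)
{
  auto options = cudf::io::parquet_writer_options::builder(cudf::io::sink_info{path}, table)
                   .write_arrow_schema(true)  // assumed setter name, per this PR's description
                   .build();
  cudf::io::write_parquet(options);
}
```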
/merge
Revert "Support `arrow:schema` in Parquet writer to faithfully roundtrip `duration` types with Arrow (rapidsai#15875)". This reverts commit 67bd366.
Description
Closes #15847
This PR adds support to construct and write a base64-encoded, serialized `arrow:schema`-type IPC message to the Parquet file footer, allowing a faithful roundtrip of `duration` types with Arrow via Parquet.

Answered:
- The `arrow:schema` is written only if requested by the user via the `store_schema` argument (cudf) or `write_arrow_schema` (libcudf), i.e. these variables default to `false` otherwise.
- The cudf option can stay `store_schema` while the libcudf option is `write_arrow_schema`, and it should be fine. This has been done to disambiguate which schema (arrow or parquet) we are talking about.
- `int96_timestamps` cannot be deprecated/removed in cuDF as Spark is actively using it. See "Remove INT96 timestamps in cuDF Parquet writer" #15901.
- `decimal32` and `decimal64` fixed types are not directly supported by Arrow, so we convert `decimal32`/`decimal64` columns to `decimal128` (a minimal widening sketch follows below).
- The `is_col_nullable()` function moved to `writer_impl_helpers.cpp` along with some other helper functions.
- `convert_data_to_decimal128` can be separated out and used in both `writer_impl.cu` and `to_arrow.cu`. Tracking in a separate issue: "[FEA] Deduplicate `convert_data_to_decimal128()` function" #16194.

CC @vuule @etseidl @nvdbaranec @GregoryKimball @galipremsagar for vis.
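As a side note on the decimal handling above, here is a minimal, host-side illustration of why the `decimal32`/`decimal64` to `decimal128` widening is lossless; the `decimal_value` struct and `widen_decimal32` helper are illustrative only and are not libcudf's internal representation.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative fixed-point representation: an unscaled integer plus a base-10 scale.
struct decimal_value {
  __int128_t unscaled;  // e.g. 12345
  int32_t scale;        // e.g. -2  ->  12345 * 10^-2 = 123.45
};

// Widen a decimal32 (int32 unscaled value) to a 128-bit decimal without changing the number.
decimal_value widen_decimal32(int32_t unscaled, int32_t scale)
{
  return {static_cast<__int128_t>(unscaled), scale};  // sign extension is exact
}

int main()
{
  auto const d = widen_decimal32(12345, -2);  // 123.45
  assert(d.unscaled == 12345 && d.scale == -2);  // same unscaled value, same scale
  return 0;
}
```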
Checklist