Support `arrow:schema` in Parquet writer to faithfully roundtrip `duration` types with Arrow (rapidsai#15875)

Closes rapidsai#15847

This PR adds support for constructing and writing a base64-encoded, serialized `arrow:schema` IPC message to the Parquet file footer, so that `duration` types roundtrip faithfully with Arrow via Parquet.

### Answered

- [x] Only construct and write `arrow:schema` if the user asks for it via the `store_schema` argument (cudf) or `write_arrow_schema` (libcudf); both default to `false`.
- [x] The internal libcudf name for `store_schema` stays `write_arrow_schema`. This disambiguates which schema (Arrow or Parquet) is meant.
- [x] Separate PR: `int96_timestamps` cannot be deprecated or removed in cuDF because Spark actively uses it. rapidsai#15901
- [x] The cuDF Parquet writer supports `decimal32` and `decimal64` [fixed types](https://github.com/rapidsai/cudf/blob/branch-24.08/cpp/src/io/parquet/writer_impl.cu#L561). Arrow does not support these directly, so `decimal32`/`decimal64` columns will be [converted](https://github.com/rapidsai/cudf/blob/branch-24.08/cpp/src/interop/to_arrow.cu#L155) to `decimal128`.
- [x] The `is_col_nullable()` function moved to `writer_impl_helpers.cpp` along with some other helper functions.
- [x] A common `convert_data_to_decimal128` can be separated out and used in both `writer_impl.cu` and `to_arrow.cu`. Tracked in a separate issue: rapidsai#16194

CC @vuule @etseidl @nvdbaranec @GregoryKimball @galipremsagar for vis.

Authors:
  - Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
  - Thomas Li (https://github.com/lithomas1)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Bradley Dice (https://github.com/bdice)

URL: rapidsai#15875
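
For readers unfamiliar with the `decimal32`/`decimal64` to `decimal128` conversion mentioned above, here is a minimal host-side sketch of the idea in plain Python. It is not the libcudf implementation (which runs on the GPU). It assumes the unscaled decimal values are stored as little-endian two's-complement integers and that widening amounts to sign-extending each value to 16 bytes while leaving the scale untouched. The function name `widen_to_decimal128` is hypothetical and exists only for illustration.

```python
import struct


def widen_to_decimal128(raw: bytes, width: int) -> bytes:
    """Sign-extend packed little-endian integers of `width` bytes to 16 bytes each.

    Hypothetical helper for illustration only; `width` is 4 for decimal32
    storage and 8 for decimal64 storage. The decimal scale is unchanged.
    """
    if width not in (4, 8):
        raise ValueError("width must be 4 (decimal32) or 8 (decimal64)")
    if len(raw) % width != 0:
        raise ValueError("buffer length is not a multiple of the element width")

    out = bytearray()
    for offset in range(0, len(raw), width):
        # Interpret the narrow value as a signed integer ...
        value = int.from_bytes(raw[offset:offset + width], "little", signed=True)
        # ... and re-emit it as a signed 128-bit integer (sign-extended).
        out += value.to_bytes(16, "little", signed=True)
    return bytes(out)


# Three decimal32 unscaled values packed as int32: 12345, -1, 0
narrow = struct.pack("<3i", 12345, -1, 0)
wide = widen_to_decimal128(narrow, 4)
```

Each 4-byte input element becomes a 16-byte element, and negative values keep their sign because the upper bytes are filled with `0xff`.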