Remove INT96 timestamps in cuDF Parquet writer #15901
Conversation
Looks more like "Remove INT96 timestamps" than "Deprecate" 😅. I think it's fine to remove from the write path, but think cudf should still be capable of reading it, even if it's always just turned into a timestamp_ns.
Thanks for this (IMO long overdue) update! Looks good. 👍
/ok to test
/ok to test
Hi @nvdbaranec, I wanted to bring your attention to the changes in this PR (it removes INT96 timestamps from cuDF's Parquet writer) and ask whether it would create any waves across Spark functionality or testing?
Closing the PR as we need INT96 timestamps for Spark. |
…ation` types with Arrow (#15875)

Closes #15847. This PR adds support to construct and write a base64-encoded serialized `arrow:schema`-type IPC message to the Parquet file footer, to allow a faithful roundtrip with Arrow via Parquet for the `duration` type.

### Answered
- [x] Only construct and write `arrow:schema` if asked by the user via the `store_schema` argument (cuDF) or `write_arrow_schema` (libcudf); i.e., default these variables to `false` otherwise.
- [x] The internal/libcudf variable name for `store_schema` can stay `write_arrow_schema`. This has been done to disambiguate which schema (Arrow or Parquet) we are talking about.
- [x] Separate PR: `int96_timestamps` cannot be deprecated/removed in cuDF as Spark is actively using it. #15901
- [x] The cuDF Parquet writer supports `decimal32` and `decimal64` [fixed types](https://github.com/rapidsai/cudf/blob/branch-24.08/cpp/src/io/parquet/writer_impl.cu#L561). These are not directly supported by Arrow, so we [convert](https://github.com/rapidsai/cudf/blob/branch-24.08/cpp/src/interop/to_arrow.cu#L155) `decimal32`/`decimal64` columns to `decimal128`.
- [x] `is_col_nullable()` moved to `writer_impl_helpers.cpp` along with some other helper functions.
- [x] A common `convert_data_to_decimal128` can be separated out and used in `writer_impl.cu` and `to_arrow.cu`. Tracking in a separate issue. #16194

CC @vuule @etseidl @nvdbaranec @GregoryKimball @galipremsagar for vis.

Authors:
- Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
- Thomas Li (https://github.com/lithomas1)
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Vukasin Milovanovic (https://github.com/vuule)
- Bradley Dice (https://github.com/bdice)

URL: #15875
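The footer-embedding step described above can be sketched with the standard library. This is an illustrative sketch only, not cuDF's implementation: the payload bytes are a placeholder (real bytes would come from Arrow's IPC schema serializer), while `ARROW:schema` is the well-known footer key Arrow uses for this purpose.

```python
import base64

# Placeholder for a serialized Arrow IPC schema message; in practice these
# bytes are produced by Arrow's IPC serializer, not written by hand.
serialized_schema = b"\xff\xff\xff\xff...example-ipc-schema-bytes"

# Parquet footer key-value metadata holds strings, so the binary IPC
# message is base64-encoded under the "ARROW:schema" key.
footer_metadata = {
    "ARROW:schema": base64.b64encode(serialized_schema).decode("ascii"),
}

# A reader reverses the encoding to recover the original IPC message bytes.
recovered = base64.b64decode(footer_metadata["ARROW:schema"])
assert recovered == serialized_schema
```

Because the schema travels as ordinary footer metadata, any Parquet reader that ignores the key still reads the file normally; only Arrow-aware readers use it to restore types (such as `duration`) that plain Parquet cannot represent.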
Description
Partially closes #15847.
Closes #10438
This PR removes INT96 timestamps from cuDF's Parquet writer, in accordance with Arrow. For backward compatibility and robustness, the Parquet reader retains the capability to read INT96 timestamps and convert them to INT64. For more discussion, see #15875.
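The read-path conversion mentioned above can be sketched as follows. A Parquet INT96 timestamp is 12 bytes: an INT64 count of nanoseconds within the day followed by an INT32 Julian day number. The helper name below is illustrative, not cuDF's actual code; the epoch constant 2440588 is the Julian day number of 1970-01-01.

```python
import struct

JULIAN_EPOCH_DAY = 2440588  # Julian day number of 1970-01-01 (Unix epoch)
NANOS_PER_DAY = 86_400 * 1_000_000_000

def int96_to_int64_ns(raw: bytes) -> int:
    """Convert a 12-byte Parquet INT96 timestamp to INT64 ns since epoch."""
    # Little-endian: 8-byte nanoseconds-of-day, then 4-byte Julian day.
    nanos_in_day, julian_day = struct.unpack("<qi", raw)
    return (julian_day - JULIAN_EPOCH_DAY) * NANOS_PER_DAY + nanos_in_day

# One second past midnight UTC on 2024-01-01 (Julian day 2460311):
raw = struct.pack("<qi", 1_000_000_000, 2460311)
print(int96_to_int64_ns(raw))  # 1704067201000000000
```

Note that an INT64 nanosecond timestamp covers a narrower range of dates than INT96, which is one reason writers moved away from INT96 only after readers kept this fallback.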
Discussion
Answered
CC @vuule @etseidl @nvdbaranec @GregoryKimball @galipremsagar for vis.
Checklist