Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove INT96 timestamps in cuDF Parquet writer #15901

Closed

Conversation

mhaseeb123
Copy link
Member

@mhaseeb123 mhaseeb123 commented May 31, 2024

Description

Partially closes #15847.
Closes #10438

This PR remonves INT96 timestamps from cuDF in accordance with Arrow. For backward compatibility and robustness, the Parquet reader still has the capability to read INT96 timestamps and convert to INT64. For more discussion, see #15875.

Discussion

  • Would Spark functionality or tests be affected by this change? if yes, we can postpone this PR. (@nvdbaranec)

Answered

  • Following @etseidl's suggestion, PQ reader's capability to read and convert INT96 timestamps has been reverted to allow reading older parquet files with INT96 timestamp data.

CC @vuule @etseidl @nvdbaranec @GregoryKimball @galipremsagar for vis.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mhaseeb123 mhaseeb123 self-assigned this May 31, 2024
@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. labels May 31, 2024
@mhaseeb123 mhaseeb123 added 2 - In Progress Currently a work in progress breaking Breaking change cuIO cuIO issue Spark Functionality that helps Spark RAPIDS cudf.pandas Issues specific to cudf.pandas improvement Improvement / enhancement to an existing function and removed cudf.pandas Issues specific to cudf.pandas labels May 31, 2024
Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks more like "Remove INT96 timestamps" than "Deprecate" 😅. I think it's fine to remove from the write path, but think cudf should still be capable of reading it, even if it's always just turned into a timestamp_ns.

cpp/include/cudf/io/parquet_metadata.hpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/decode_fixed.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/parquet_common.hpp Outdated Show resolved Hide resolved
@etseidl
Copy link
Contributor

etseidl commented May 31, 2024

Thanks for this (IMO long overdue) update! Looks good. 👍

@github-actions github-actions bot added the Java Affects Java cuDF API. label Jun 1, 2024
Copy link

copy-pr-bot bot commented Jun 3, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@mhaseeb123
Copy link
Member Author

/ok to test

@mhaseeb123 mhaseeb123 changed the title Deprecate INT96 timestamps in cuDF Remove INT96 timestamps in cuDF Parquet writer Jun 3, 2024
@github-actions github-actions bot added the pylibcudf Issues specific to the pylibcudf package label Jun 4, 2024
@mhaseeb123
Copy link
Member Author

/ok to test

@mhaseeb123
Copy link
Member Author

Hi @nvdbaranec, I wanted to bring your attention to the changes in this PR (removes INT96 timestamps in cuDF) and wanted to ask if it would create any waves across Spark functionality or testing?

@mhaseeb123 mhaseeb123 added 5 - DO NOT MERGE Hold off on merging; see PR for details and removed 2 - In Progress Currently a work in progress labels Jun 4, 2024
@mhaseeb123 mhaseeb123 closed this Jun 5, 2024
@mhaseeb123
Copy link
Member Author

mhaseeb123 commented Jun 5, 2024

Closing the PR as we need INT96 timestamps for Spark.

rapids-bot bot pushed a commit that referenced this pull request Jul 9, 2024
…ation` types with Arrow (#15875)

Closes #15847

This PR adds the support to construct and write base64-encoded serialized `arrow:schema`-type IPC message to parquet file footer to allow faithfully roundtrip with Arrow via Parquet for `duration` type.

### Answered
- [x] Only construct and write `arrow:schema` if  asked by the user via `store_schema` argument (cudf) or `write_arrow_schema` (libcudf). i.e. Default these variables to `false` otherwise.
- [x] The internal/libcudf variable name for `store_schema` can stay `write_arrow_schema` and it should be fine. This has been done to disambiguate which schema (arrow or parquet) we are talking about.
- [x] Separate PR: `int96_timestamps` cannot be deprecated/removed in cuDF as Spark is actively using it. #15901 
- [x] cuDF Parquet writer supports `decimal32` and `decimal64` [fixed types](https://github.com/rapidsai/cudf/blob/branch-24.08/cpp/src/io/parquet/writer_impl.cu#L561). These are not directly supported by Arrow so we will [convert](https://github.com/rapidsai/cudf/blob/branch-24.08/cpp/src/interop/to_arrow.cu#L155) `decimal32/decimal64` columns to `decimal128`.
- [x] `is_col_nullable()` function moved to `writer_impl_helpers.cpp` along with some other helper functions.
- [x] A common `convert_data_to_decimal128` can be separated out and used in `writer_impl.cu` and `to_arrow.cu`. Tracking in a separate issue. #16194 

CC @vuule @etseidl @nvdbaranec @GregoryKimball @galipremsagar for vis.

Authors:
  - Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
  - Thomas Li (https://github.com/lithomas1)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Bradley Dice (https://github.com/bdice)

URL: #15875
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - DO NOT MERGE Hold off on merging; see PR for details breaking Breaking change cuIO cuIO issue improvement Improvement / enhancement to an existing function Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. pylibcudf Issues specific to the pylibcudf package Python Affects Python cuDF API. Spark Functionality that helps Spark RAPIDS
Projects
Status: Done
2 participants