Remove INT96 timestamps in cuDF Parquet writer #15901

mhaseeb123 · 2024-05-31T19:13:12Z

Description

Partially closes #15847.
Closes #10438

This PR removes INT96 timestamps from cuDF in accordance with Arrow. For backward compatibility and robustness, the Parquet reader can still read INT96 timestamps and convert them to INT64. For more discussion, see #15875.
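To illustrate what reading INT96 and converting to INT64 involves, here is a minimal sketch in plain Python. It is not cuDF's implementation; the function name `int96_to_int64_nanos` and the constants are hypothetical names chosen for this example. It assumes the legacy Parquet INT96 layout: 8 little-endian bytes holding nanoseconds within the day, followed by 4 little-endian bytes holding the Julian day number.

```python
import struct

# Julian day number of 1970-01-01, the Unix epoch.
JULIAN_DAY_OF_UNIX_EPOCH = 2_440_588
NANOS_PER_DAY = 86_400 * 1_000_000_000

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def int96_to_int64_nanos(raw: bytes) -> int:
    """Convert one 12-byte INT96 timestamp to nanoseconds since the Unix epoch.

    Hypothetical helper for illustration only; assumes the layout described
    in the lead-in (nanoseconds of day, then Julian day, both little-endian).
    """
    if len(raw) != 12:
        raise ValueError(f"INT96 value must be 12 bytes, got {len(raw)}")

    # "<qi": little-endian, no padding, int64 followed by int32 (12 bytes).
    nanos_of_day, julian_day = struct.unpack("<qi", raw)

    days_since_epoch = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    nanos = days_since_epoch * NANOS_PER_DAY + nanos_of_day

    # INT96 covers a wider range than a 64-bit nanosecond count, so check.
    if not INT64_MIN <= nanos <= INT64_MAX:
        raise OverflowError("timestamp does not fit in INT64 nanoseconds")
    return nanos


# Usage: one nanosecond past midnight on 1970-01-02.
raw = struct.pack("<qi", 1, JULIAN_DAY_OF_UNIX_EPOCH + 1)
print(int96_to_int64_nanos(raw))
```

The range check matters because an INT96 value can represent dates far outside what a 64-bit nanosecond count can hold, which is one reason the conversion is lossy for extreme values.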

Discussion

Would Spark functionality or tests be affected by this change? If so, we can postpone this PR. (@nvdbaranec)

Answered

Following @etseidl's suggestion, the Parquet reader's ability to read and convert INT96 timestamps has been restored, so older Parquet files with INT96 timestamp data can still be read.

CC @vuule @etseidl @nvdbaranec @GregoryKimball @galipremsagar for vis.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

etseidl

Looks more like "Remove INT96 timestamps" than "Deprecate" 😅. I think it's fine to remove it from the write path, but I think cudf should still be capable of reading it, even if it's always just turned into a timestamp_ns.

cpp/include/cudf/io/parquet_metadata.hpp

cpp/src/io/parquet/decode_fixed.cu

cpp/src/io/parquet/parquet_common.hpp

etseidl · 2024-05-31T23:16:08Z

Thanks for this (IMO long overdue) update! Looks good. 👍

copy-pr-bot · 2024-06-03T18:36:01Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

mhaseeb123 · 2024-06-03T19:58:21Z

/ok to test

mhaseeb123 · 2024-06-04T08:40:14Z

/ok to test

mhaseeb123 · 2024-06-04T19:58:58Z

Hi @nvdbaranec, I wanted to bring your attention to the changes in this PR (removes INT96 timestamps in cuDF) and wanted to ask if it would create any waves across Spark functionality or testing?

mhaseeb123 · 2024-06-05T17:06:17Z

Closing the PR as we need INT96 timestamps for Spark.

@vuule

…ation` types with Arrow (#15875)

Closes #15847

This PR adds the support to construct and write base64-encoded serialized `arrow:schema`-type IPC message to parquet file footer to allow faithfully roundtrip with Arrow via Parquet for `duration` type.

### Answered

- [x] Only construct and write `arrow:schema` if asked by the user via `store_schema` argument (cudf) or `write_arrow_schema` (libcudf). i.e. Default these variables to `false` otherwise.
- [x] The internal/libcudf variable name for `store_schema` can stay `write_arrow_schema` and it should be fine. This has been done to disambiguate which schema (arrow or parquet) we are talking about.
- [x] Separate PR: `int96_timestamps` cannot be deprecated/removed in cuDF as Spark is actively using it. #15901
- [x] cuDF Parquet writer supports `decimal32` and `decimal64` [fixed types](https://github.com/rapidsai/cudf/blob/branch-24.08/cpp/src/io/parquet/writer_impl.cu#L561). These are not directly supported by Arrow so we will [convert](https://github.com/rapidsai/cudf/blob/branch-24.08/cpp/src/interop/to_arrow.cu#L155) `decimal32/decimal64` columns to `decimal128`.
- [x] `is_col_nullable()` function moved to `writer_impl_helpers.cpp` along with some other helper functions.
- [x] A common `convert_data_to_decimal128` can be separated out and used in `writer_impl.cu` and `to_arrow.cu`. Tracking in a separate issue. #16194

CC @vuule @etseidl @nvdbaranec @GregoryKimball @galipremsagar for vis.

Authors:
- Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
- Thomas Li (https://github.com/lithomas1)
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Vukasin Milovanovic (https://github.com/vuule)
- Bradley Dice (https://github.com/bdice)

URL: #15875

Deprecate INT96 timestamps

d192fc4

mhaseeb123 self-assigned this May 31, 2024

github-actions bot added libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. labels May 31, 2024

mhaseeb123 mentioned this pull request May 31, 2024

[FEA] Support arrow:Schema in Parquet writer for faithful roundtrip with Arrow via Parquet #15847

Closed

mhaseeb123 mentioned this pull request May 31, 2024

Support arrow:schema in Parquet writer to faithfully roundtrip duration types with Arrow #15875

Merged

9 tasks

mhaseeb123 and others added 2 commits May 31, 2024 19:20

update copyrights

38bd46f

Merge branch 'branch-24.08' into deprecate-int96-timestamps

b5b173d

etseidl reviewed May 31, 2024


mhaseeb123 added 2 commits May 31, 2024 22:58

Revert reading int96 timestamps

68f3db5

revert int96 into the PQ reader's path

85503ec

mhaseeb123 and others added 2 commits May 31, 2024 18:43

Update column stats check to take care of the index column

8148b2f

ruff format the updated test

a24fdf3

mhaseeb123 mentioned this pull request Jun 1, 2024

[BUG] Incorrect statistics for int96 timestamps in parquet #10438

Closed

mhaseeb123 and others added 3 commits June 1, 2024 01:53

remove erroneous print

a99d088

Remove erroneous store_schema argument

75ab3b7

remove int96 timestamps from cudf java tests

ccf56e0

github-actions bot added the Java Affects Java cuDF API. label Jun 1, 2024

mhaseeb123 and others added 4 commits June 2, 2024 19:40

remove int96 timestamps from java

239101f

Merge branch 'branch-24.08' into deprecate-int96-timestamps

52cf37b

fix the constructor for ColumnWriterOptions numeric and timestamp types

f16f3e3

Merge branch 'deprecate-int96-timestamps' of https://github.com/mhase…

77ab302

…eb123/cudf into deprecate-int96-timestamps

Merge branch 'branch-24.08' into deprecate-int96-timestamps

1855032

mhaseeb123 changed the title ~~Deprecate INT96 timestamps in cuDF~~ Remove INT96 timestamps in cuDF Parquet writer Jun 3, 2024

hopefully the final fix for java tests

51e785b

github-actions bot added the pylibcudf Issues specific to the pylibcudf package label Jun 4, 2024

mhaseeb123 added 5 - DO NOT MERGE Hold off on merging; see PR for details and removed 2 - In Progress Currently a work in progress labels Jun 4, 2024

mhaseeb123 closed this Jun 5, 2024
