Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Parquet writer nullable option application to single table writes #12933

Merged

Conversation

vuule
Copy link
Contributor

@vuule vuule commented Mar 13, 2023

Description

When writing multiple tables into a single Parquet file, users can run into issues if a column does not have nulls in the first table, but have some in other tables. The nullable member of column_in_metadata was originally added to address this and allow users to enforce nullability of columns from multiple tables. Because of this, the nullable option is only applied to chunked writes.

Recently, a different use for the option has been identified, where tables are stored into individual Parquet files, which are later read and the read tables are concatenated. Without the option to enforce nullability, Parquet files can end up with different nullabilities, i.e. different schemas, causing concatenation to fail.

This PR allows the nullable option to apply to single writes as well. The write call throws if user tried to write a column with nulls as non-nullable.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@vuule vuule added feature request New feature or request cuIO cuIO issue non-breaking Non-breaking change labels Mar 13, 2023
@vuule vuule self-assigned this Mar 13, 2023
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Mar 13, 2023
@vuule vuule changed the title Make Parquet writer nullability option application to single table writes Make Parquet writer nullable option application to single table writes Mar 13, 2023
@vuule vuule marked this pull request as ready for review March 13, 2023 22:27
@vuule vuule requested a review from a team as a code owner March 13, 2023 22:27
@vuule vuule requested review from robertmaynard and ttnghia March 13, 2023 22:27
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This behavior change makes sense. Is it "breaking"? Code looks good.

@vuule
Copy link
Contributor Author

vuule commented Mar 15, 2023

This behavior change makes sense. Is it "breaking"? Code looks good.

The Parquet writer used to throw when called with the nullable option, and with this PR it will work. I wouldn't call that breaking, as no user will have to adjust their code because of this change.

@vuule
Copy link
Contributor Author

vuule commented Mar 15, 2023

/merge

@rapids-bot rapids-bot bot merged commit 6d264b2 into rapidsai:branch-23.04 Mar 15, 2023
@vuule vuule deleted the pq-single-write-force-nullable branch March 15, 2023 21:51
rapids-bot bot pushed a commit that referenced this pull request Mar 22, 2023
Requires: #12933

This PR adds `nullability` parameter to parquet writer. When it is `True`, all columns are written as `null` in the schema. When `False`, all columns are written as `not null` in the schema, however, if a column contains null values, this parameter is ignored.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Bradley Dice (https://github.com/bdice)

URL: #12952
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants