Create table_input_metadata from a table_metadata #13920

Merged
merged 27 commits into from
Aug 30, 2023

Conversation

etseidl
Contributor

@etseidl etseidl commented Aug 18, 2023

Description

When round-tripping data through cuDF (e.g. reading a Parquet file with read_parquet(), then writing slices with the chunked_parquet_writer), column nullability information can be lost. The Parquet writers accept a table_input_metadata object as an optional parameter, and this object can be used to preserve nullability. Creating the table_input_metadata can be a challenge, however. This PR addresses the problem by adding the ability to create a table_input_metadata from the table_metadata returned by read_parquet().
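
To illustrate how nullability might get lost on a round trip, here is a toy model. Everything in it is an assumption made for illustration: `column_slice`, `writer_column_hint`, and `written_as_nullable` are hypothetical names, and the fallback rule (infer nullability from the nulls present in the slice) is a stand-in, not libcudf's actual logic.

```cpp
#include <optional>
#include <string>

// Hypothetical stand-in for a slice of a column: only the facts this sketch needs.
struct column_slice {
  std::string name;
  int null_count;  // number of nulls present in this particular slice
};

// Hypothetical stand-in for per-column writer metadata.
struct writer_column_hint {
  std::optional<bool> nullable;  // unset means "let the writer decide"
};

// Assumed fallback rule (NOT libcudf's actual logic): without a hint, the
// column is written as nullable only if this slice happens to contain nulls.
bool written_as_nullable(column_slice const& col, writer_column_hint const& hint)
{
  return hint.nullable.value_or(col.null_count > 0);
}
```

Under that assumed rule, a slice of a nullable column that happens to contain no nulls would be written as non-nullable unless the hint is set explicitly, which is the kind of loss that carrying metadata over from the reader is meant to avoid.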

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@etseidl etseidl requested a review from a team as a code owner August 18, 2023 21:47
@rapids-bot

rapids-bot bot commented Aug 18, 2023

Pull requests from external contributors require approval from a rapidsai organization member with write permissions or greater before CI can begin.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Aug 18, 2023
@vuule vuule added feature request New feature or request non-breaking Non-breaking change labels Aug 18, 2023
cpp/src/io/functions.cpp (outdated, resolved)
@vuule vuule added breaking Breaking change and removed non-breaking Non-breaking change labels Aug 22, 2023
@vuule
Contributor

vuule commented Aug 22, 2023

/ok to test

@etseidl etseidl changed the title Create table_input_metadata from a table_with_metadata Create table_input_metadata from a table_metadata Aug 23, 2023
Contributor

@vuule vuule left a comment

Looks good, aside from a single detail.
Bit sad about column_name_info containing nullability. However, it looks like column_name_info and related types will change a lot, and this awkward naming is temporary :)

cpp/include/cudf/io/types.hpp (outdated, resolved)

switch (buffer.type.id()) {
  case type_id::STRING:
    if (schema.value_or(reader_column_schema{}).is_enabled_convert_binary_to_strings()) {
      if (schema_info != nullptr) {
        schema_info->children.push_back(column_name_info{"offsets"});
        schema_info->children.push_back(column_name_info{"chars"});
        schema_info->children.push_back(column_name_info{"offsets", false});
Contributor

default value in the ctor would clean up these uses (see other comment).

Contributor Author

I think we want these specified explicitly. At least in the past these extra children have been non-nullable. Could be convinced otherwise.

Contributor Author

OK, removed the nullability on the cudf-only bits. Had to change the test, but the output schema still seems fine.
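
The suggestion in this thread (a default value in the constructor, so that cudf-only children such as "offsets" need not state nullability) can be sketched roughly as follows. `name_info` is a hypothetical stand-in; the real `column_name_info` is defined in cpp/include/cudf/io/types.hpp and may differ in its members and signature.

```cpp
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for column_name_info; not the libcudf definition.
struct name_info {
  std::string name;
  std::optional<bool> is_nullable;  // unset: the reader makes no claim
  std::vector<name_info> children;

  // Defaulting the second argument keeps call sites such as
  // name_info{"offsets"} terse while still allowing name_info{"x", false}.
  name_info(std::string n, std::optional<bool> nullable = std::nullopt)
    : name(std::move(n)), is_nullable(nullable)
  {
  }
};
```

With a default like this, an unset value can be told apart from an explicit `false`, so a consumer is free to treat "not specified" differently from "known to be non-nullable".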

etseidl and others added 2 commits August 24, 2023 10:40
Co-authored-by: Vukasin Milovanovic
@vuule
Contributor

vuule commented Aug 24, 2023

/ok to test

@copy-pr-bot

copy-pr-bot bot commented Aug 29, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@vuule vuule added the 3 - Ready for Review Ready for review by team label Aug 29, 2023
@vuule
Contributor

vuule commented Aug 29, 2023

/ok to test

Contributor

@hyperbolic2346 hyperbolic2346 left a comment

A very small nit related to lambda names, nothing to hold up the review. This seems very useful for round-trips.

auto const& names = metadata.schema_info;

// Create a metadata hierarchy with naming and nullability using `table_and_metadata`
std::function<column_in_metadata(column_name_info const&)> get_children =
Contributor

Very nitpicky here, but get_children doesn't just get children. It also sets nullability on the metadata. Maybe something like process_node or set_nullability?

Contributor Author

Cut and paste strikes again 😅 I like process_node
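
The fragment under review is cut off after the `std::function` declaration. A self-contained sketch of the recursive pattern it suggests, using the process_node name agreed on here, might look like the following. The structs `name_info` and `writer_column` and the function `to_writer_metadata` are hypothetical stand-ins for `column_name_info`, `column_in_metadata`, and the new constructor; this is not the merged implementation.

```cpp
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for column_name_info (reader-side schema node).
struct name_info {
  std::string name;
  std::optional<bool> is_nullable;
  std::vector<name_info> children;
};

// Hypothetical stand-in for column_in_metadata (writer-side schema node).
struct writer_column {
  std::string name;
  std::optional<bool> nullable;
  std::vector<writer_column> children;
};

// Copies names and nullability from the reader-side tree into a writer-side tree.
std::vector<writer_column> to_writer_metadata(std::vector<name_info> const& schema)
{
  // Declaring the lambda through std::function lets it call itself by name.
  std::function<writer_column(name_info const&)> process_node =
    [&process_node](name_info const& node) {
      writer_column out{node.name, node.is_nullable, {}};
      for (auto const& child : node.children) {
        out.children.push_back(process_node(child));
      }
      return out;
    };

  std::vector<writer_column> result;
  for (auto const& top : schema) {
    result.push_back(process_node(top));
  }
  return result;
}
```

Because each node does more than collect its children (it also carries nullability across), process_node describes the lambda better than get_children did.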

@hyperbolic2346
Contributor

/ok to test

@vuule
Contributor

vuule commented Aug 29, 2023

/ok to test

@vuule
Contributor

vuule commented Aug 30, 2023

/ok to test

@vuule vuule added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Aug 30, 2023
@vuule
Contributor

vuule commented Aug 30, 2023

/merge

@rapids-bot rapids-bot bot merged commit 8978a21 into rapidsai:branch-23.10 Aug 30, 2023
@etseidl etseidl deleted the feature/input_metadata branch August 30, 2023 21:22
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge breaking Breaking change feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.
4 participants