-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create table_input_metadata from a table_metadata #13920
Create table_input_metadata from a table_metadata #13920
Conversation
Pull requests from external contributors require approval from a |
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, aside from a single detail.
Bit sad about column_name_info
containing nullability. However, it looks like column_name_info
and related types will change a lot, and this awkward naming is temporary :)
|
||
switch (buffer.type.id()) { | ||
case type_id::STRING: | ||
if (schema.value_or(reader_column_schema{}).is_enabled_convert_binary_to_strings()) { | ||
if (schema_info != nullptr) { | ||
schema_info->children.push_back(column_name_info{"offsets"}); | ||
schema_info->children.push_back(column_name_info{"chars"}); | ||
schema_info->children.push_back(column_name_info{"offsets", false}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
default value in the ctor would clean up these uses (see other comment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we want these specified explicitly. At least in the past these extra children have been non-nullable. Could be convinced otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, removed the nullabilty on cudf-only bits...had to change the test but output schema still seems fine
Co-authored-by: Vukasin Milovanovic <[email protected]>
/ok to test |
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A very small nit related to lambda names, nothing to hold up the review. This seems very useful for round-trips.
cpp/src/io/functions.cpp
Outdated
auto const& names = metadata.schema_info; | ||
|
||
// Create a metadata hierarchy with naming and nullability using `table_and_metadata` | ||
std::function<column_in_metadata(column_name_info const&)> get_children = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nit picking here, but get_children
doesn't just get children. It is also setting nullability on the metadata. Maybe something like process_node
or set_nullability
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cut and paste strikes again 😅 I like process_node
/ok to test |
/ok to test |
/ok to test |
/merge |
Description
When round-tripping data through cuDF (e.g. read a parquet file with
read_parquet()
, then write slices using thechunked_parquet_writer
) column nullability information can be lost. The parquet writers will accept atable_input_metadata
object as an optional parameter, and this object can be used to preserve the nullability. Creating thetable_input_metadata
can be a challenge, however. This PR addresses this problem by adding the ability to create atable_input_metadata
using thetable_metadata
returned byread_parquet()
.Checklist