[FEA] Refactor schema generation in parquet writer #6989
devavret added the `feature request` (New feature or request) and `Needs Triage` (Need team to review and classify) labels on Dec 11, 2020
harrism added the `cuIO` (cuIO issue) and `libcudf` (Affects libcudf (C++/CUDA) code.) labels and removed the `Needs Triage` (Need team to review and classify) label on Dec 13, 2020
This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.
rapids-bot bot pushed a commit that referenced this issue on Mar 19, 2021
### Adds struct writing ability to parquet writer

The internals of the writer have been changed in the following way:

- Previously we would construct `parquet_column_view` from the cudf columns and the input options and use it to construct the schema. Now we construct the schema directly from the input cudf columns and the input options.
- The constructed schema is used to generate views of cudf columns which have a single child hierarchy, e.g. one `struct<int, float>` column is converted into two columns: `struct<int>` and `struct<float>`. Each of these columns results in a separate `parquet_column_view` which is used only for encoding.
- In order to give the user finer control over the per-column options, the old metadata class is replaced by `table_input_metadata`.

#### Breaking change: Input metadata

The new input metadata class `table_input_metadata` contains a vector of `column_in_metadata`, each of which in turn contains a vector of `column_in_metadata`, one for each child of the input column. It can be constructed using the input table, and then specific options can be changed for each level.

For a table with a single struct column

```
Struct<is_human:bool (non-nullable),
       Struct<weight:float>,
              age:int
             > (nullable)
      > (non-nullable)
```

we can set the per-level names and optional nullability as follows:

```c++
cudf::io::table_input_metadata metadata(table);
metadata.column_metadata[0].set_name("being").set_nullability(false);
metadata.column_metadata[0].child(0).set_name("human?").set_nullability(false);
metadata.column_metadata[0].child(1).set_name("particulars");
metadata.column_metadata[0].child(1).child(0).set_name("weight");
metadata.column_metadata[0].child(1).child(1).set_name("age");
```

#### Related issues

Closes #6989
Closes #6816

Strangely, there isn't an issue asking for struct writing support.

Authors:
- Devavret Makkar (@devavret)
- Kumar Aatish (@kaatish)

Approvers:
- Vukasin Milovanovic (@vuule)
- @nvdbaranec
- GALI PREM SAGAR (@galipremsagar)

URL: #7461
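
To illustrate the "single child hierarchy" idea from the commit message, here is a minimal standalone sketch. It does not use cudf: `Node`, `render`, `collect`, and `single_child_views` are hypothetical names invented for this example, and the output is just a string per leaf rather than a real column view. It only shows how one nested column can be split into one root-to-leaf chain per leaf.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a nested column: a type name plus children.
// This is NOT a cudf type; it only models the shape of the hierarchy.
struct Node {
  std::string type;
  std::vector<Node> children;
};

// Render one root-to-leaf chain, e.g. {"struct", "int"} -> "struct<int>".
// Assumes the chain is non-empty.
std::string render(const std::vector<std::string>& chain) {
  std::string out;
  for (const auto& t : chain) {
    if (!out.empty()) { out += "<"; }
    out += t;
  }
  out.append(chain.size() - 1, '>');
  return out;
}

// Depth-first walk that emits one chain per leaf.
void collect(const Node& n,
             std::vector<std::string>& chain,
             std::vector<std::string>& out) {
  chain.push_back(n.type);
  if (n.children.empty()) {
    out.push_back(render(chain));
  } else {
    for (const auto& c : n.children) { collect(c, chain, out); }
  }
  chain.pop_back();
}

// Split one nested column into single-child-hierarchy descriptions.
std::vector<std::string> single_child_views(const Node& root) {
  std::vector<std::string> chain;
  std::vector<std::string> out;
  collect(root, chain, out);
  return out;
}
```

Under these assumptions, a `struct<int, float>` node yields the two entries `struct<int>` and `struct<float>`, matching the example in the commit message.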
The schema generation in the parquet writer has gradually improved over the last few PRs, but it is bound to grow more complex with the addition of input schema (#6862, #6816) and struct writing support. The schema generation should be moved out of the main `write_chunk()` function and into an intermediate class that bridges between `parquet_column_view` and `cudf::column`. Notably, the `list_schema` should be constructed by recursion rather than iteration.
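
As a rough illustration of building the schema by recursion, the sketch below walks a nested column tree and appends one flat schema entry per node in depth-first order. It is not the libcudf implementation: `Column`, `SchemaElement`, `add_schema`, and `build_schema` are hypothetical stand-ins, and the generated names (`_col0`, `_col0.0`, ...) are placeholders chosen for this example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for an input column (not cudf::column).
struct Column {
  std::string type;              // e.g. "int32", "list", "struct"
  bool nullable;
  std::vector<Column> children;  // empty for leaf columns
};

// Hypothetical flat schema entry, loosely modeled on a Parquet schema element.
struct SchemaElement {
  std::string name;
  std::string type;
  bool nullable;
  int parent_idx;    // index of the parent entry, -1 for the root
  int num_children;
};

// Append `col` and all of its descendants in depth-first pre-order.
// The recursion handles any nesting depth (list of list, struct of list, ...)
// without a hand-maintained stack or per-level loop.
void add_schema(const Column& col,
                const std::string& name,
                int parent_idx,
                std::vector<SchemaElement>& schema) {
  const int my_idx = static_cast<int>(schema.size());
  schema.push_back({name, col.type, col.nullable, parent_idx,
                    static_cast<int>(col.children.size())});
  for (std::size_t i = 0; i < col.children.size(); ++i) {
    add_schema(col.children[i], name + "." + std::to_string(i), my_idx, schema);
  }
}

// Build the whole schema: a root entry followed by every column subtree.
std::vector<SchemaElement> build_schema(const std::vector<Column>& table) {
  std::vector<SchemaElement> schema;
  schema.push_back({"schema", "root", false, -1, static_cast<int>(table.size())});
  for (std::size_t i = 0; i < table.size(); ++i) {
    add_schema(table[i], "_col" + std::to_string(i), 0, schema);
  }
  return schema;
}
```

Keeping a function like this in a small intermediate class, rather than inline in `write_chunk()`, is one way to realize the separation proposed above. Each entry records its parent index, so a later encoding step can recover the hierarchy from the flat vector.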