[FEA] Refactor schema generation in parquet writer #6989

Closed
devavret opened this issue Dec 11, 2020 · 3 comments · Fixed by #7461
Labels: cuIO, feature request, libcudf

Comments

@devavret
Contributor

Schema generation in the parquet writer has gradually improved over the last few PRs, but it is bound to grow more complex with the addition of input schema (#6862, #6816) and struct writing support. It should be moved out of the main write_chunk() function and into an intermediate class that bridges parquet_column_view and cudf::column. Notably, list_schema should be constructed by recursion rather than iteration.
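
To illustrate the recursion-over-iteration point, here is a minimal sketch, not the writer's actual code. `ColumnNode`, `SchemaEntry`, `append_schema` and `build_schema` are hypothetical names invented for this example; the flattened depth-first layout is only loosely modelled on how Parquet lists its schema elements.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a nested column; not a libcudf class.
struct ColumnNode {
  std::string name;
  std::vector<ColumnNode> children;  // empty for leaf columns
};

// Hypothetical flattened schema entry (assumption for illustration only).
struct SchemaEntry {
  std::string name;
  int num_children;
  int parent_index;  // -1 for the root
};

// Appends `node`, then all of its descendants, in depth-first order.
void append_schema(ColumnNode const& node, int parent_index, std::vector<SchemaEntry>& out)
{
  int const self_index = static_cast<int>(out.size());
  out.push_back({node.name, static_cast<int>(node.children.size()), parent_index});
  for (auto const& child : node.children) {
    append_schema(child, self_index, out);  // recursion handles any nesting depth
  }
}

std::vector<SchemaEntry> build_schema(ColumnNode const& root)
{
  std::vector<SchemaEntry> schema;
  append_schema(root, -1, schema);
  assert(!schema.empty());  // the root itself always contributes one entry
  return schema;
}
```

Because each call only deals with one node and its direct children, lists of structs, structs of lists and deeper nestings need no extra bookkeeping beyond the parent index.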

@devavret added the "feature request" and "Needs Triage" labels Dec 11, 2020
@harrism added the "cuIO" and "libcudf" labels and removed the "Needs Triage" label Dec 13, 2020
@devavret self-assigned this Dec 17, 2020
@github-actions

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

@github-actions bot added the "stale" label Feb 16, 2021
@sameerz
Contributor

sameerz commented Feb 18, 2021

@devavret can you confirm this is being worked on in 0.19? Related: #6816

@devavret
Contributor Author

> @devavret can you confirm this is being worked on in 0.19? Related: #6816

Yes

rapids-bot bot pushed a commit that referenced this issue Mar 19, 2021
### Adds struct writing ability to parquet writer.
The internals of the writer have been changed in the following way:
- Previously we constructed `parquet_column_view` from the cudf columns and the input options and used it to construct the schema. Now we construct the schema directly from the input cudf columns and the input options.
- The constructed schema is used to generate views of cudf columns which have a single child hierarchy. For example, one `struct<int, float>` column is converted into two columns: `struct<int>` and `struct<float>`. Each of these columns results in a separate `parquet_column_view`, which is used only for encoding.
- To give the user finer control over per-column options, the old metadata class is replaced by `table_input_metadata`.
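
The single-child-hierarchy conversion in the second bullet can be pictured with a small sketch. This is an illustration under assumptions, not the PR's implementation: `TypeNode`, `collect_leaf_paths` and `single_child_paths` are hypothetical names, and plain strings stand in for real column views.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical nested type description; not a libcudf class.
struct TypeNode {
  std::string name;
  std::vector<TypeNode> children;  // empty for leaves such as int or float
};

// Collects one root-to-leaf path per leaf. Each path stands for a view
// with a single child at every level, e.g. struct<int> or struct<float>.
void collect_leaf_paths(TypeNode const& node,
                        std::vector<std::string>& current,
                        std::vector<std::vector<std::string>>& out)
{
  current.push_back(node.name);
  if (node.children.empty()) {
    out.push_back(current);  // reached a leaf: record the whole chain
  } else {
    for (auto const& child : node.children) {
      collect_leaf_paths(child, current, out);
    }
  }
  current.pop_back();  // backtrack before visiting the next sibling
}

std::vector<std::vector<std::string>> single_child_paths(TypeNode const& root)
{
  std::vector<std::string> current;
  std::vector<std::vector<std::string>> out;
  collect_leaf_paths(root, current, out);
  assert(!out.empty());  // every tree has at least one leaf
  return out;
}
```

For a `struct<int, float>` input this yields two paths, `struct -> int` and `struct -> float`, matching the two columns described above; in this sketch each path would then back one encoding-only view.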

#### Breaking change: Input metadata
The new input metadata class `table_input_metadata` contains a vector of `column_in_metadata`, one for each column of the input table. Each `column_in_metadata` in turn contains a vector of `column_in_metadata`, one for each child of that column. The metadata can be constructed from the input table, and specific options can then be changed at each level.

For a table with a single struct column:
```
Struct<is_human:bool (non-nullable),
       Struct<weight:float,
              age:int
             > (nullable)
      > (non-nullable)
```
We can set the per-level names and, optionally, nullability as follows:
```c++
cudf::io::table_input_metadata metadata(table);
metadata.column_metadata[0].set_name("being").set_nullability(false);
metadata.column_metadata[0].child(0).set_name("human?").set_nullability(false);
metadata.column_metadata[0].child(1).set_name("particulars");
metadata.column_metadata[0].child(1).child(0).set_name("weight");
metadata.column_metadata[0].child(1).child(1).set_name("age");
```
#### Related issues
Closes #6989 
Closes #6816 
Strangely, there isn't an issue asking for struct writing support.

Authors:
  - Devavret Makkar (@devavret)
  - Kumar Aatish (@kaatish)

Approvers:
  - Vukasin Milovanovic (@vuule)
  - @nvdbaranec
  - GALI PREM SAGAR (@galipremsagar)

URL: #7461