
[QST] Should cudf support an incoming schema for parquet writing and possibly other formats #6816

Closed
hyperbolic2346 opened this issue Nov 20, 2020 · 11 comments · Fixed by #7461
Labels
cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. question Further information is requested Spark Functionality that helps Spark RAPIDS

Comments

@hyperbolic2346
Contributor

What is your question?
It seems like there are options sneaking in to the writing for parquet such as int96 timestamps and decimal precision that indicate that the data alone is not enough to know how to write a parquet file. Cudf will probably never support all the types that are in parquet and has no real need. The software using cudf could have knowledge about the desired parquet schema that cudf has no real interest in knowing and know way of guessing properly. In those cases, it seems that it would be nice for cudf to have opinionated defaults, but support an incoming schema of how to write the data. If no schema is given, it writes data as it wants, but if the schema exists then cudf would attempt to write the data in this way. Obviously, there can be invalid schemas passed in, but I think it's completely fine to error if the schema isn't possible. Further, I think there are times when coercion is desired and times when it isn't. Something like converting decimals to floats on write could be desired or an error. I think the default should be error, but a boolean should exist in the write options to allow coercion of types where possible.

What are the thoughts on something like this?
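To make the proposal concrete, here is a minimal standalone sketch of the "opinionated defaults, optional schema, coercion flag" behavior described above. All of the names (`schema_field`, `writer_options`, `resolve_type`) are hypothetical illustrations, not cudf API:

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical sketch only: these names are illustrative, not part of cudf.
struct schema_field {
  std::string name;
  std::string parquet_type;  // e.g. "INT96", "DECIMAL(10,2)", "MAP"
};

struct writer_options {
  std::optional<std::vector<schema_field>> schema;  // absent => opinionated defaults
  bool allow_coercion = false;                      // proposed default: error, don't coerce
};

// Decide how a column should be written: honor the schema if one is given,
// otherwise fall back to a default derived from the data alone.
std::string resolve_type(writer_options const& opts,
                         std::size_t column_index,
                         std::string const& default_type) {
  if (!opts.schema) return default_type;
  auto const& requested = opts.schema->at(column_index).parquet_type;
  if (requested == default_type) return requested;
  if (!opts.allow_coercion)
    throw std::invalid_argument("schema requests " + requested +
                                " but coercion is disabled");
  return requested;  // caller would cast/convert before encoding
}
```

With no schema the writer keeps its defaults; with a schema and `allow_coercion = false`, a mismatch errors rather than silently converting.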

@hyperbolic2346 hyperbolic2346 added question Further information is requested Needs Triage Need team to review and classify labels Nov 20, 2020
@kkraus14 kkraus14 added cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Nov 20, 2020
@kkraus14
Collaborator

cc @vuule @devavret who have been looking at parquet pretty heavily and may have some opinions here

@devavret
Contributor

Passing in a schema means you primarily want coercion. Apart from int96, are there other types that we want to write that don't have a cuDF equivalent? i.e. cannot be achieved with a cuDF::cast before writing.

@hyperbolic2346
Contributor Author

I'll let @revans2 and @jlowe chime in with specifics, but int96 and decimal precision come immediately to mind. I think they were talking about an issue with structs or lists as well.

@kkraus14
Collaborator

I think being able to specify that a column is a Map column is likely another situation, since libcudf currently has no Map support.

@jlowe
Member

jlowe commented Nov 20, 2020

int96 and decimal precision come immediately to mind. I think they were talking about an issue with structs or lists as well.

Spark needs to be able to specify the precision of the decimal being written (we can't assume the max precision for DECIMAL64 is appropriate, for example). And @kkraus14 is correct, Spark needs to specify that a Parquet map type is being written rather than a list-of-struct-of-pairs column as cudf sees it.
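The precision point matters because, per the Parquet format spec, a decimal's declared precision determines the smallest physical type that can hold it: INT32 up to precision 9, INT64 up to precision 18, and a byte array beyond that. A standalone sketch of that mapping (the function name is illustrative, not cudf API):

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Maps a decimal's declared precision to the smallest Parquet physical type
// that can represent it, following the Parquet format specification. This is
// why the writer needs the precision Spark intends, not just the fact that
// the cudf column happens to be DECIMAL64.
std::string decimal_physical_type(int precision) {
  if (precision < 1) throw std::invalid_argument("precision must be >= 1");
  if (precision <= 9) return "INT32";
  if (precision <= 18) return "INT64";
  return "FIXED_LEN_BYTE_ARRAY";
}
```

Assuming max precision for DECIMAL64 would force every such column to the 18-digit representation even when Spark declared, say, precision 5.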

@jlowe jlowe added the Spark Functionality that helps Spark RAPIDS label Nov 20, 2020
@revans2
Contributor

revans2 commented Nov 20, 2020

Yes, we also have a situation with storing the Binary type. We don't support reading it yet, but we need to be able to say whether a column is a non-nullable list of bytes or true binary.

@devavret
Contributor

we also have a situation with storing the Binary type

What's the cudf column equivalent of binary type? What do you use/plan to use to write this?

@jlowe
Member

jlowe commented Nov 20, 2020

What's the cudf column equivalent of binary type?

LIST of non-nullable INT8 or UINT8. It could technically be implemented as a STRING as well, since cudf strings don't rely on string terminator characters.

What do you use/plan to use to write this?

libcudf already supports a byte_cast function that returns a LIST of UINT8 which is one way these columns could be created. Another case would be loading a column from Parquet that's encoded as a binary column and preserving that column's schema when written to another Parquet file.
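A standalone illustration of the idea behind such a byte cast: reinterpret a fixed-width value as its raw bytes. This is only the per-value concept; cudf's `byte_cast` performs the analogous transformation row-wise on a whole column, yielding a LIST of UINT8. The helper name here is illustrative:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Decomposes a 32-bit value into its 4 constituent bytes in big-endian
// order (shifting rather than memcpy, so the result is deterministic
// regardless of host endianness).
std::vector<std::uint8_t> to_bytes_big_endian(std::uint32_t value) {
  std::vector<std::uint8_t> bytes(4);
  for (int i = 0; i < 4; ++i) {
    bytes[i] = static_cast<std::uint8_t>((value >> (8 * (3 - i))) & 0xFF);
  }
  return bytes;
}
```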

@hyperbolic2346
Contributor Author

I added a feature request for using Apache Arrow's schema for parquet writing. This isn't the only possible approach, but content specific to that option should be routed there. This discussion is more an over-arching "do we need this" than a "how would we do this", which the feature request covers.

@github-actions

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

@github-actions github-actions bot added the stale label Feb 16, 2021
@devavret
Contributor

As per the meeting today, the input metadata added in #7461 satisfies most of the spark use cases. I'll add a closes tag to this in that PR. Feel free to edit #7461 to remove the tag if you disagree.

rapids-bot bot pushed a commit that referenced this issue Mar 19, 2021
### Adds struct writing ability to parquet writer.
The internals of the writer have been changed in the following way:
- Previously we would construct `parquet_column_view` from the cudf columns and the input options and used it to construct schema. Now we construct schema directly from the input cudf columns and the input options.
- The constructed schema is used to generate views of cudf columns which have a single child hierarchy. e.g. One `struct<int, float>` column is converted into two columns: `struct<int>` and `struct<float>`. Each of these columns results in a separate `parquet_column_view` which is used only for encoding.
- In order to allow finer control to the user about the per-column options, the old metadata class is replaced by `table_input_metadata`.

#### Breaking change: Input metadata
The new input metadata class `table_input_metadata` contains a vector of `column_in_metadata`, one per top-level column; each `column_in_metadata` in turn contains a vector of `column_in_metadata`, one for each child of that column. It can be constructed from the input table, and then specific options can be changed at each level.

For a table with a single struct column 
```
Struct<is_human:bool (non-nullable),
       Struct<weight:float,
              age:int
             > (nullable)
      > (non-nullable)
```
We can set the per level names and optional nullability as follows:
```c++
cudf::io::table_input_metadata metadata(table);
metadata.column_metadata[0].set_name("being").set_nullability(false);
metadata.column_metadata[0].child(0).set_name("human?").set_nullability(false);
metadata.column_metadata[0].child(1).set_name("particulars");
metadata.column_metadata[0].child(1).child(0).set_name("weight");
metadata.column_metadata[0].child(1).child(1).set_name("age");
```
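The calls above chain because each setter returns a reference to the metadata object it was called on. A minimal standalone mock of that pattern (not the cudf implementation; `column_meta` and its fields are illustrative):

```cpp
#include <cassert>
#include <string>
#include <utility>

// Mock of the chained-setter (fluent) pattern: each setter mutates the
// object and returns *this by reference, so subsequent setters apply to
// the same metadata object.
struct column_meta {
  std::string name;
  bool nullable = true;

  column_meta& set_name(std::string n) {
    name = std::move(n);
    return *this;
  }
  column_meta& set_nullability(bool n) {
    nullable = n;
    return *this;
  }
};
```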
#### Related issues
Closes #6989 
Closes #6816 
Strangely, there isn't an issue asking for struct writing support.

Authors:
  - Devavret Makkar (@devavret)
  - Kumar Aatish (@kaatish)

Approvers:
  - Vukasin Milovanovic (@vuule)
  - @nvdbaranec
  - GALI PREM SAGAR (@galipremsagar)

URL: #7461