[QST] Should cudf support an incoming schema for parquet writing and possibly other formats #6816
Comments
Passing in a schema means you primarily want coercion. Apart from int96, are there other types that we want to write that don't have a cuDF equivalent, i.e., that cannot be achieved with a `cast`?
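For cases that do have a cudf equivalent, the coercion can happen before the write with an explicit cast. A minimal sketch, assuming a hypothetical `input` table whose first column should be written as FLOAT64 (`cudf::cast` is the real libcudf API; the surrounding function and column choice are illustrative):

```c++
#include <cudf/column/column.hpp>
#include <cudf/table/table_view.hpp>
#include <cudf/types.hpp>
#include <cudf/unary.hpp>

#include <memory>

// Coerce column 0 to FLOAT64 before handing the table to the parquet writer.
// This only covers conversions that have a cudf-side equivalent, which is
// exactly the limitation the comment above is asking about.
std::unique_ptr<cudf::column> coerce_first_column(cudf::table_view const& input)
{
  return cudf::cast(input.column(0), cudf::data_type{cudf::type_id::FLOAT64});
}
```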
I think being able to specify that a column is a `map` would be valuable.
Spark needs to be able to specify the precision of the decimal being written (we can't assume the max precision for DECIMAL64 is appropriate, for example). And @kkraus14 is correct: Spark needs to specify that a Parquet map type is being written rather than the list-of-struct-of-pairs column that cudf sees.
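For reference, the per-column metadata that eventually grew out of this thread exposes hooks along exactly these lines. A sketch using the `table_input_metadata` class from the PR quoted further down; the `set_decimal_precision` and `set_list_column_as_map` setters match later cudf versions, so treat their exact names and availability as version-dependent:

```c++
#include <cudf/io/types.hpp>

// Tag columns with the writer hints Spark needs. Assumes column 0 holds
// DECIMAL64 data and column 1 is a list<struct<key, value>> column.
void tag_spark_columns(cudf::io::table_input_metadata& metadata)
{
  // Write the decimal column with an explicit precision instead of
  // assuming the maximum precision for DECIMAL64.
  metadata.column_metadata[0].set_decimal_precision(10);

  // Write the list-of-struct-of-pairs column as a parquet MAP type.
  metadata.column_metadata[1].set_list_column_as_map();
}
```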
Yes, we also have a situation with storing the Binary type. We don't support reading it yet, but we need to be able to say whether a column is a non-nullable list of bytes or true binary.
What's the cudf column equivalent of the binary type? What do you use/plan to use to write this?
LIST of non-nullable INT8 or UINT8. It could technically be implemented as a STRING as well, since cudf strings don't rely on string terminator characters.
libcudf already supports a …
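That distinction could be carried the same way as the other hints: the column stays a LIST of UINT8 in cudf, and a per-column flag tells the writer to emit parquet BINARY instead of a list of integers. A sketch; `set_output_as_binary` matches the hook later added to `column_in_metadata`, so take it as illustrative for the version at hand:

```c++
#include <cudf/io/types.hpp>

// Mark a LIST<UINT8> column so the writer emits parquet BINARY rather than
// a list of integers. Without the flag, the same column round-trips as a
// plain list of bytes.
void tag_binary_column(cudf::io::table_input_metadata& metadata)
{
  metadata.column_metadata[0].set_output_as_binary(true);
}
```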
I added a feature request for Apache Arrow's schema for parquet writing. This isn't the only possible approach, but content specific to that option should be routed there. This discussion is more an over-arching "do we need this" than a "how would we do this", which the feature request covers.
### Adds struct writing ability to parquet writer

The internals of the writer have been changed in the following way:

- Previously we would construct `parquet_column_view` from the cudf columns and the input options, and use it to construct the schema. Now we construct the schema directly from the input cudf columns and the input options.
- The constructed schema is used to generate views of cudf columns which have a single-child hierarchy, e.g. one `struct<int, float>` column is converted into two columns: `struct<int>` and `struct<float>`. Each of these columns results in a separate `parquet_column_view` which is used only for encoding.
- To give the user finer control over per-column options, the old metadata class is replaced by `table_input_metadata`.

#### Breaking change: Input metadata

The new input metadata class `table_input_metadata` contains a vector of `column_in_metadata`, and each `column_in_metadata` in turn contains a vector of `column_in_metadata`, one for each child of the input column. It can be constructed from the input table, and then specific options can be changed at each level. For a table with a single struct column

```
Struct<is_human:bool (non-nullable),
       Struct<weight:float,
              age:int
             > (nullable)
      > (non-nullable)
```

we can set the per-level names and optional nullability as follows:

```c++
cudf::io::table_input_metadata metadata(table);
metadata.column_metadata[0].set_name("being").set_nullability(false);
metadata.column_metadata[0].child(0).set_name("human?").set_nullability(false);
metadata.column_metadata[0].child(1).set_name("particulars");
metadata.column_metadata[0].child(1).child(0).set_name("weight");
metadata.column_metadata[0].child(1).child(1).set_name("age");
```

#### Related issues

Closes #6989
Closes #6816
Strangely, there isn't an issue asking for struct writing support.

Authors:
- Devavret Makkar (@devavret)
- Kumar Aatish (@kaatish)

Approvers:
- Vukasin Milovanovic (@vuule)
- @nvdbaranec
- GALI PREM SAGAR (@galipremsagar)

URL: #7461
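As a follow-up, a sketch of handing that metadata to the writer. The builder's `metadata()` signature has changed across cudf releases (a pointer around the time of this PR, by-value later), so adjust for the version in use:

```c++
#include <cudf/io/parquet.hpp>
#include <cudf/table/table_view.hpp>

// Write `table` to a parquet file using the per-column names and
// nullability set on `metadata` in the example above.
void write_with_metadata(cudf::table_view const& table,
                         cudf::io::table_input_metadata& metadata)
{
  auto options = cudf::io::parquet_writer_options::builder(
                     cudf::io::sink_info{"out.parquet"}, table)
                     .metadata(&metadata)
                     .build();
  cudf::io::write_parquet(options);
}
```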
What is your question?
It seems like there are options sneaking into parquet writing, such as int96 timestamps and decimal precision, that indicate the data alone is not enough to know how to write a parquet file. cudf will probably never support all the types that are in parquet, and has no real need to. The software using cudf could have knowledge about the desired parquet schema that cudf has no real interest in knowing and no way of guessing properly. In those cases, it would be nice for cudf to have opinionated defaults but support an incoming schema describing how the data should be written. If no schema is given, cudf writes the data as it wants; if a schema is given, cudf attempts to write the data that way. Obviously, an invalid schema can be passed in, but I think it's completely fine to error if the schema isn't possible. Further, I think there are times when coercion is desired and times when it isn't. Something like converting decimals to floats on write could be desired or an error. I think the default should be to error, but a boolean should exist in the write options to allow coercion of types where possible.
What are the thoughts on something like this?
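To make the proposal concrete, a purely hypothetical sketch of what such write options could look like. None of these names exist in cudf; they only illustrate the "opinionated defaults, optional schema, error-by-default coercion flag" shape described above:

```c++
#include <cudf/io/parquet.hpp>
#include <cudf/table/table_view.hpp>

// Hypothetical: the desired on-disk parquet types, one entry per column.
struct parquet_schema;

void write_with_schema(cudf::io::sink_info const& sink,
                       cudf::table_view const& table,
                       parquet_schema const& desired)
{
  auto options = cudf::io::parquet_writer_options::builder(sink, table)
                     .schema(desired)             // hypothetical setter
                     .allow_type_coercion(false)  // hypothetical: error on lossy writes by default
                     .build();
  cudf::io::write_parquet(options);
}
```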