
[QST] Should cudf support an incoming schema for parquet writing and possibly other formats #6816

Closed
hyperbolic2346 opened this issue Nov 20, 2020 · 11 comments · Fixed by #7461
Labels
cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. question Further information is requested Spark Functionality that helps Spark RAPIDS

Comments

@hyperbolic2346
Contributor

What is your question?
It seems like there are options sneaking in to the writing for parquet such as int96 timestamps and decimal precision that indicate that the data alone is not enough to know how to write a parquet file. Cudf will probably never support all the types that are in parquet and has no real need. The software using cudf could have knowledge about the desired parquet schema that cudf has no real interest in knowing and know way of guessing properly. In those cases, it seems that it would be nice for cudf to have opinionated defaults, but support an incoming schema of how to write the data. If no schema is given, it writes data as it wants, but if the schema exists then cudf would attempt to write the data in this way. Obviously, there can be invalid schemas passed in, but I think it's completely fine to error if the schema isn't possible. Further, I think there are times when coercion is desired and times when it isn't. Something like converting decimals to floats on write could be desired or an error. I think the default should be error, but a boolean should exist in the write options to allow coercion of types where possible.

What are the thoughts on something like this?
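To make the proposal concrete, here is a minimal standalone sketch of the "opinionated defaults, optional schema, coercion flag" behavior described above. All of the names (`schema_field`, `writer_options`, `resolve_type`) are hypothetical illustrations, not cudf API:

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical sketch only: these names are illustrative, not part of cudf.
struct schema_field {
  std::string name;
  std::string parquet_type;  // e.g. "INT96", "DECIMAL(10,2)", "MAP"
};

struct writer_options {
  std::optional<std::vector<schema_field>> schema;  // absent => opinionated defaults
  bool allow_coercion = false;                      // proposed default: error, don't coerce
};

// Decide how a column should be written: honor the schema if one is given,
// otherwise fall back to a default derived from the data alone.
std::string resolve_type(writer_options const& opts,
                         std::size_t column_index,
                         std::string const& default_type) {
  if (!opts.schema) return default_type;
  auto const& requested = opts.schema->at(column_index).parquet_type;
  if (requested == default_type) return requested;
  if (!opts.allow_coercion)
    throw std::invalid_argument("schema requests " + requested +
                                " but coercion is disabled");
  return requested;  // caller would cast/convert before encoding
}
```

With no schema the writer keeps its defaults; with a schema and `allow_coercion = false`, a mismatch errors rather than silently converting.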

@hyperbolic2346 hyperbolic2346 added question Further information is requested Needs Triage Need team to review and classify labels Nov 20, 2020
@kkraus14 kkraus14 added cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Nov 20, 2020
@kkraus14
Collaborator

cc @vuule @devavret who have been looking at parquet pretty heavily and may have some opinions here

@devavret
Contributor

Passing in a schema means you primarily want coercion. Apart from int96, are there other types that we want to write that don't have a cuDF equivalent? i.e. cannot be achieved with a cuDF::cast before writing.

@hyperbolic2346
Contributor Author

I'll let @revans2 and @jlowe chime in with specifics, but int96 and decimal precision come immediately to mind. I think they were talking about an issue with structs or lists as well.

@kkraus14
Collaborator

I think being able to specify that a column is a Map column is likely another situation, since libcudf currently has no Map support.

@jlowe
Member

jlowe commented Nov 20, 2020

int96 and decimal precision come immediately to mind. I think they were talking about an issue with structs or lists as well.

Spark needs to be able to specify the precision of the decimal being written (we can't assume the max precision for DECIMAL64 is appropriate, for example). And @kkraus14 is correct, Spark needs to specify that a Parquet map type is being written rather than a list-of-struct-of-pairs column as cudf sees it.
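The precision point matters because, per the Parquet format spec, a decimal's declared precision determines the smallest physical type that can hold it: INT32 up to precision 9, INT64 up to precision 18, and a byte array beyond that. A standalone sketch of that mapping (the function name is illustrative, not cudf API):

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Maps a decimal's declared precision to the smallest Parquet physical type
// that can represent it, following the Parquet format specification. This is
// why the writer needs the precision Spark intends, not just the fact that
// the cudf column happens to be DECIMAL64.
std::string decimal_physical_type(int precision) {
  if (precision < 1) throw std::invalid_argument("precision must be >= 1");
  if (precision <= 9) return "INT32";
  if (precision <= 18) return "INT64";
  return "FIXED_LEN_BYTE_ARRAY";
}
```

Assuming max precision for DECIMAL64 would force every such column to the 18-digit representation even when Spark declared, say, precision 5.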

@jlowe jlowe added the Spark Functionality that helps Spark RAPIDS label Nov 20, 2020
@revans2
Contributor

revans2 commented Nov 20, 2020

Yes, we also have a situation with storing the Binary type. We don't support reading it yet, but we need to be able to say whether a column is a non-nullable list of bytes or true binary.

@devavret
Contributor

we also have a situation with storing the Binary type

What's the cudf column equivalent of binary type? What do you use/plan to use to write this?

@jlowe
Member

jlowe commented Nov 20, 2020

What's the cudf column equivalent of binary type?

LIST of non-nullable INT8 or UINT8. It could technically be implemented as a STRING as well, since cudf strings don't rely on string terminator characters.

What do you use/plan to use to write this?

libcudf already supports a byte_cast function that returns a LIST of UINT8 which is one way these columns could be created. Another case would be loading a column from Parquet that's encoded as a binary column and preserving that column's schema when written to another Parquet file.
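A standalone illustration of the idea behind such a byte cast: reinterpret a fixed-width value as its raw bytes. This is only the per-value concept; cudf's `byte_cast` performs the analogous transformation row-wise on a whole column, yielding a LIST of UINT8. The helper name here is illustrative:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Decomposes a 32-bit value into its 4 constituent bytes in big-endian
// order (shifting rather than memcpy, so the result is deterministic
// regardless of host endianness).
std::vector<std::uint8_t> to_bytes_big_endian(std::uint32_t value) {
  std::vector<std::uint8_t> bytes(4);
  for (int i = 0; i < 4; ++i) {
    bytes[i] = static_cast<std::uint8_t>((value >> (8 * (3 - i))) & 0xFF);
  }
  return bytes;
}
```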

@hyperbolic2346
Contributor Author

I added a feature request for using Apache Arrow's schema for parquet writing. This isn't the only possible approach, but content specific to that option should be routed there. This discussion is more an over-arching "do we need this" than a "how would we do this", which the feature request covers.

@github-actions

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

@github-actions github-actions bot added the stale label Feb 16, 2021
@devavret
Contributor

As per the meeting today, the input metadata added in #7461 satisfies most of the spark use cases. I'll add a closes tag to this in that PR. Feel free to edit #7461 to remove the tag if you disagree.

rapids-bot bot pushed a commit that referenced this issue Mar 19, 2021
### Adds struct writing ability to parquet writer.
The internals of the writer have been changed in the following way:
- Previously we would construct `parquet_column_view` from the cudf columns and the input options and used it to construct schema. Now we construct schema directly from the input cudf columns and the input options.
- The constructed schema is used to generate views of cudf columns which have a single child hierarchy. e.g. One `struct<int, float>` column is converted into two columns: `struct<int>` and `struct<float>`. Each of these columns results in a separate `parquet_column_view` which is used only for encoding.
- In order to allow finer control to the user about the per-column options, the old metadata class is replaced by `table_input_metadata`.

#### Breaking change: Input metadata
The new input metadata class `table_input_metadata` contains a vector of `column_in_metadata`, one per top-level column; each `column_in_metadata` in turn contains a vector of `column_in_metadata`, one for each child of that column. It can be constructed from the input table, and then specific options can be changed at each level.

For a table with a single struct column 
```
Struct<is_human:bool (non-nullable),
       Struct<weight:float,
              age:int
             > (nullable)
      > (non-nullable)
```
We can set the per level names and optional nullability as follows:
```c++
cudf::io::table_input_metadata metadata(table);
metadata.column_metadata[0].set_name("being").set_nullability(false);
metadata.column_metadata[0].child(0).set_name("human?").set_nullability(false);
metadata.column_metadata[0].child(1).set_name("particulars");
metadata.column_metadata[0].child(1).child(0).set_name("weight");
metadata.column_metadata[0].child(1).child(1).set_name("age");
```
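The calls above chain because each setter returns a reference to the metadata object it was called on. A minimal standalone mock of that pattern (not the cudf implementation; `column_meta` and its fields are illustrative):

```cpp
#include <cassert>
#include <string>
#include <utility>

// Mock of the chained-setter (fluent) pattern: each setter mutates the
// object and returns *this by reference, so subsequent setters apply to
// the same metadata object.
struct column_meta {
  std::string name;
  bool nullable = true;

  column_meta& set_name(std::string n) {
    name = std::move(n);
    return *this;
  }
  column_meta& set_nullability(bool n) {
    nullable = n;
    return *this;
  }
};
```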
#### Related issues
Closes #6989 
Closes #6816 
Strangely, there isn't an issue asking for struct writing support.

Authors:
  - Devavret Makkar (@devavret)
  - Kumar Aatish (@kaatish)

Approvers:
  - Vukasin Milovanovic (@vuule)
  - @nvdbaranec
  - GALI PREM SAGAR (@galipremsagar)

URL: #7461