[FEA] Parquet support for reading binary and repeated binary as binary not strings #11044

tgravescs · 2022-06-03T19:27:09Z

Is your feature request related to a problem? Please describe.
cuDF supports reading binary columns from Parquet, but it automatically reads them as strings. See #10733, which fixed this.

We would like to request being able to read binary as binary and not strings.

Describe the solution you'd like
We would like cudf to officially support reading these binary and repeated binary types. Ideally we could support reading binary both as binary and as strings. We could pass in a read schema so the reader would know what type to read each column as.

devavret · 2022-06-13T11:12:34Z

What is the desired output column type? There isn't a binary type in libcudf right now. If you want it to be read as a List<uint8> then it's easy to convert the output string column into it because apart from the metadata, they're the same.
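The point about strings and List<uint8> sharing a layout can be illustrated with a small sketch. This is plain Python, not libcudf: `VarLenColumn`, `build`, `as_strings`, and `as_uint8_lists` are hypothetical names made up for illustration, and the sketch ignores null masks. It assumes an Arrow-style layout where row i occupies `data[offsets[i]:offsets[i+1]]`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VarLenColumn:
    """Hypothetical variable-length column: offsets plus one flat byte buffer."""
    offsets: List[int]  # len(rows) + 1 entries
    data: bytes         # all rows concatenated


def build(rows: List[bytes]) -> VarLenColumn:
    # Each offset is the running total of the row lengths seen so far.
    offsets = [0]
    for row in rows:
        offsets.append(offsets[-1] + len(row))
    return VarLenColumn(offsets, b"".join(rows))


def as_strings(col: VarLenColumn) -> List[str]:
    # "String" view: interpret each slice as UTF-8 text.
    return [col.data[s:e].decode("utf-8")
            for s, e in zip(col.offsets, col.offsets[1:])]


def as_uint8_lists(col: VarLenColumn) -> List[List[int]]:
    # "List<uint8>" view: same offsets, same bytes, different interpretation.
    return [list(col.data[s:e])
            for s, e in zip(col.offsets, col.offsets[1:])]


col = build([b"hi", b"", b"\x00\x01"])
```

Both views read the same `offsets` and `data` without copying or rearranging them, which is why only the type metadata would need to change.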

tgravescs · 2022-06-16T18:58:40Z

Sorry, I did not see your question. Yes, binary would just be an array of bytes in Spark, so List<uint8> would make sense.

github-actions · 2022-07-16T20:03:08Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

sameerz · 2022-07-25T03:52:17Z

Still needed.

There are a couple of issues (#11044 and #10778) revolving around adding support for binary writes and reads to parquet. The desire is to be able to write strings and lists of int8 values as binary. This PR adds support for strings to be written as binary and for binary data to be read as binary or strings. I have left the default for binary data to read as a string to prevent any surprises upon upgrade. Single-depth list columns of int8 and uint8 values are not written as binary with this change. That will be another PR after discussions about the possible impact of the change.

Closes #11044
Issue #10778

Authors:
- Mike Wilson (https://github.com/hyperbolic2346)

Approvers:
- Karthikeyan (https://github.com/karthikeyann)
- MithunR (https://github.com/mythrocks)
- Yunsong Wang (https://github.com/PointKernel)
- Vukasin Milovanovic (https://github.com/vuule)
- https://github.com/nvdbaranec
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #11160
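The read behavior described in the PR summary (binary data read as strings by default, with binary available on request) can be sketched roughly as follows. This is not the libcudf or cudf reader API: `decode_binary_columns` and its `as_string` argument are hypothetical, and replacing invalid UTF-8 bytes is an assumption made only to keep the sketch runnable.

```python
from typing import Dict, List, Optional, Union

Value = Union[str, List[int]]


def decode_binary_columns(
    raw: Dict[str, List[bytes]],
    as_string: Optional[Dict[str, bool]] = None,
) -> Dict[str, List[Value]]:
    """Hypothetical per-column choice between string and binary output."""
    as_string = as_string or {}
    out: Dict[str, List[Value]] = {}
    for name, rows in raw.items():
        # Columns not mentioned default to string, mirroring the
        # "no surprises upon upgrade" default described above.
        if as_string.get(name, True):
            # Assumption: invalid UTF-8 is replaced rather than raising.
            out[name] = [r.decode("utf-8", errors="replace") for r in rows]
        else:
            # Binary output: each row becomes a list of byte values.
            out[name] = [list(r) for r in rows]
    return out


raw = {"name": [b"abc"], "payload": [b"\x01\xff"]}
result = decode_binary_columns(raw, {"payload": False})
```

Here `name` comes back as text while `payload` keeps its raw byte values, including `0xff`, which is not valid UTF-8 on its own.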

tgravescs added the feature request, Needs Triage, cuIO, and Spark labels Jun 3, 2022

tgravescs mentioned this issue Jun 3, 2022

[FEA] Support reading binary data types from Parquet as binary (not strings) NVIDIA/spark-rapids#5416

Closed

NVnavkumar mentioned this issue Jun 14, 2022

Enable the spark.sql.parquet.binaryAsString=true configuration option on the GPU NVIDIA/spark-rapids#5830

Merged

vuule assigned hyperbolic2346 Jun 16, 2022

vuule removed the Needs Triage label Jun 16, 2022

hyperbolic2346 mentioned this issue Jun 28, 2022

Adding binary read/write as options for parquet #11160

Merged

github-actions bot added the inactive-30d label Jul 16, 2022

github-actions bot removed the inactive-30d label Jul 25, 2022

rapids-bot bot closed this as completed in #11160 Jul 29, 2022
