[FEA] Parquet support for reading binary and repeated binary as binary not strings #11044

Closed
tgravescs opened this issue Jun 3, 2022 · 4 comments · Fixed by #11160
Labels: cuIO, feature request, Spark

Comments

@tgravescs
Contributor

Is your feature request related to a problem? Please describe.
cuDF supports reading binary data from Parquet, but it automatically reads it as strings. See #10733, which fixed this.

We would like to be able to read binary as binary rather than as strings.

Describe the solution you'd like
We would like cuDF to officially support reading these binary and repeated binary types. Ideally it would support reading binary both as binary and as strings. We could pass in a read schema so the reader would know what to read it as.

tgravescs added the feature request, Needs Triage, cuIO, and Spark labels on Jun 3, 2022
@devavret
Contributor

What is the desired output column type? There isn't a binary type in libcudf right now. If you want it to be read as a List<uint8>, it's easy to convert the output string column into one because, apart from the metadata, the two are the same.
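To illustrate why the conversion is cheap, here is a minimal conceptual sketch in plain Python. It is not the libcudf API: the dictionary layout and the helper names (`make_string_column`, `as_list_of_uint8`, `rows`) are assumptions made up for this example, and null handling (the validity mask) is left out. It only models the idea that both column types consist of an offsets buffer plus a flat byte buffer, so that only the type tag differs.

```python
from array import array

# Conceptual model only; none of these names exist in libcudf.
def make_string_column(values):
    """Pack byte strings into an offsets buffer plus one flat byte buffer."""
    offsets = array("i", [0])
    data = bytearray()
    for v in values:
        data.extend(v)
        offsets.append(len(data))
    return {"type": "string", "offsets": offsets, "data": data}

def as_list_of_uint8(col):
    """Reuse the same buffers; only the type tag (the metadata) changes."""
    return {"type": "list<uint8>", "offsets": col["offsets"], "data": col["data"]}

def rows(col):
    """Materialize each row as a list of byte values, for inspection."""
    off, data = col["offsets"], col["data"]
    return [list(data[off[i]:off[i + 1]]) for i in range(len(off) - 1)]

strings = make_string_column([b"ab", b"", b"\xff\x00c"])
binary = as_list_of_uint8(strings)
print(binary["type"], rows(binary))
```

In this model no bytes are copied: `binary` and `strings` point at the same offsets and data buffers.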

@tgravescs
Contributor Author

tgravescs commented Jun 16, 2022

Sorry, I did not see your question. Yes, binary would just be an array of bytes in Spark, so List<uint8> would make sense.

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@sameerz
Contributor

sameerz commented Jul 25, 2022

Still needed.

rapids-bot bot pushed a commit that referenced this issue Jul 29, 2022
There are a couple of issues (#11044 and #10778) revolving around adding support for binary writes and reads to Parquet. The desire is to be able to write strings and lists of int8 values as binary. This PR adds support for strings to be written as binary and for binary data to be read as either binary or strings. I have left the default for binary data as reading it as a string, to prevent any surprises upon upgrade.

Single-depth list columns of int8 and uint8 values are not written as binary with this change. That will be another PR after discussions about the possible impact of the change.
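The read-side behaviour described above (a per-column choice between binary and string, with string as the default) can be sketched in plain Python. This is a hypothetical illustration and not the cuDF or libcudf API: `decode_columns`, the `schema` dictionary, and the `"binary"`/`"string"` tags are names invented here, and the lossy UTF-8 decoding is a choice made for this example only.

```python
# Hypothetical sketch; decode_columns and its schema argument are
# illustrative names, not part of cuDF or libcudf.
def decode_columns(raw_columns, schema=None):
    """Decode raw Parquet-style byte values column by column.

    Columns tagged "binary" in `schema` become lists of byte values;
    every other column falls back to the default, "string".
    """
    schema = schema or {}
    out = {}
    for name, values in raw_columns.items():
        if schema.get(name, "string") == "binary":
            out[name] = [list(v) for v in values]
        else:
            # Lossy decode, chosen for this example: invalid UTF-8
            # bytes are replaced with U+FFFD.
            out[name] = [v.decode("utf-8", errors="replace") for v in values]
    return out

raw = {"name": [b"spark"], "payload": [b"\xff\x00"]}
default_read = decode_columns(raw)
binary_read = decode_columns(raw, schema={"payload": "binary"})
print(default_read)
print(binary_read)
```

Under these assumptions, leaving the schema out reproduces the old string behaviour, while tagging `payload` as binary keeps its bytes intact.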

Closes #11044 
Issue #10778

Authors:
  - Mike Wilson (https://github.com/hyperbolic2346)

Approvers:
  - Karthikeyan (https://github.com/karthikeyann)
  - MithunR (https://github.com/mythrocks)
  - Yunsong Wang (https://github.com/PointKernel)
  - Vukasin Milovanovic (https://github.com/vuule)
  - https://github.com/nvdbaranec
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #11160