[FEA] Parquet support for reading binary and repeated binary as binary not strings #11044
Comments
What is the desired output column type? There isn't a binary type in libcudf right now. If you want it to be read as a |
Sorry did not see your question, yes binary would just be an array of bytes in Spark so |
This issue has been labeled |
Still needed. |
There are a couple of issues (#11044 and #10778) revolving around adding support for binary writes and reads to parquet. The desire is to be able to write strings and lists of int8 values as binary. This PR adds support for strings to be written as binary and for binary data to be read as binary or strings. I have left the default for binary data to read as a string to prevent any surprises upon upgrade.

Single-depth list columns of int8 and uint8 values are not written as binary with this change. That will be another PR after discussions about the possible impact of the change.

Closes #11044
Issue #10778

Authors:
- Mike Wilson (https://github.com/hyperbolic2346)

Approvers:
- Karthikeyan (https://github.com/karthikeyann)
- MithunR (https://github.com/mythrocks)
- Yunsong Wang (https://github.com/PointKernel)
- Vukasin Milovanovic (https://github.com/vuule)
- https://github.com/nvdbaranec
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #11160
Is your feature request related to a problem? Please describe.
CUDF supports reading binary from parquet, but it automatically reads it as strings. See #10733, which fixed this.
We would like to request being able to read binary as binary and not strings.
Describe the solution you'd like
We would like cudf to officially support reading these binary and repeated binary types. Ideally we could support reading binary both as binary and as strings. We could pass in a read schema so the reader would know what to read it as.
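To make the request concrete, here is a minimal pure-Python sketch of a reader that takes a per-column read choice and defaults to strings. It is not the cudf or libcudf API: `read_columns`, the `read_as_binary` parameter, and the sample column names are all hypothetical, and the raw `bytes` values only stand in for Parquet binary data. Decoding with replacement characters is an assumption of the sketch, chosen so the difference between the two read modes is visible.

```python
from typing import Dict, List, Optional, Set, Union

# Hypothetical stand-in for raw Parquet binary column data (None = null).
RawColumns = Dict[str, List[Optional[bytes]]]
Decoded = Dict[str, List[Optional[Union[str, bytes]]]]


def read_columns(raw: RawColumns,
                 read_as_binary: Optional[Set[str]] = None) -> Decoded:
    """Return columns as strings by default, or as bytes when requested.

    `read_as_binary` plays the role of the read schema: it names the
    columns that should be returned as raw bytes instead of strings.
    """
    binary_cols = read_as_binary or set()
    out: Decoded = {}
    for name, values in raw.items():
        if name in binary_cols:
            # Keep the bytes untouched.
            out[name] = list(values)
        else:
            # Default: interpret the bytes as UTF-8 text (assumption:
            # invalid sequences become U+FFFD replacement characters).
            out[name] = [
                None if v is None else v.decode("utf-8", errors="replace")
                for v in values
            ]
    return out


raw = {
    "name": [b"alice", None],
    "payload": [b"\xff\x00\x01", b"ok"],  # first value is not valid UTF-8
}

default = read_columns(raw)                            # everything as strings
typed = read_columns(raw, read_as_binary={"payload"})  # payload stays bytes
```

With the default, the `payload` bytes cannot be recovered from the decoded string in this sketch, while the schema-driven read returns them unchanged. Keeping strings as the default means callers that pass no schema see no change in behavior.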